14.7.8 运维自动化与巡检脚本
围绕 ProxySQL 的运维自动化与巡检脚本目标是实现“可持续、可复用、可追溯”的日常运维闭环,核心包括定时采集、配置核对、异常处置与报告输出。建议以统一目录与标准输出规范组织脚本,便于与监控平台、告警系统、工单系统对接。
一、原理草图:巡检数据流与闭环
二、脚本目录与依赖安装(示例)
- 依赖:mysql-client用于连接 Admin 接口;jq用于 JSON 处理
# Debian/Ubuntu
sudo apt-get update
sudo apt-get install -y mysql-client jq
# RHEL/CentOS
sudo yum install -y mysql jq
目录结构建议:
/opt/proxysql/ops/
├── bin/
│ ├── health_check.sh
│ ├── pool_status.sh
│ ├── rule_diff.sh
│ ├── user_audit.sh
│ └── backup_config.sh
├── conf/
│ ├── env.conf
│ └── metrics.conf
├── logs/
└── reports/
三、统一环境配置(示例文件)
/opt/proxysql/ops/conf/env.conf
PROXYSQL_ADMIN_HOST=127.0.0.1
PROXYSQL_ADMIN_PORT=6032
PROXYSQL_ADMIN_USER=monitor
PROXYSQL_ADMIN_PASS='StrongPass!'
REPORT_DIR=/opt/proxysql/ops/reports
LOG_DIR=/opt/proxysql/ops/logs
四、关键巡检脚本与命令解释
1)健康检查脚本(输出 JSON)
/opt/proxysql/ops/bin/health_check.sh
#!/usr/bin/env bash
set -euo pipefail
source /opt/proxysql/ops/conf/env.conf
ts=$(date +%F_%H%M%S)
outfile="${REPORT_DIR}/health_${ts}.json"
SQL="
SELECT
(SELECT Variable_Value FROM stats_mysql_global WHERE Variable_Name='ProxySQL_Uptime') AS uptime,
(SELECT Variable_Value FROM stats_mysql_global WHERE Variable_Name='Questions') AS questions,
(SELECT SUM(ConnUsed) FROM stats_mysql_connection_pool) AS conn_used,
(SELECT SUM(ConnFree) FROM stats_mysql_connection_pool) AS conn_free,
(SELECT SUM(MaxConnUsed) FROM stats_mysql_connection_pool) AS max_conn_used;
"
mysql -h"${PROXYSQL_ADMIN_HOST}" -P"${PROXYSQL_ADMIN_PORT}" \
-u"${PROXYSQL_ADMIN_USER}" -p"${PROXYSQL_ADMIN_PASS}" \
-Nse "$SQL" | awk '{
printf("{\"uptime\":%s,\"questions\":%s,\"conn_used\":%s,\"conn_free\":%s,\"max_conn_used\":%s}\n",$1,$2,$3,$4,$5)
}' | tee "$outfile"
# 退出码说明:0成功;非0失败(供监控系统判断)
命令解释
- mysql -Nse:-N去掉表头,-s精简输出,-e执行SQL
- awk:将输出映射为 JSON
2)连接池与后端健康
/opt/proxysql/ops/bin/pool_status.sh
#!/usr/bin/env bash
set -euo pipefail
source /opt/proxysql/ops/conf/env.conf
SQL="
SELECT hostgroup, srv_host, status, ConnUsed, ConnFree, ConnErr, MaxConnUsed
FROM stats_mysql_connection_pool;
"
mysql -h"${PROXYSQL_ADMIN_HOST}" -P"${PROXYSQL_ADMIN_PORT}" \
-u"${PROXYSQL_ADMIN_USER}" -p"${PROXYSQL_ADMIN_PASS}" \
-e "$SQL" | tee "${REPORT_DIR}/pool_status_$(date +%F_%H%M%S).txt"
解释
- status:ONLINE/OFFLINE_SOFT/OFFLINE_HARD/SHUNNED
- ConnErr大于0需排查后端连接异常
3)规则一致性校验(与 Git 版本对比)
/opt/proxysql/ops/bin/rule_diff.sh
#!/usr/bin/env bash
set -euo pipefail
source /opt/proxysql/ops/conf/env.conf
# 导出运行态规则
mysql -h"${PROXYSQL_ADMIN_HOST}" -P"${PROXYSQL_ADMIN_PORT}" \
-u"${PROXYSQL_ADMIN_USER}" -p"${PROXYSQL_ADMIN_PASS}" \
-e "SELECT rule_id, active, match_digest, destination_hostgroup, apply FROM mysql_query_rules ORDER BY rule_id;" \
> /tmp/runtime_rules.txt
# 对比版本库中的基线规则
diff -u /opt/proxysql/ops/conf/rules_baseline.txt /tmp/runtime_rules.txt \
| tee "${REPORT_DIR}/rule_diff_$(date +%F_%H%M%S).diff"
四、配置备份与热加载检查
/opt/proxysql/ops/bin/backup_config.sh
#!/usr/bin/env bash
set -euo pipefail
source /opt/proxysql/ops/conf/env.conf
ts=$(date +%F_%H%M%S)
mysql -h"${PROXYSQL_ADMIN_HOST}" -P"${PROXYSQL_ADMIN_PORT}" \
-u"${PROXYSQL_ADMIN_USER}" -p"${PROXYSQL_ADMIN_PASS}" \
-e "SELECT * FROM mysql_users;" > "${REPORT_DIR}/mysql_users_${ts}.txt"
mysql -h"${PROXYSQL_ADMIN_HOST}" -P"${PROXYSQL_ADMIN_PORT}" \
-u"${PROXYSQL_ADMIN_USER}" -p"${PROXYSQL_ADMIN_PASS}" \
-e "SAVE MYSQL USERS TO DISK;" \
&& echo "SAVE MYSQL USERS TO DISK OK" \
|| echo "SAVE MYSQL USERS TO DISK FAIL"
五、定时执行与输出示例
1)Cron 定时任务
crontab -e
# 每5分钟健康检查
*/5 * * * * /opt/proxysql/ops/bin/health_check.sh >> /opt/proxysql/ops/logs/health.log 2>&1
# 每小时连接池巡检
0 * * * * /opt/proxysql/ops/bin/pool_status.sh >> /opt/proxysql/ops/logs/pool.log 2>&1
# 每天规则对比
30 1 * * * /opt/proxysql/ops/bin/rule_diff.sh >> /opt/proxysql/ops/logs/rule_diff.log 2>&1
2)预期输出(健康检查 JSON)
{"uptime":86400,"questions":12000000,"conn_used":120,"conn_free":300,"max_conn_used":240}
六、排错与故障定位(常见场景与命令)
1)无法连接 Admin 接口
- 检查端口与监听:
ss -lntp | grep 6032
- 检查账号权限:
mysql -h127.0.0.1 -P6032 -uadmin -p -e "SELECT * FROM mysql_users;"
2)连接池异常(ConnErr持续增长)
- 查看后端状态:
mysql -h127.0.0.1 -P6032 -uadmin -p \
-e "SELECT hostgroup, srv_host, status, ConnErr FROM stats_mysql_connection_pool;"
- 处置示例:将异常后端下线后再恢复
UPDATE mysql_servers SET status='OFFLINE_SOFT' WHERE hostname='10.0.0.12';
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
3)规则加载失败
- 检查运行态与持久化差异:
SELECT COUNT(*) FROM mysql_query_rules;
SELECT COUNT(*) FROM runtime_mysql_query_rules;
七、自动化告警与自愈钩子示例(伪代码)
# 判断连接错误阈值
if [ "$ConnErr" -gt 100 ]; then
echo "ALERT: ConnErr too high"
# 自动下线异常节点
mysql -h127.0.0.1 -P6032 -uadmin -p \
-e "UPDATE mysql_servers SET status='OFFLINE_SOFT' WHERE hostname='10.0.0.12';
LOAD MYSQL SERVERS TO RUNTIME;"
fi
八、练习
1. 编写脚本统计每个 hostgroup 的 ConnUsed 总和并输出 JSON。
2. 修改 rule_diff.sh,在差异不为空时向 webhook 推送告警。
3. 模拟后端节点异常,观察 stats_mysql_connection_pool 的状态变化,并记录处置步骤与恢复时间。
4. 将 health_check.sh 的输出接入 Prometheus(文本格式)并实现可视化。