14.7.8 运维自动化与巡检脚本

围绕 ProxySQL 的运维自动化与巡检脚本目标是实现“可持续、可复用、可追溯”的日常运维闭环,核心包括定时采集、配置核对、异常处置与报告输出。建议以统一目录与标准输出规范组织脚本,便于与监控平台、告警系统、工单系统对接。

一、原理草图:巡检数据流与闭环

文章图片

二、脚本目录与依赖安装(示例)
- 依赖:mysql-client用于连接 Admin 接口;jq用于 JSON 处理

# Debian/Ubuntu
sudo apt-get update
sudo apt-get install -y mysql-client jq

# RHEL/CentOS
sudo yum install -y mysql jq

目录结构建议:

/opt/proxysql/ops/
├── bin/
│   ├── health_check.sh
│   ├── pool_status.sh
│   ├── rule_diff.sh
│   ├── user_audit.sh
│   └── backup_config.sh
├── conf/
│   ├── env.conf
│   └── metrics.conf
├── logs/
└── reports/

三、统一环境配置(示例文件)
/opt/proxysql/ops/conf/env.conf

PROXYSQL_ADMIN_HOST=127.0.0.1
PROXYSQL_ADMIN_PORT=6032
PROXYSQL_ADMIN_USER=monitor
PROXYSQL_ADMIN_PASS='StrongPass!'
REPORT_DIR=/opt/proxysql/ops/reports
LOG_DIR=/opt/proxysql/ops/logs

四、关键巡检脚本与命令解释

1)健康检查脚本(输出 JSON)
/opt/proxysql/ops/bin/health_check.sh

#!/usr/bin/env bash
set -euo pipefail
source /opt/proxysql/ops/conf/env.conf

ts=$(date +%F_%H%M%S)
outfile="${REPORT_DIR}/health_${ts}.json"

SQL="
SELECT
  (SELECT Variable_Value FROM stats_mysql_global WHERE Variable_Name='ProxySQL_Uptime') AS uptime,
  (SELECT Variable_Value FROM stats_mysql_global WHERE Variable_Name='Questions') AS questions,
  (SELECT SUM(ConnUsed) FROM stats_mysql_connection_pool) AS conn_used,
  (SELECT SUM(ConnFree) FROM stats_mysql_connection_pool) AS conn_free,
  (SELECT SUM(MaxConnUsed) FROM stats_mysql_connection_pool) AS max_conn_used;
"

mysql -h"${PROXYSQL_ADMIN_HOST}" -P"${PROXYSQL_ADMIN_PORT}" \
  -u"${PROXYSQL_ADMIN_USER}" -p"${PROXYSQL_ADMIN_PASS}" \
  -Nse "$SQL" | awk '{
  printf("{\"uptime\":%s,\"questions\":%s,\"conn_used\":%s,\"conn_free\":%s,\"max_conn_used\":%s}\n",$1,$2,$3,$4,$5)
}' | tee "$outfile"

# 退出码说明:0成功;非0失败(供监控系统判断)

命令解释
- mysql -Nse-N去掉表头,-s精简输出,-e执行SQL
- awk:将输出映射为 JSON

2)连接池与后端健康
/opt/proxysql/ops/bin/pool_status.sh

#!/usr/bin/env bash
set -euo pipefail
source /opt/proxysql/ops/conf/env.conf

SQL="
SELECT hostgroup, srv_host, status, ConnUsed, ConnFree, ConnErr, MaxConnUsed
FROM stats_mysql_connection_pool;
"

mysql -h"${PROXYSQL_ADMIN_HOST}" -P"${PROXYSQL_ADMIN_PORT}" \
  -u"${PROXYSQL_ADMIN_USER}" -p"${PROXYSQL_ADMIN_PASS}" \
  -e "$SQL" | tee "${REPORT_DIR}/pool_status_$(date +%F_%H%M%S).txt"

解释
- statusONLINE/OFFLINE_SOFT/OFFLINE_HARD/SHUNNED
- ConnErr大于0需排查后端连接异常

3)规则一致性校验(与 Git 版本对比)
/opt/proxysql/ops/bin/rule_diff.sh

#!/usr/bin/env bash
set -euo pipefail
source /opt/proxysql/ops/conf/env.conf

# 导出运行态规则
mysql -h"${PROXYSQL_ADMIN_HOST}" -P"${PROXYSQL_ADMIN_PORT}" \
  -u"${PROXYSQL_ADMIN_USER}" -p"${PROXYSQL_ADMIN_PASS}" \
  -e "SELECT rule_id, active, match_digest, destination_hostgroup, apply FROM mysql_query_rules ORDER BY rule_id;" \
  > /tmp/runtime_rules.txt

# 对比版本库中的基线规则
diff -u /opt/proxysql/ops/conf/rules_baseline.txt /tmp/runtime_rules.txt \
  | tee "${REPORT_DIR}/rule_diff_$(date +%F_%H%M%S).diff"

四、配置备份与热加载检查
/opt/proxysql/ops/bin/backup_config.sh

#!/usr/bin/env bash
set -euo pipefail
source /opt/proxysql/ops/conf/env.conf

ts=$(date +%F_%H%M%S)
mysql -h"${PROXYSQL_ADMIN_HOST}" -P"${PROXYSQL_ADMIN_PORT}" \
  -u"${PROXYSQL_ADMIN_USER}" -p"${PROXYSQL_ADMIN_PASS}" \
  -e "SELECT * FROM mysql_users;" > "${REPORT_DIR}/mysql_users_${ts}.txt"

mysql -h"${PROXYSQL_ADMIN_HOST}" -P"${PROXYSQL_ADMIN_PORT}" \
  -u"${PROXYSQL_ADMIN_USER}" -p"${PROXYSQL_ADMIN_PASS}" \
  -e "SAVE MYSQL USERS TO DISK;" \
  && echo "SAVE MYSQL USERS TO DISK OK" \
  || echo "SAVE MYSQL USERS TO DISK FAIL"

五、定时执行与输出示例

1)Cron 定时任务

crontab -e
# 每5分钟健康检查
*/5 * * * * /opt/proxysql/ops/bin/health_check.sh >> /opt/proxysql/ops/logs/health.log 2>&1
# 每小时连接池巡检
0 * * * * /opt/proxysql/ops/bin/pool_status.sh >> /opt/proxysql/ops/logs/pool.log 2>&1
# 每天规则对比
30 1 * * * /opt/proxysql/ops/bin/rule_diff.sh >> /opt/proxysql/ops/logs/rule_diff.log 2>&1

2)预期输出(健康检查 JSON)

{"uptime":86400,"questions":12000000,"conn_used":120,"conn_free":300,"max_conn_used":240}

六、排错与故障定位(常见场景与命令)

1)无法连接 Admin 接口
- 检查端口与监听:

ss -lntp | grep 6032
  • 检查账号权限:
mysql -h127.0.0.1 -P6032 -uadmin -p -e "SELECT * FROM mysql_users;"

2)连接池异常(ConnErr持续增长)
- 查看后端状态:

mysql -h127.0.0.1 -P6032 -uadmin -p \
  -e "SELECT hostgroup, srv_host, status, ConnErr FROM stats_mysql_connection_pool;"
  • 处置示例:将异常后端下线后再恢复
UPDATE mysql_servers SET status='OFFLINE_SOFT' WHERE hostname='10.0.0.12';
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;

3)规则加载失败
- 检查运行态与持久化差异:

SELECT COUNT(*) FROM mysql_query_rules;
SELECT COUNT(*) FROM runtime_mysql_query_rules;

七、自动化告警与自愈钩子示例(伪代码)

# 判断连接错误阈值
if [ "$ConnErr" -gt 100 ]; then
  echo "ALERT: ConnErr too high"
  # 自动下线异常节点
  mysql -h127.0.0.1 -P6032 -uadmin -p \
    -e "UPDATE mysql_servers SET status='OFFLINE_SOFT' WHERE hostname='10.0.0.12';
        LOAD MYSQL SERVERS TO RUNTIME;"
fi

八、练习
1. 编写脚本统计每个 hostgroup 的 ConnUsed 总和并输出 JSON。
2. 修改 rule_diff.sh,在差异不为空时向 webhook 推送告警。
3. 模拟后端节点异常,观察 stats_mysql_connection_pool 的状态变化,并记录处置步骤与恢复时间。
4. 将 health_check.sh 的输出接入 Prometheus(文本格式)并实现可视化。