14.8.3 后端MySQL故障检测与自动切换
后端MySQL故障检测与自动切换#
原理与架构草图#
故障检测目标与原则#
- 目标:快速发现不可用节点、减少业务中断、避免误切换导致的数据风险
- 原则:写库可用性优先于读库、检测需结合复制延迟与只读状态
- 判定核心:连接/查询探测 + read_only/复制延迟 + 业务可达性
安装与基础监控账号示例#
# 1) 安装(以Ubuntu为例)
sudo apt-get update
sudo apt-get install -y proxysql mysql-client
# 2) 在后端MySQL创建监控账号(最小权限)
mysql -uroot -p -h 192.168.10.10 <<'SQL'
CREATE USER 'pxy_monitor'@'%' IDENTIFIED BY 'Mon!Pass123';
GRANT USAGE, REPLICATION CLIENT ON *.* TO 'pxy_monitor'@'%';
FLUSH PRIVILEGES;
SQL
# 3) 验证监控账号可连接与查询
mysql -upxy_monitor -pMon!Pass123 -h 192.168.10.10 -e "SELECT 1;"
检测机制与配置要点(含命令解释)#
-- 进入ProxySQL管理端(默认6032)
mysql -uadmin -padmin -h 127.0.0.1 -P6032
-- 1) 设置监控账号与检测间隔
SET mysql-monitor_username='pxy_monitor';
SET mysql-monitor_password='Mon!Pass123';
SET mysql-monitor_ping_interval=2000; -- 2s ping
SET mysql-monitor_connect_interval=2000; -- 2s connect
SET mysql-monitor_read_only_interval=2000; -- 2s read_only检测
SET mysql-monitor_replication_lag_interval=3000; -- 3s 复制延迟
LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;
-- 2) 添加后端节点与读写组
INSERT INTO mysql_servers(hostgroup_id,hostname,port,weight,max_connections)
VALUES
(10,'192.168.10.10',3306,100,1000), -- 写组主库
(20,'192.168.10.11',3306,50,1000), -- 读组从库
(20,'192.168.10.12',3306,50,1000); -- 读组从库
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
-- 3) 查看监控结果(命令解释:查看节点连接与延迟)
SELECT * FROM stats_mysql_connection_pool;
SELECT * FROM monitor.mysql_server_connect_log ORDER BY time_start_us DESC LIMIT 5;
SELECT * FROM monitor.mysql_server_ping_log ORDER BY time_start_us DESC LIMIT 5;
SELECT * FROM monitor.mysql_server_read_only_log ORDER BY time_start_us DESC LIMIT 5;
SELECT * FROM monitor.mysql_server_replication_lag_log ORDER BY time_start_us DESC LIMIT 5;
自动切换流程与触发逻辑#
- 监控线程检测主库连接失败或
read_only=1 - 将主库标记为
OFFLINE_SOFT,停止新连接 - 从读组中挑选延迟最小、
read_only=0的从库 - 提升为写组(hostgroup 10),刷新路由
- 旧主恢复后降级为读组或人工回切
自动切换示例(含脚本)#
说明:示例采用外部脚本配合ProxySQL执行切换;在生产环境建议结合 Orchestrator/MHA。
# /usr/local/bin/pxy_failover.sh
#!/usr/bin/env bash
set -e
ADMIN="mysql -uadmin -padmin -h127.0.0.1 -P6032 -Nse"
PRIMARY="192.168.10.10"
CANDIDATE="192.168.10.11"
# 1) 检测主库是否不可用
if ! mysqladmin -h "${PRIMARY}" -upxy_monitor -pMon\!Pass123 ping >/dev/null 2>&1; then
echo "[WARN] Primary down: ${PRIMARY}"
# 2) 将旧主下线
$ADMIN "UPDATE mysql_servers SET status='OFFLINE_SOFT' WHERE hostname='${PRIMARY}' AND hostgroup_id=10;"
$ADMIN "LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;"
# 3) 提升候选从库为写组
$ADMIN "DELETE FROM mysql_servers WHERE hostname='${CANDIDATE}' AND hostgroup_id=20;"
$ADMIN "INSERT INTO mysql_servers(hostgroup_id,hostname,port,weight,max_connections)
VALUES(10,'${CANDIDATE}',3306,100,1000);"
$ADMIN "LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;"
echo "[OK] Promoted ${CANDIDATE} to writer hostgroup 10"
fi
复制状态与延迟策略(含检测语句)#
-- 在候选从库检查延迟与只读
SHOW SLAVE STATUS\G
-- 关键字段解释:
-- Seconds_Behind_Master: 复制延迟(为0或小于阈值)
-- Slave_SQL_Running/Slave_IO_Running: 必须为Yes
-- read_only: 需为OFF(0)才能作为写库
-- 在主库检查是否意外只读
SHOW VARIABLES LIKE 'read_only';
典型故障场景与排错#
- 主库宕机
- 排错命令:
bash mysqladmin -h 192.168.10.10 -upxy_monitor -pMon\!Pass123 ping journalctl -u mysql -n 100 --no-pager - 处理:执行自动切换脚本或人工切换写组
- 主库网络隔离(脑裂风险)
- 排错命令:
bash ping -c 3 192.168.10.10 ip route get 192.168.10.10 - 处理:结合外部仲裁,避免双写
- 复制中断
- 排错命令:
sql SHOW SLAVE STATUS\G - 处理:修复复制,再允许晋升
- 监控误判(频繁切换)
- 排错命令:
sql SELECT * FROM monitor.mysql_server_ping_log ORDER BY time_start_us DESC LIMIT 10; - 处理:增大
mysql-monitor_*_interval或超时参数
演练与验证(可复制步骤)#
# 1) 模拟主库不可用(在主库上)
sudo systemctl stop mysql
# 2) 观察ProxySQL监控日志
mysql -uadmin -padmin -h127.0.0.1 -P6032 -e \
"SELECT * FROM monitor.mysql_server_connect_log ORDER BY time_start_us DESC LIMIT 5;"
# 3) 执行切换脚本
sudo /usr/local/bin/pxy_failover.sh
# 4) 验证写组主库是否变更
mysql -uadmin -padmin -h127.0.0.1 -P6032 -e \
"SELECT hostgroup_id,hostname,status FROM mysql_servers ORDER BY hostgroup_id,hostname;"
练习#
- 将
mysql-monitor_ping_interval调整为 1000ms,观察监控日志频率变化 - 制造从库延迟(停止SQL线程),验证其不会被选为写库
- 编写脚本自动检测
read_only并输出告警
最佳实践#
- 监控账号最小权限、密码轮换
- 延迟阈值与只读检测必须同时启用
- 定期演练切换与回切,记录步骤与时长
- 生产环境建议接入 Orchestrator/MHA 等统一决策组件