14.8.3 后端MySQL故障检测与自动切换

后端MySQL故障检测与自动切换#

原理与架构草图#

文章图片

故障检测目标与原则#

  • 目标:快速发现不可用节点、减少业务中断、避免误切换导致的数据风险
  • 原则:写库可用性优先于读库、检测需结合复制延迟与只读状态
  • 判定核心:连接/查询探测 + read_only/复制延迟 + 业务可达性

安装与基础监控账号示例#

# 1) 安装(以Ubuntu为例)
sudo apt-get update
sudo apt-get install -y proxysql mysql-client

# 2) 在后端MySQL创建监控账号(最小权限)
mysql -uroot -p -h 192.168.10.10 <<'SQL'
CREATE USER 'pxy_monitor'@'%' IDENTIFIED BY 'Mon!Pass123';
GRANT USAGE, REPLICATION CLIENT ON *.* TO 'pxy_monitor'@'%';
FLUSH PRIVILEGES;
SQL

# 3) 验证监控账号可连接与查询
mysql -upxy_monitor -pMon!Pass123 -h 192.168.10.10 -e "SELECT 1;"

检测机制与配置要点(含命令解释)#

-- 进入ProxySQL管理端(默认6032)
mysql -uadmin -padmin -h 127.0.0.1 -P6032

-- 1) 设置监控账号与检测间隔
SET mysql-monitor_username='pxy_monitor';
SET mysql-monitor_password='Mon!Pass123';
SET mysql-monitor_ping_interval=2000;          -- 2s ping
SET mysql-monitor_connect_interval=2000;       -- 2s connect
SET mysql-monitor_read_only_interval=2000;     -- 2s read_only检测
SET mysql-monitor_replication_lag_interval=3000; -- 3s 复制延迟
LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;

-- 2) 添加后端节点与读写组
INSERT INTO mysql_servers(hostgroup_id,hostname,port,weight,max_connections)
VALUES
(10,'192.168.10.10',3306,100,1000),  -- 写组主库
(20,'192.168.10.11',3306,50,1000),   -- 读组从库
(20,'192.168.10.12',3306,50,1000);   -- 读组从库
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;

-- 3) 查看监控结果(命令解释:查看节点连接与延迟)
SELECT * FROM stats_mysql_connection_pool;
SELECT * FROM monitor.mysql_server_connect_log ORDER BY time_start_us DESC LIMIT 5;
SELECT * FROM monitor.mysql_server_ping_log ORDER BY time_start_us DESC LIMIT 5;
SELECT * FROM monitor.mysql_server_read_only_log ORDER BY time_start_us DESC LIMIT 5;
SELECT * FROM monitor.mysql_server_replication_lag_log ORDER BY time_start_us DESC LIMIT 5;

自动切换流程与触发逻辑#

  1. 监控线程检测主库连接失败或 read_only=1
  2. 将主库标记为 OFFLINE_SOFT,停止新连接
  3. 从读组中挑选延迟最小、read_only=0 的从库
  4. 提升为写组(hostgroup 10),刷新路由
  5. 旧主恢复后降级为读组或人工回切

自动切换示例(含脚本)#

说明:示例采用外部脚本配合ProxySQL执行切换;在生产环境建议结合 Orchestrator/MHA。

# /usr/local/bin/pxy_failover.sh
#!/usr/bin/env bash
set -e

ADMIN="mysql -uadmin -padmin -h127.0.0.1 -P6032 -Nse"
PRIMARY="192.168.10.10"
CANDIDATE="192.168.10.11"

# 1) 检测主库是否不可用
if ! mysqladmin -h "${PRIMARY}" -upxy_monitor -pMon\!Pass123 ping >/dev/null 2>&1; then
  echo "[WARN] Primary down: ${PRIMARY}"

  # 2) 将旧主下线
  $ADMIN "UPDATE mysql_servers SET status='OFFLINE_SOFT' WHERE hostname='${PRIMARY}' AND hostgroup_id=10;"
  $ADMIN "LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;"

  # 3) 提升候选从库为写组
  $ADMIN "DELETE FROM mysql_servers WHERE hostname='${CANDIDATE}' AND hostgroup_id=20;"
  $ADMIN "INSERT INTO mysql_servers(hostgroup_id,hostname,port,weight,max_connections)
          VALUES(10,'${CANDIDATE}',3306,100,1000);"
  $ADMIN "LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;"

  echo "[OK] Promoted ${CANDIDATE} to writer hostgroup 10"
fi

复制状态与延迟策略(含检测语句)#

-- 在候选从库检查延迟与只读
SHOW SLAVE STATUS\G
-- 关键字段解释:
-- Seconds_Behind_Master: 复制延迟(为0或小于阈值)
-- Slave_SQL_Running/Slave_IO_Running: 必须为Yes
-- read_only: 需为OFF(0)才能作为写库

-- 在主库检查是否意外只读
SHOW VARIABLES LIKE 'read_only';

典型故障场景与排错#

  • 主库宕机
  • 排错命令:
    bash mysqladmin -h 192.168.10.10 -upxy_monitor -pMon\!Pass123 ping journalctl -u mysql -n 100 --no-pager
  • 处理:执行自动切换脚本或人工切换写组
  • 主库网络隔离(脑裂风险)
  • 排错命令:
    bash ping -c 3 192.168.10.10 ip route get 192.168.10.10
  • 处理:结合外部仲裁,避免双写
  • 复制中断
  • 排错命令:
    sql SHOW SLAVE STATUS\G
  • 处理:修复复制,再允许晋升
  • 监控误判(频繁切换)
  • 排错命令:
    sql SELECT * FROM monitor.mysql_server_ping_log ORDER BY time_start_us DESC LIMIT 10;
  • 处理:增大 mysql-monitor_*_interval 或超时参数

演练与验证(可复制步骤)#

# 1) 模拟主库不可用(在主库上)
sudo systemctl stop mysql

# 2) 观察ProxySQL监控日志
mysql -uadmin -padmin -h127.0.0.1 -P6032 -e \
"SELECT * FROM monitor.mysql_server_connect_log ORDER BY time_start_us DESC LIMIT 5;"

# 3) 执行切换脚本
sudo /usr/local/bin/pxy_failover.sh

# 4) 验证写组主库是否变更
mysql -uadmin -padmin -h127.0.0.1 -P6032 -e \
"SELECT hostgroup_id,hostname,status FROM mysql_servers ORDER BY hostgroup_id,hostname;"

练习#

  1. mysql-monitor_ping_interval 调整为 1000ms,观察监控日志频率变化
  2. 制造从库延迟(停止SQL线程),验证其不会被选为写库
  3. 编写脚本自动检测 read_only 并输出告警

最佳实践#

  • 监控账号最小权限、密码轮换
  • 延迟阈值与只读检测必须同时启用
  • 定期演练切换与回切,记录步骤与时长
  • 生产环境建议接入 Orchestrator/MHA 等统一决策组件