6.9.4 高可用切换与故障自愈工具

在MySQL高可用场景中,切换与自愈工具的目标是缩短RTO、降低人工介入、保证数据一致性与业务连续性。常见工具包括:MHA(基于复制的主从切换)、Orchestrator(拓扑感知与自动修复)、MySQL Router+InnoDB Cluster(MGR生态)、ProxySQL/HAProxy(读写分离与故障隔离),以及与Keepalived配合的VIP漂移方案。选型时需评估复制模式(异步/半同步/组复制)、故障检测机制、切换时数据丢失窗口、与业务访问层兼容性。

高可用切换核心流程包括:故障检测、主库隔离、从库选主、位点对齐、提升主库、业务路由切换与故障确认。工具配置的关键点是统一复制账号与权限、可靠的心跳检测、日志与GTID策略、故障隔离(如自动下线故障节点)、以及对“脑裂”的防护。以Orchestrator为例,应开启拓扑自动发现与raft共识;以MHA为例需重点关注ssh互信、脚本切换安全与binlog保全;MGR需确保成员一致性、仲裁与隔离策略。

原理草图(Orchestrator + ProxySQL + Keepalived):

文章图片

示例:安装与基础配置(以Orchestrator为例)

# 1) 安装(以Debian/Ubuntu为例)
sudo apt-get update
sudo apt-get install -y orchestrator-client orchestrator

# 2) 创建MySQL复制账号(在主库执行)
mysql -uroot -p -e "
CREATE USER 'repl'@'%' IDENTIFIED BY 'Repl@123';
GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'repl'@'%';
FLUSH PRIVILEGES;
"

# 3) 配置Orchestrator(/etc/orchestrator.conf.json)
sudo tee /etc/orchestrator.conf.json >/dev/null <<'EOF'
{
  "MySQLTopologyCredentialsConfigFile": "/etc/orchestrator-topology.cnf",
  "MySQLOrchestratorCredentialsConfigFile": "/etc/orchestrator.cnf",
  "DiscoverByShowSlaveHosts": true,
  "AutoDiscover": true,
  "RecoverMasterClusterFilters": ["^prod-mysql-"],
  "RecoverIntermediateMasterClusterFilters": ["^prod-mysql-"],
  "RaftEnabled": true,
  "RaftBind": "0.0.0.0",
  "RaftNodes": ["10.0.0.11:10008","10.0.0.12:10008","10.0.0.13:10008"]
}
EOF

# 4) 连接凭据(拓扑账号/管理账号)
sudo tee /etc/orchestrator-topology.cnf >/dev/null <<'EOF'
[client]
user=repl
password=Repl@123
EOF

sudo tee /etc/orchestrator.cnf >/dev/null <<'EOF'
[client]
user=orchestrator
password=Orch@123
EOF

# 5) 启动服务并发现拓扑
sudo systemctl enable --now orchestrator
orchestrator-client -c discover -i 10.0.0.21:3306

# 预期效果:Orchestrator UI可见拓扑结构并显示主从关系

示例:Keepalived + VIP 漂移(与ProxySQL联动)

# /etc/keepalived/keepalived.conf(主节点)
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 150
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 123456
    }
    virtual_ipaddress {
        10.0.0.100/24
    }
    track_script {
        chk_proxysql
    }
}

# /etc/keepalived/check_proxysql.sh
#!/bin/bash
mysqladmin -h 127.0.0.1 -uadmin -pAdmin@123 ping >/dev/null 2>&1
exit $?

# 启用脚本
chmod +x /etc/keepalived/check_proxysql.sh
systemctl restart keepalived

示例:ProxySQL 读写分离与故障隔离

-- 连接到ProxySQL管理端口
mysql -uadmin -pAdmin@123 -h 127.0.0.1 -P6032

-- 添加后端MySQL
INSERT INTO mysql_servers(hostgroup_id,hostname,port,weight) VALUES (10,'10.0.0.21',3306,100); -- master
INSERT INTO mysql_servers(hostgroup_id,hostname,port,weight) VALUES (20,'10.0.0.22',3306,100); -- slave
INSERT INTO mysql_servers(hostgroup_id,hostname,port,weight) VALUES (20,'10.0.0.23',3306,100); -- slave
LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;

-- 读写分离规则
INSERT INTO mysql_query_rules(rule_id,active,match_pattern,destination_hostgroup,apply)
VALUES (1,1,'^SELECT',20,1);
INSERT INTO mysql_query_rules(rule_id,active,match_pattern,destination_hostgroup,apply)
VALUES (2,1,'^UPDATE|^INSERT|^DELETE|^REPLACE',10,1);
LOAD MYSQL QUERY RULES TO RUNTIME; SAVE MYSQL QUERY RULES TO DISK;

故障切换示例(Orchestrator手工触发,验证流程)

# 1) 模拟主库故障
ssh mysql-master "sudo systemctl stop mysqld"

# 2) 查看拓扑与触发恢复
orchestrator-client -c topology -i 10.0.0.21:3306
orchestrator-client -c recover -i 10.0.0.21:3306

# 3) 预期效果
# - Orchestrator选举新主库并提升
# - ProxySQL将写流量切到新主
# - VIP若绑定在ProxySQL节点,业务连接无需变更

自愈能力建议覆盖三类场景:节点宕机自动拉起(systemd与监控联动)、复制中断自动修复(重建从库或自动跳过)、与节点恢复后的自动回归(重新挂载与补全位点)。故障自愈策略要与监控告警系统联动,如Prometheus+Alertmanager触发Webhook,由编排工具或脚本执行修复动作,并对关键动作设置审批或灰度开关。

自动修复脚本示例(复制中断自动修复)

#!/bin/bash
# /usr/local/bin/fix_replication.sh
# 说明:检测复制错误并执行简单跳过(仅适用于非关键业务)
ERR=$(mysql -uroot -p'Root@123' -e "SHOW SLAVE STATUS\G" | awk -F': ' '/Last_SQL_Error/ {print $2}')
if [[ -n "$ERR" ]]; then
  mysql -uroot -p'Root@123' -e "STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE;"
  echo "replication fixed: skipped 1 event"
fi

排错要点与命令(高频问题)

# 1) 复制延迟或中断
mysql -uroot -p -e "SHOW SLAVE STATUS\G" | egrep "Seconds_Behind_Master|Last_SQL_Error|Last_IO_Error"

# 2) GTID不一致
mysql -uroot -p -e "SHOW GLOBAL VARIABLES LIKE 'gtid_mode';"
mysql -uroot -p -e "SHOW MASTER STATUS\G"

# 3) Orchestrator无法发现拓扑
grep -n "ERROR" /var/log/orchestrator/orchestrator.log
orchestrator-client -c discover -i 10.0.0.21:3306

# 4) ProxySQL后端下线
mysql -uadmin -pAdmin@123 -h 127.0.0.1 -P6032 -e "SELECT * FROM mysql_servers\G"

# 5) VIP漂移失败
ip addr show | grep 10.0.0.100
systemctl status keepalived

实施实践中应注意:切换前后的数据一致性验证(GTID一致、秒级延迟阈值)、读写分离路由的延迟与缓存刷新、业务连接池重连策略,以及切换失败的回滚路径。建议建立演练机制与变更审计,定期进行故障注入(主库宕机、网络抖动、延迟拉大、磁盘只读)以校验工具有效性,并形成统一的SOP与应急预案。

练习:
1. 基于三节点MySQL异步复制,部署Orchestrator并完成一次手动切换演练,记录切换时长与GTID一致性检查结果。
2. 使用ProxySQL配置读写分离,在从库停机后验证读请求自动回退到主库。
3. 通过Keepalived模拟VIP漂移,观察业务连接是否无感知重连。
4. 编写简单自愈脚本,故障注入后自动修复复制中断并输出修复日志。