14.7.5 运营维护与容量评估
运营维护与容量评估#
运营维护以“稳定性、可用性、成本”三目标为主线,围绕连接数、吞吐、延迟、路由命中与后端负载建立可量化SLA与扩容触发标准。以下内容包含原理草图、监控查询、配置示例、排错与练习。
原理草图:容量评估链路
一、日常运营维护要点(含命令示例)
1) 配置一致性校验
-- 连接Admin接口
mysql -uadmin -padmin -h127.0.0.1 -P6032
-- 对比runtime与disk配置
SELECT 'mysql_servers' AS tbl, COUNT(*) FROM runtime_mysql_servers;
SELECT 'mysql_servers' AS tbl, COUNT(*) FROM disk.mysql_servers;
-- 如不一致,执行持久化
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
预期效果:runtime与disk行数一致,配置变更可在重启后生效。
2) 路由规则健康检查
-- 规则命中统计
SELECT rule_id, active, match_pattern, hits
FROM stats_mysql_query_rules
ORDER BY hits DESC
LIMIT 10;
-- 规则冲突与冗余排查(同指纹匹配多条)
SELECT digest, COUNT(*) AS c
FROM stats_mysql_query_digest
GROUP BY digest
HAVING c > 1;
3) 连接池利用率
SELECT hostgroup, srv_host, srv_port,
ConnUsed, ConnFree, ConnOK, ConnERR
FROM stats_mysql_connection_pool
ORDER BY ConnUsed DESC
LIMIT 10;
解释:ConnUsed高且ConnFree持续低,说明连接池容量偏紧。
4) 后端实例健康
SELECT hostgroup_id, hostname, port, status, weight, max_connections
FROM runtime_mysql_servers
ORDER BY hostgroup_id, weight DESC;
二、容量评估指标体系(含阈值与计算示例)
- 并发连接容量
- 关注指标:前端连接数、连接池上限、后端
max_connections - 计算示例:
# 统计前端连接数(Admin接口)
mysql -uadmin -padmin -h127.0.0.1 -P6032 -e \
"SELECT Variable_Value FROM stats_mysql_global WHERE Variable_Name='Client_Connections';"
-
经验阈值:峰值连接数 / 总连接上限 < 70%
-
查询吞吐容量
- 关注指标:QPS、TPS、平均延迟、p95/p99
SELECT SUM(queries) AS QPS
FROM stats_mysql_global
WHERE Variable_Name='Queries';
SELECT AVG(time_total_us/exec_count) AS avg_us,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY time_total_us/exec_count) AS p95_us
FROM stats_mysql_query_digest;
- 路由与规则容量
- 规则数 > 200 需优化结构
SELECT COUNT(*) AS rules_total FROM runtime_mysql_query_rules;
- 后端实例容量
- 单实例CPU持续>70%且延迟上升 → 增加后端节点
三、容量评估流程(带可执行示例)
1) 收集基线
# 每5分钟采集一次关键指标
while true; do
mysql -uadmin -padmin -h127.0.0.1 -P6032 -e \
"SELECT NOW(), Variable_Name, Variable_Value
FROM stats_mysql_global
WHERE Variable_Name IN ('Client_Connections','Queries','MySQL_Thread');"
sleep 300
done >> /var/log/proxysql/capacity_baseline.log
2) 趋势分析
结合Prometheus/日志工具绘制趋势图,识别拐点。
3) 瓶颈定位
-- 区分代理瓶颈还是后端瓶颈
SELECT Variable_Name, Variable_Value
FROM stats_mysql_global
WHERE Variable_Name IN ('MySQL_Thread','MySQL_Threads_running','Client_Connections');
SELECT hostgroup, srv_host, ConnERR, ConnOK
FROM stats_mysql_connection_pool
ORDER BY ConnERR DESC;
4) 扩容方案验证
-- 新增后端实例并加载
INSERT INTO mysql_servers(hostgroup_id,hostname,port,weight,max_connections)
VALUES (10,'10.0.0.12',3306,1000,2000);
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
四、常见容量风险与排错指引
1) 连接池耗尽
- 现象:ConnERR上升、客户端报错“Too many connections”
- 排错命令:
SELECT hostgroup, srv_host, ConnUsed, ConnFree, ConnERR
FROM stats_mysql_connection_pool
ORDER BY ConnERR DESC;
- 处理:
-- 临时提升后端连接上限
UPDATE mysql_servers SET max_connections=3000 WHERE hostgroup_id=10;
LOAD MYSQL SERVERS TO RUNTIME;
2) 规则过多导致路由延迟
- 现象:p95延迟升高、规则命中降低
- 排错命令:
SELECT COUNT(*) AS rules_total FROM runtime_mysql_query_rules;
SELECT rule_id, hits FROM stats_mysql_query_rules ORDER BY hits ASC LIMIT 10;
- 处理:合并规则、减少正则匹配
3) 读写不均衡
- 现象:读库负载集中、复制延迟升高
- 排错命令:
SELECT hostgroup, srv_host, ConnUsed, ConnOK
FROM stats_mysql_connection_pool
WHERE hostgroup IN (20,21)
ORDER BY ConnUsed DESC;
- 处理:调整权重或补充只读实例
五、运营输出物模板(示例片段)
# ProxySQL容量评估报告
- 峰值QPS:18,000
- p95延迟:12ms
- 峰值连接:6,200
- 扩容建议:新增2个读库实例,ProxySQL横向扩2节点
- 回滚方案:下线新实例并恢复权重
六、练习与自检
1) 采集stats_mysql_global的Client_Connections与Queries并绘制1天趋势图。
2) 人为制造规则冗余,验证命中率变化并优化规则。
3) 模拟后端实例CPU高负载,调整权重观察ConnUsed变化。