14.7.5 运营维护与容量评估

运营维护与容量评估#

运营维护以“稳定性、可用性、成本”三目标为主线,围绕连接数、吞吐、延迟、路由命中与后端负载建立可量化SLA与扩容触发标准。以下内容包含原理草图、监控查询、配置示例、排错与练习。

原理草图:容量评估链路

文章图片

一、日常运营维护要点(含命令示例)

1) 配置一致性校验

-- 连接Admin接口
mysql -uadmin -padmin -h127.0.0.1 -P6032

-- 对比runtime与disk配置
SELECT 'mysql_servers' AS tbl, COUNT(*) FROM runtime_mysql_servers;
SELECT 'mysql_servers' AS tbl, COUNT(*) FROM disk.mysql_servers;

-- 如不一致,执行持久化
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;

预期效果:runtime与disk行数一致,配置变更可在重启后生效。

2) 路由规则健康检查

-- 规则命中统计
SELECT rule_id, active, match_pattern, hits
FROM stats_mysql_query_rules
ORDER BY hits DESC
LIMIT 10;

-- 规则冲突与冗余排查(同指纹匹配多条)
SELECT digest, COUNT(*) AS c
FROM stats_mysql_query_digest
GROUP BY digest
HAVING c > 1;

3) 连接池利用率

SELECT hostgroup, srv_host, srv_port,
       ConnUsed, ConnFree, ConnOK, ConnERR
FROM stats_mysql_connection_pool
ORDER BY ConnUsed DESC
LIMIT 10;

解释ConnUsed高且ConnFree持续低,说明连接池容量偏紧。

4) 后端实例健康

SELECT hostgroup_id, hostname, port, status, weight, max_connections
FROM runtime_mysql_servers
ORDER BY hostgroup_id, weight DESC;

二、容量评估指标体系(含阈值与计算示例)

  • 并发连接容量
  • 关注指标:前端连接数、连接池上限、后端max_connections
  • 计算示例:
# 统计前端连接数(Admin接口)
mysql -uadmin -padmin -h127.0.0.1 -P6032 -e \
"SELECT Variable_Value FROM stats_mysql_global WHERE Variable_Name='Client_Connections';"
  • 经验阈值:峰值连接数 / 总连接上限 < 70%

  • 查询吞吐容量

  • 关注指标:QPS、TPS、平均延迟、p95/p99
SELECT SUM(queries) AS QPS
FROM stats_mysql_global
WHERE Variable_Name='Queries';

SELECT AVG(time_total_us/exec_count) AS avg_us,
       PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY time_total_us/exec_count) AS p95_us
FROM stats_mysql_query_digest;
  • 路由与规则容量
  • 规则数 > 200 需优化结构
SELECT COUNT(*) AS rules_total FROM runtime_mysql_query_rules;
  • 后端实例容量
  • 单实例CPU持续>70%且延迟上升 → 增加后端节点

三、容量评估流程(带可执行示例)

1) 收集基线

# 每5分钟采集一次关键指标
while true; do
  mysql -uadmin -padmin -h127.0.0.1 -P6032 -e \
  "SELECT NOW(), Variable_Name, Variable_Value
   FROM stats_mysql_global
   WHERE Variable_Name IN ('Client_Connections','Queries','MySQL_Thread');"
  sleep 300
done >> /var/log/proxysql/capacity_baseline.log

2) 趋势分析
结合Prometheus/日志工具绘制趋势图,识别拐点。

3) 瓶颈定位

-- 区分代理瓶颈还是后端瓶颈
SELECT Variable_Name, Variable_Value
FROM stats_mysql_global
WHERE Variable_Name IN ('MySQL_Thread','MySQL_Threads_running','Client_Connections');

SELECT hostgroup, srv_host, ConnERR, ConnOK
FROM stats_mysql_connection_pool
ORDER BY ConnERR DESC;

4) 扩容方案验证

-- 新增后端实例并加载
INSERT INTO mysql_servers(hostgroup_id,hostname,port,weight,max_connections)
VALUES (10,'10.0.0.12',3306,1000,2000);
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;

四、常见容量风险与排错指引

1) 连接池耗尽
- 现象:ConnERR上升、客户端报错“Too many connections”
- 排错命令:

SELECT hostgroup, srv_host, ConnUsed, ConnFree, ConnERR
FROM stats_mysql_connection_pool
ORDER BY ConnERR DESC;
  • 处理:
-- 临时提升后端连接上限
UPDATE mysql_servers SET max_connections=3000 WHERE hostgroup_id=10;
LOAD MYSQL SERVERS TO RUNTIME;

2) 规则过多导致路由延迟
- 现象:p95延迟升高、规则命中降低
- 排错命令:

SELECT COUNT(*) AS rules_total FROM runtime_mysql_query_rules;
SELECT rule_id, hits FROM stats_mysql_query_rules ORDER BY hits ASC LIMIT 10;
  • 处理:合并规则、减少正则匹配

3) 读写不均衡
- 现象:读库负载集中、复制延迟升高
- 排错命令:

SELECT hostgroup, srv_host, ConnUsed, ConnOK
FROM stats_mysql_connection_pool
WHERE hostgroup IN (20,21)
ORDER BY ConnUsed DESC;
  • 处理:调整权重或补充只读实例

五、运营输出物模板(示例片段)

# ProxySQL容量评估报告
- 峰值QPS:18,000
- p95延迟:12ms
- 峰值连接:6,200
- 扩容建议:新增2个读库实例,ProxySQL横向扩2节点
- 回滚方案:下线新实例并恢复权重

六、练习与自检

1) 采集stats_mysql_globalClient_ConnectionsQueries并绘制1天趋势图。
2) 人为制造规则冗余,验证命中率变化并优化规则。
3) 模拟后端实例CPU高负载,调整权重观察ConnUsed变化。