tidb自带有三个监控工具
tidb dashboard pd,地址:http://10.0.10.17:2379/dashboard
grafana:http://10.0.10.11:3000/?orgId=1
Alertmanager:http://10.0.10.11:9093/#/alerts
这个工具可以帮助我们获取不少信息。
10.0.10.17是pd服务器,10.0.10.11是监控服务器
Alertmanager,上面看到TiKV GC CAN NOT WORK这个warning,并且一直都存在。
官方说法:在 6 小时内 Region 上没有成功执行 GC,说明 GC 不能正常工作了。短期内 GC 不运行不会造成太大的影响,但如果 GC 一直不运行,版本会越来越多,从而导致查询变慢。
1,查一下gc leader主机
mysql> SELECT VARIABLE_NAME, VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME LIKE 'tikv_gc%'; +--------------------------+-----------------------------------------------------------------------------------------+ | VARIABLE_NAME | VARIABLE_VALUE | +--------------------------+-----------------------------------------------------------------------------------------+ | tikv_gc_leader_uuid | 5dadecc7e80000f | | tikv_gc_leader_desc | host:tidb1, pid:68720, start at 2020-12-31 15:28:58.309492429 +0800 CST m=+13.138605313 | | tikv_gc_leader_lease | 20210223-21:02:58 +0800 | | tikv_gc_enable | true | | tikv_gc_run_interval | 10m0s | | tikv_gc_life_time | 30m | | tikv_gc_last_run_time | 20210223-20:55:58 +0800 | | tikv_gc_safe_point | 20210222-14:56:10 +0800 | | tikv_gc_auto_concurrency | true | | tikv_gc_mode | distributed | +--------------------------+-----------------------------------------------------------------------------------------+ 10 rows in set (0.00 sec)
从结果可以看出,gc leader是tidb1这台机器,tikv_gc_safe_point开始于20210222-14:56:10 +0800 ,并且是没有结束的。
2,在tidb1上面,查日志
# grep gc_worker /home/tidb/tidb-deploy/tidb-4000/log/tidb.log
[2021/02/23 18:57:39.207 +08:00] [INFO] [gc_worker.go:678] ["[gc worker] start redo-delete ranges"] [uuid=5dadecc7e80000f] ["num of ranges"=0]
[2021/02/23 18:57:39.207 +08:00] [INFO] [gc_worker.go:707] ["[gc worker] finish redo-delete ranges"] [uuid=5dadecc7e80000f] ["num of ranges"=0] ["cost time"=677ns]
[2021/02/23 18:57:39.210 +08:00] [INFO] [gc_worker.go:1469] ["[gc worker] sent safe point to PD"] [uuid=5dadecc7e80000f] ["safe point"=423094378829447183]
[2021/02/23 19:05:58.371 +08:00] [INFO] [gc_worker.go:527] ["[gc worker] there's another service in the cluster requires an earlier safe point. gc will continue with the earlier one"] [uuid=5dadecc7e80000f] [ourSafePoint=423120485304762368] [minSafePoint=423094378829447183]
这里的uuid=5dadecc7e80000f根命令行下tikv_gc_leader_uuid的值是一样的,可以确定,影响gc的日志就在这儿了。
3,找原因
mysql> show processlist; +------+---------------+-------------+--------------------+---------+------+-------+------------------+ | Id | User | Host | db | Command | Time | State | Info | +------+---------------+-------------+--------------------+---------+------+-------+------------------+ | 508 | root | 10.0.10.11 | INFORMATION_SCHEMA | Query | 0 | 2 | show processlist | | 503 | report_select | 10.0.51.208 | ziwei | Sleep | 8405 | 2 | NULL | | 502 | report_select | 10.0.51.208 | ziwei | Sleep | 7977 | 2 | NULL | +------+---------------+-------------+--------------------+---------+------+-------+------------------+ 3 rows in set (0.00 sec) mysql> kill tidb 502; //杀进程 mysql> kill tidb 503;
杀完后没有什么用,重启tidb也不启作用。
http://10.0.10.11:9093/#/alerts,报了很多ticdc的warning,说有一个是延迟很长,测试了一下并不长,
[tidb@jiankong ~]$ tiup ctl cdc changefeed list --pd=http://10.0.10.19:2379 Found ctl newer version: The latest version: v5.0.0-rc Local installed version: v4.0.9 Update current component: tiup update ctl Update all components: tiup update --all Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.9/ctl cdc changefeed list --pd=http://10.0.10.19:2379 [ { "id": "tidb-to-18-mysql", "summary": { "state": "normal", "tso": 423134310844858374, "checkpoint": "2021-02-24 09:14:58.572", "error": null } }, { "id": "tidb2ziwei-result", "summary": { "state": "stopped", "tso": 423094378829447183, "checkpoint": "2021-02-22 14:56:10.022", "error": null } }, { "id": "tidb2ziwei-saas-master", "summary": { "state": "normal", "tso": 423122962998099985, "checkpoint": "2021-02-23 21:13:29.972", "error": null } } ]
stopped的checkpoint的时间和tikv_gc_safe_point的时间一模一样。
stopped的checkpoint的时间和tikv_gc_safe_point的时间一模一样。
stopped的checkpoint的时间和tikv_gc_safe_point的时间一模一样。
启动停掉的ticdc task
# tiup ctl cdc changefeed resume --pd=http://10.0.10.18:2379 -c tidb2ziwei-result
然后warning没有了,手动停掉ticdc task尽然会影响到GC工作,很是无语。tidb还是要多多改进。
4,看一下grafana的监控
转载请注明
作者:海底苍鹰
地址:http://blog.51yip.com/tidb/2503.html