Fixing the TiKV_GC_can_not_work alert

Posted by Zhang Ying on 2021-02-24

Category: tidb


TiDB ships with three monitoring tools:

TiDB Dashboard (served by PD): http://10.0.10.17:2379/dashboard

Grafana: http://10.0.10.11:3000/?orgId=1

Alertmanager: http://10.0.10.11:9093/#/alerts

These tools can give us a lot of useful information.

10.0.10.17 is the PD server; 10.0.10.11 is the monitoring server.

In Alertmanager I saw the TiKV GC CAN NOT WORK warning, and it never went away.

The official description of this alert: if GC has not run successfully on a Region within 6 hours, GC is not working properly. GC not running for a short while has little impact, but if GC never runs, MVCC versions pile up and queries get slower and slower.

1. Find the GC leader host

mysql>  SELECT VARIABLE_NAME, VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME LIKE 'tikv_gc%';
+--------------------------+-----------------------------------------------------------------------------------------+
| VARIABLE_NAME            | VARIABLE_VALUE                                                                          |
+--------------------------+-----------------------------------------------------------------------------------------+
| tikv_gc_leader_uuid      | 5dadecc7e80000f                                                                         |
| tikv_gc_leader_desc      | host:tidb1, pid:68720, start at 2020-12-31 15:28:58.309492429 +0800 CST m=+13.138605313 |
| tikv_gc_leader_lease     | 20210223-21:02:58 +0800                                                                 |
| tikv_gc_enable           | true                                                                                    |
| tikv_gc_run_interval     | 10m0s                                                                                   |
| tikv_gc_life_time        | 30m                                                                                     |
| tikv_gc_last_run_time    | 20210223-20:55:58 +0800                                                                 |
| tikv_gc_safe_point       | 20210222-14:56:10 +0800                                                                 |
| tikv_gc_auto_concurrency | true                                                                                    |
| tikv_gc_mode             | distributed                                                                             |
+--------------------------+-----------------------------------------------------------------------------------------+
10 rows in set (0.00 sec)

The output shows that the GC leader is the machine tidb1, and that tikv_gc_safe_point has been stuck at 20210222-14:56:10 +0800 and is not advancing.
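To see how far behind the safe point is, you can parse the tikv_gc_safe_point value (TiDB prints it in the form `20210222-14:56:10 +0800`) and compare it with the current time; per the alert rule quoted above, 6 hours without progress is the threshold. A minimal sketch (the function names and the fixed "now" are my own, for illustration):

```python
from datetime import datetime, timedelta, timezone

def parse_tidb_gc_time(s: str) -> datetime:
    """Parse TiDB's GC timestamp format, e.g. '20210222-14:56:10 +0800'."""
    return datetime.strptime(s, "%Y%m%d-%H:%M:%S %z")

def safe_point_lag(safe_point: str, now: datetime) -> timedelta:
    """How long ago the GC safe point last advanced."""
    return now - parse_tidb_gc_time(safe_point)

# Using the values from the query above, at roughly the time of the incident:
now = datetime(2021, 2, 23, 21, 0, 0, tzinfo=timezone(timedelta(hours=8)))
lag = safe_point_lag("20210222-14:56:10 +0800", now)
print(lag > timedelta(hours=6))  # → True: the alert fires
```

Here the lag is over 30 hours, far past the 6-hour alert threshold.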

2. Check the logs on tidb1

# grep gc_worker /home/tidb/tidb-deploy/tidb-4000/log/tidb.log

[2021/02/23 18:57:39.207 +08:00] [INFO] [gc_worker.go:678] ["[gc worker] start redo-delete ranges"] [uuid=5dadecc7e80000f] ["num of ranges"=0]
[2021/02/23 18:57:39.207 +08:00] [INFO] [gc_worker.go:707] ["[gc worker] finish redo-delete ranges"] [uuid=5dadecc7e80000f] ["num of ranges"=0] ["cost time"=677ns]
[2021/02/23 18:57:39.210 +08:00] [INFO] [gc_worker.go:1469] ["[gc worker] sent safe point to PD"] [uuid=5dadecc7e80000f] ["safe point"=423094378829447183]
[2021/02/23 19:05:58.371 +08:00] [INFO] [gc_worker.go:527] ["[gc worker] there's another service in the cluster requires an earlier safe point. gc will continue with the earlier one"] [uuid=5dadecc7e80000f] [ourSafePoint=423120485304762368] [minSafePoint=423094378829447183]

The uuid=5dadecc7e80000f here matches the tikv_gc_leader_uuid value returned by the SQL query, so we can be sure these are the log lines that explain why GC is stuck: another service in the cluster is requesting an earlier safe point, and GC is held back to it.
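The `safe point=423094378829447183` in the log is a TiDB TSO: the upper bits are a physical Unix timestamp in milliseconds, and the low 18 bits are a logical counter. A small sketch to decode it (the helper name and timezone formatting are my own):

```python
from datetime import datetime, timedelta, timezone

TSO_LOGICAL_BITS = 18  # low 18 bits of a TSO are the logical counter

def decode_tso(tso: int):
    """Split a TiDB TSO into (physical ms since epoch, logical counter)."""
    return tso >> TSO_LOGICAL_BITS, tso & ((1 << TSO_LOGICAL_BITS) - 1)

physical_ms, logical = decode_tso(423094378829447183)
cst = timezone(timedelta(hours=8))
print(datetime.fromtimestamp(physical_ms / 1000, tz=cst))
# physical part corresponds to 2021-02-22 14:56:10.022 +0800,
# i.e. exactly the stuck tikv_gc_safe_point from step 1
```

So the minSafePoint in the log is a timestamp from 2021-02-22 — something has been pinning GC there for over a day.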

3. Find the cause

mysql> show processlist;
+------+---------------+-------------+--------------------+---------+------+-------+------------------+
| Id   | User          | Host        | db                 | Command | Time | State | Info             |
+------+---------------+-------------+--------------------+---------+------+-------+------------------+
|  508 | root          | 10.0.10.11  | INFORMATION_SCHEMA | Query   |    0 | 2     | show processlist |
|  503 | report_select | 10.0.51.208 | ziwei              | Sleep   | 8405 | 2     | NULL             |
|  502 | report_select | 10.0.51.208 | ziwei              | Sleep   | 7977 | 2     | NULL             |
+------+---------------+-------------+--------------------+---------+------+-------+------------------+
3 rows in set (0.00 sec)

mysql> kill tidb 502;  -- kill the session

mysql> kill tidb 503;

Killing them did not help, and restarting tidb had no effect either.

Alertmanager (http://10.0.10.11:9093/#/alerts) was also reporting a lot of TiCDC warnings, one of them claiming a very large replication lag. I checked, and the lag itself was not actually long:

[tidb@jiankong ~]$ tiup ctl cdc changefeed list  --pd=http://10.0.10.19:2379
Found ctl newer version:

    The latest version:         v5.0.0-rc
    Local installed version:    v4.0.9
    Update current component:   tiup update ctl
    Update all components:      tiup update --all

Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.9/ctl cdc changefeed list --pd=http://10.0.10.19:2379
[
  {
    "id": "tidb-to-18-mysql",
    "summary": {
      "state": "normal",
      "tso": 423134310844858374,
      "checkpoint": "2021-02-24 09:14:58.572",
      "error": null
    }
  },
  {
    "id": "tidb2ziwei-result",
    "summary": {
      "state": "stopped",
      "tso": 423094378829447183,
      "checkpoint": "2021-02-22 14:56:10.022",
      "error": null
    }
  },
  {
    "id": "tidb2ziwei-saas-master",
    "summary": {
      "state": "normal",
      "tso": 423122962998099985,
      "checkpoint": "2021-02-23 21:13:29.972",
      "error": null
    }
  }
]

The stopped changefeed's checkpoint is exactly the same time as tikv_gc_safe_point!
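With several changefeeds, the culprit can be picked out mechanically: load the `cdc changefeed list` output and look for entries whose state is not normal and whose tso is at or below the GC safe point TSO from the gc_worker log. A sketch using the listing above (the JSON is trimmed to the relevant fields; the variable names are mine):

```python
import json

# Trimmed output of `tiup ctl cdc changefeed list` from above
changefeeds = json.loads("""
[
  {"id": "tidb-to-18-mysql",       "summary": {"state": "normal",  "tso": 423134310844858374}},
  {"id": "tidb2ziwei-result",      "summary": {"state": "stopped", "tso": 423094378829447183}},
  {"id": "tidb2ziwei-saas-master", "summary": {"state": "normal",  "tso": 423122962998099985}}
]
""")

gc_safe_point_tso = 423094378829447183  # safe point TSO from the gc_worker log

# Changefeeds that are not replicating and are holding the safe point back
blockers = [cf["id"] for cf in changefeeds
            if cf["summary"]["state"] != "normal"
            and cf["summary"]["tso"] <= gc_safe_point_tso]
print(blockers)  # → ['tidb2ziwei-result']
```

The stopped changefeed tidb2ziwei-result is pinning the GC safe point.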

Resume the stopped TiCDC task:

# tiup ctl cdc changefeed resume --pd=http://10.0.10.18:2379 -c tidb2ziwei-result

After that the warning was gone. I was surprised that manually stopping a TiCDC task would block GC like this; TiDB still has plenty of room for improvement.

4. Check the Grafana dashboards

(Screenshots: Grafana GC monitoring panels)



When reposting, please credit:
Author: 海底苍鹰
URL: http://blog.51yip.com/tidb/2503.html
