TiKV_GC_can_not_work 解决

张映 发表于 2021-02-24

分类目录: tidb

标签:,

tidb自带有三个监控工具

tidb dashboard pd,地址:http://10.0.10.17:2379/dashboard

grafana:http://10.0.10.11:3000/?orgId=1

Alertmanager:http://10.0.10.11:9093/#/alerts

这个工具可以帮助我们获取不少信息。

10.0.10.17是pd服务器,10.0.10.11是监控服务器

Alertmanager,上面看到TiKV GC CAN NOT WORK这个warning,并且一直都存在。

官方说法:在 6 小时内 Region 上没有成功执行 GC,说明 GC 不能正常工作了。短期内 GC 不运行不会造成太大的影响,但如果 GC 一直不运行,版本会越来越多,从而导致查询变慢

1,查一下gc leader主机

mysql>  SELECT VARIABLE_NAME, VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME LIKE 'tikv_gc%';
+--------------------------+-----------------------------------------------------------------------------------------+
| VARIABLE_NAME            | VARIABLE_VALUE                                                                          |
+--------------------------+-----------------------------------------------------------------------------------------+
| tikv_gc_leader_uuid      | 5dadecc7e80000f                                                                         |
| tikv_gc_leader_desc      | host:tidb1, pid:68720, start at 2020-12-31 15:28:58.309492429 +0800 CST m=+13.138605313 |
| tikv_gc_leader_lease     | 20210223-21:02:58 +0800                                                                 |
| tikv_gc_enable           | true                                                                                    |
| tikv_gc_run_interval     | 10m0s                                                                                   |
| tikv_gc_life_time        | 30m                                                                                     |
| tikv_gc_last_run_time    | 20210223-20:55:58 +0800                                                                 |
| tikv_gc_safe_point       | 20210222-14:56:10 +0800                                                                 |
| tikv_gc_auto_concurrency | true                                                                                    |
| tikv_gc_mode             | distributed                                                                             |
+--------------------------+-----------------------------------------------------------------------------------------+
10 rows in set (0.00 sec)

从结果可以看出,gc leader是tidb1这台机器,tikv_gc_safe_point开始于20210222-14:56:10 +0800 ,并且是没有结束的。

2,在tidb1上面,查日志

# grep gc_worker /home/tidb/tidb-deploy/tidb-4000/log/tidb.log

[2021/02/23 18:57:39.207 +08:00] [INFO] [gc_worker.go:678] ["[gc worker] start redo-delete ranges"] [uuid=5dadecc7e80000f] ["num of ranges"=0]
[2021/02/23 18:57:39.207 +08:00] [INFO] [gc_worker.go:707] ["[gc worker] finish redo-delete ranges"] [uuid=5dadecc7e80000f] ["num of ranges"=0] ["cost time"=677ns]
[2021/02/23 18:57:39.210 +08:00] [INFO] [gc_worker.go:1469] ["[gc worker] sent safe point to PD"] [uuid=5dadecc7e80000f] ["safe point"=423094378829447183]
[2021/02/23 19:05:58.371 +08:00] [INFO] [gc_worker.go:527] ["[gc worker] there's another service in the cluster requires an earlier safe point. gc will continue with the earlier one"] [uuid=5dadecc7e80000f] [ourSafePoint=423120485304762368] [minSafePoint=423094378829447183]

这里的uuid=5dadecc7e80000f根命令行下tikv_gc_leader_uuid的值是一样的,可以确定,影响gc的日志就在这儿了。

3,找原因

mysql> show processlist;
+------+---------------+-------------+--------------------+---------+------+-------+------------------+
| Id   | User          | Host        | db                 | Command | Time | State | Info             |
+------+---------------+-------------+--------------------+---------+------+-------+------------------+
|  508 | root          | 10.0.10.11  | INFORMATION_SCHEMA | Query   |    0 | 2     | show processlist |
|  503 | report_select | 10.0.51.208 | ziwei              | Sleep   | 8405 | 2     | NULL             |
|  502 | report_select | 10.0.51.208 | ziwei              | Sleep   | 7977 | 2     | NULL             |
+------+---------------+-------------+--------------------+---------+------+-------+------------------+
3 rows in set (0.00 sec)

mysql> kill tidb 502;  //杀进程

mysql> kill tidb 503;

杀完后没有什么用,重启tidb也不启作用。

http://10.0.10.11:9093/#/alerts,报了很多ticdc的warning,说有一个是延迟很长,测试了一下并不长,

[tidb@jiankong ~]$ tiup ctl cdc changefeed list  --pd=http://10.0.10.19:2379
Found ctl newer version:

    The latest version:         v5.0.0-rc
    Local installed version:    v4.0.9
    Update current component:   tiup update ctl
    Update all components:      tiup update --all

Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.9/ctl cdc changefeed list --pd=http://10.0.10.19:2379
[
  {
    "id": "tidb-to-18-mysql",
    "summary": {
      "state": "normal",
      "tso": 423134310844858374,
      "checkpoint": "2021-02-24 09:14:58.572",
      "error": null
    }
  },
  {
    "id": "tidb2ziwei-result",
    "summary": {
      "state": "stopped",
      "tso": 423094378829447183,
      "checkpoint": "2021-02-22 14:56:10.022",
      "error": null
    }
  },
  {
    "id": "tidb2ziwei-saas-master",
    "summary": {
      "state": "normal",
      "tso": 423122962998099985,
      "checkpoint": "2021-02-23 21:13:29.972",
      "error": null
    }
  }
]

stopped的checkpoint的时间和tikv_gc_safe_point的时间一模一样。
stopped的checkpoint的时间和tikv_gc_safe_point的时间一模一样。
stopped的checkpoint的时间和tikv_gc_safe_point的时间一模一样。

启动停掉的ticdc task

# tiup ctl cdc changefeed resume --pd=http://10.0.10.18:2379 -c tidb2ziwei-result

然后warning没有了,手动停掉ticdc task尽然会影响到GC工作,很是无语。tidb还是要多多改进。

4,看一下grafana的监控

grafana gc 监控

grafana gc 监控



转载请注明
作者:海底苍鹰
地址:http://blog.51yip.com/tidb/2503.html