Raft Log Synchronization Delay Alert TiKV_raft_log_lag

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: raftlog同步延迟告警TiKV_raft_log_lag

| username: Jellybean

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version]
v5.4.0

[Reproduction Path] What operations were performed to encounter the issue
[Encountered Issue: Issue Phenomenon and Impact]

  1. Due to historical reasons and business needs, a new v5.4.0 cluster was deployed. The cluster has not yet taken any business traffic, but a TiKV_raft_log_lag alert was raised.
    The meaning of the TiKV_raft_log_lag alert:
  • Alert rule: histogram_quantile(0.99, sum(rate(tikv_raftstore_log_lag_bucket[1m])) by (le, instance)) > 5000
  • Rule description: a high value indicates that a Follower is far behind its Leader and Raft cannot replicate normally. A possible cause is that the TiKV instance hosting the Follower is stuck or has crashed.
  2. The cluster has no write traffic, and all components (pd/tidb/tikv) are in normal status, with no crashed or otherwise abnormal nodes.
    The PD panel shows that the cluster has 30 regions, all of which are empty regions.
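
For reference, the expression behind the alert can be queried directly from the Prometheus HTTP API to see which instance is lagging; a minimal sketch, assuming the default Prometheus port and with the host as a placeholder:

```bash
# Query the same p99 raft log lag expression used by the alert rule,
# broken down by TiKV instance (Prometheus address is a placeholder).
curl -G 'http://<prometheus-host>:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99, sum(rate(tikv_raftstore_log_lag_bucket[1m])) by (le, instance))'
```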

  3. One TiKV node shows a high TiKV_raft_log_lag, possibly because the TiKV instance hosting the Follower is stuck.
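
To map the alerting TiKV instance address to its store ID in PD, the store list can be dumped with pd-ctl; a sketch, assuming a tiup deployment and with the PD address as a placeholder:

```bash
# List all stores with their IDs, addresses, and states
# (replace the PD address; the ctl version should match the cluster version).
tiup ctl:v5.4.0 pd -u http://<pd-host>:2379 store
```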

  4. Investigating node logs:
  1. The node logs contain no ERROR-level entries, but the node keeps logging "try to transfer leader":
    [2023/12/19 10:23:31.445 +08:00] [INFO] [pd.rs:1273] ["try to transfer leader"] [to_peer="id: 633 store_id: 1 role: IncomingVoter"] [from_peer="id: 558 store_id: 397 role: DemotingVoter"] [region_id=557]
    It is therefore suspected that something went wrong in the Raft process, so the Leader could not be switched normally.
    store_id 397 is the TiKV node that raised the alert.
  2. Investigating region 557 and peers 633 and 558 turned up no abnormal information.
  3. Running a pd-ctl region check on the cluster found no abnormal or damaged regions, only the 30 empty regions.
  4. During the problem window, PD did generate an operator to scatter the region: TiKV added 3 new learners, promoted them to voters, demoted the 3 old voters to learners, and then got stuck transferring the leader from store 397 to peer 633. The gRPC interaction between PD and TiKV was normal, but TiKV could not complete this scheduling step and kept retrying.
  5. In short, the leader transfer was stuck in an intermediate state. (A sketch of the log and pd-ctl checks above follows this list.)
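
The checks above can be reproduced roughly as follows; this is a sketch that assumes a tiup deployment, with the PD address and the TiKV log path as placeholders:

```bash
# Look for the repeating leader-transfer attempts in the TiKV log
# (log path is a placeholder for your deploy directory).
grep 'try to transfer leader' /path/to/tikv/log/tikv.log | tail

PD='http://<pd-host>:2379'

# Inspect the affected region and check the cluster for abnormal regions via pd-ctl.
tiup ctl:v5.4.0 pd -u $PD region 557
tiup ctl:v5.4.0 pd -u $PD region check empty-region
tiup ctl:v5.4.0 pd -u $PD region check down-peer
tiup ctl:v5.4.0 pd -u $PD region check pending-peer

# Show the operators PD is currently trying to execute (e.g. the stuck leader transfer).
tiup ctl:v5.4.0 pd -u $PD operator show
```
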
  5. Attempt to recover (a command sketch follows this list):
  1. Since all 30 regions in the cluster are empty regions, we tried increasing merge-schedule-limit, max-merge-region-keys, and max-merge-region-size, but this did not help; the problem persisted.
  2. Restarting the node resolved the issue.
    After the restart, peer 558 on the faulty store (store_id 397) was removed, the alert disappeared, and the monitoring curves returned to normal.
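
The merge-related tuning and the single-node restart look roughly like the following; the parameter values, cluster name, and addresses are placeholders, not the exact values used:

```bash
PD='http://<pd-host>:2379'

# Raise the region-merge related limits so empty regions get merged faster
# (example values only; tune for your own cluster).
tiup ctl:v5.4.0 pd -u $PD config set merge-schedule-limit 16
tiup ctl:v5.4.0 pd -u $PD config set max-merge-region-size 54
tiup ctl:v5.4.0 pd -u $PD config set max-merge-region-keys 540000

# What eventually cleared the stuck leader transfer: restart only the affected TiKV node
# (cluster name and node address are placeholders).
tiup cluster restart <cluster-name> -N <tikv-host>:20160
```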

  6. There was a similar issue in the community before, but no solution was proposed: "TiKV节点出现大量的TiKV_raft_log_lag的问题" (a large number of TiKV_raft_log_lag alerts on TiKV nodes) - TiDB Q&A community.
    The symptoms also resemble those described in an official bugfix.

  7. Conclusion: raft log replication from the Leader to this Follower caused the raft client thread to get stuck, which blocked PD's scheduling tasks and Raft log GC. As a temporary measure, restarting the node resolved the issue.
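
If this recurs, one way to corroborate a stuck raftstore thread before restarting is to look at TiKV's per-thread CPU metrics; a sketch, assuming the metric names used by the standard TiKV Grafana dashboards and a placeholder Prometheus address:

```bash
# Per-instance CPU usage of the raftstore threads; a thread stuck near zero (or pegged at 100%)
# while the raft log lag keeps climbing is a hint that the raftstore side is wedged.
curl -G 'http://<prometheus-host>:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(tikv_thread_cpu_seconds_total{name=~"raftstore.*"}[1m])) by (instance)'
```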

However, restarting nodes carries operational risk, so a proper fix is still needed. Has anyone encountered a similar issue, and are there any other ideas for handling it?

| username: Billmay表妹 | Original post link

This issue appears to have been fixed in v7.1 and later. Does an upgrade not fit your current plans, so you are looking for other workarounds?

| username: Jellybean | Original post link

Yes, upgrading to a newer version would significantly improve TiKV's stability and performance. However, an upgrade is not feasible right now for business reasons, so we would like to identify the root cause of this issue and find temporary workarounds.

| username: Billmay表妹 | Original post link

Then let’s discuss it at the moderator exchange meeting.

| username: Jellybean | Original post link

After writing a large amount of data into the cluster, we found that all TiKV nodes showed varying degrees of raft log lag. From PD's scheduling operators we confirmed that scheduling to one particular store could not proceed normally, and there were also some DiskAlmostFull INFO logs.

It was eventually confirmed that a missing NVMe data disk had left that node's data directory on the root filesystem, creating a performance bottleneck on that single node, which in turn caused the issues above. We initially suspected other factors, but it turned out to be the missing disk.
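
The disk situation is easy to verify with standard Linux tools plus a grep of the TiKV log; a sketch, with the data directory and log path as placeholders:

```bash
# Confirm which filesystem the TiKV data directory actually lives on
# (data directory path is a placeholder).
df -h /path/to/tikv/data
lsblk

# Look for the disk-pressure hints mentioned above in the TiKV log
# (log path is a placeholder).
grep -i 'DiskAlmostFull' /path/to/tikv/log/tikv.log | tail
```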

| username: dba远航 | Original post link

It’s good that you’ve found the cause.

| username: Jellybean | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.