Issues with Raft Log Cleanup

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: raft log清理问题

| username: mono

[TiDB Usage Environment] Production Environment
[TiDB Version]: 5.4.2

Three servers, each running 2 TiKV instances, with each instance on its own hard drive. The raft log of one TiKV instance was found to occupy 160 GB of space, while the raft logs of the other 5 TiKV instances each occupy only a few gigabytes. The directory listing is below. What could be the reason, and how should it be handled?

-rw-r--r-- 1 root root 129M May 14 15:17 0000000000000001.rewrite
-rw-r--r-- 1 root root 15M Oct 18 09:38 0000000000000002.rewrite
-rw-r--r-- 1 root root 129M Sep 25 08:55 0000000000033474.raftlog
-rw-r--r-- 1 root root 129M Sep 25 09:35 0000000000033475.raftlog
-rw-r--r-- 1 root root 129M Sep 25 10:10 0000000000033476.raftlog
-rw-r--r-- 1 root root 129M Sep 25 10:40 0000000000033477.raftlog
-rw-r--r-- 1 root root 129M Sep 25 11:06 0000000000033478.raftlog
-rw-r--r-- 1 root root 129M Sep 25 11:32 0000000000033479.raftlog
-rw-r--r-- 1 root root 129M Sep 25 11:58 0000000000033480.raftlog
-rw-r--r-- 1 root root 129M Sep 25 12:21 0000000000033481.raftlog
-rw-r--r-- 1 root root 129M Sep 25 12:46 0000000000033482.raftlog
-rw-r--r-- 1 root root 129M Sep 25 13:10 0000000000033483.raftlog
-rw-r--r-- 1 root root 129M Sep 25 13:35 0000000000033484.raftlog
-rw-r--r-- 1 root root 129M Sep 25 14:02 0000000000033485.raftlog
-rw-r--r-- 1 root root 129M Sep 25 14:25 0000000000033486.raftlog
-rw-r--r-- 1 root root 129M Sep 25 14:50 0000000000033487.raftlog
-rw-r--r-- 1 root root 129M Sep 25 15:15 0000000000033488.raftlog
-rw-r--r-- 1 root root 129M Sep 25 15:43 0000000000033489.raftlog
-rw-r--r-- 1 root root 129M Sep 25 16:10 0000000000033490.raftlog
-rw-r--r-- 1 root root 129M Sep 25 16:39 0000000000033491.raftlog
-rw-r--r-- 1 root root 129M Sep 25 17:07 0000000000033492.raftlog
-rw-r--r-- 1 root root 129M Sep 25 17:35 0000000000033493.raftlog

| username: 像风一样的男子 | Original post link

You can tune a few parameters to control this.

| username: mono | Original post link

The configuration of every TiKV instance is the same. Only this one instance has the issue of retaining many raft logs without automatic cleanup; the other nodes are normal.

| username: 像风一样的男子 | Original post link

TiKV has two underlying RocksDB instances: one stores the Raft logs and the other stores the data. When TiKV writes data, it first appends the write to the Raft log, which is how the Raft protocol keeps the replicas on different TiKV instances consistent. If the logs grow too large, you can set parameters to have them cleaned up automatically.
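
For reference, these settings live under the [raftstore] section of the TiKV configuration file. A minimal sketch, assuming TiKV 5.x; the values here are illustrative, not tuned recommendations:

```toml
# tikv.toml -- illustrative values, not recommendations
[raftstore]
# How often the Raft log GC task is polled
raft-log-gc-tick-interval = "10s"
# Skip GC when a Region has fewer residual logs than this
raft-log-gc-threshold = 50
# Force GC once the residual log count of a Region exceeds this
raft-log-gc-count-limit = 72000
# Force GC once the residual log size of a Region exceeds this
raft-log-gc-size-limit = "768MB"
```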

| username: mono | Original post link

Yes. The current issue is that only one TiKV instance has retained over 1,000 raft log files, while the others have only a few dozen. I set raft-log-gc-count-limit to 200, but it had no effect. I'm not sure what is causing this.
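
For reference, whether a changed value has actually taken effect on each store can be checked from a TiDB session; a minimal sketch:

```sql
-- Show the effective Raft log GC settings on every TiKV instance
SHOW CONFIG WHERE type = 'tikv' AND name LIKE 'raftstore.raft-log-gc%';
```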

| username: 像风一样的男子 | Original post link

This is a TiKV configuration file parameter. After changing the file, you need to restart TiKV.
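
If the cluster is managed with TiUP, a typical sequence for a config-file change, with <cluster-name> as a placeholder:

```shell
# Persist the change in the cluster topology, then roll-restart only TiKV
tiup cluster edit-config <cluster-name>
tiup cluster reload <cluster-name> -R tikv
```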

| username: mono | Original post link

This TiKV parameter supports online modification.
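
For example, assuming the parameter is among those TiKV accepts dynamically, it can be changed from a TiDB session without a restart (note that such a change is not persisted to the configuration file):

```sql
-- Lower the residual log count limit on all TiKV instances online
SET CONFIG tikv `raftstore.raft-log-gc-count-limit` = 200;
```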

| username: tidb菜鸟一只 | Original post link

Why can’t I find this type of file on my machine…

| username: 随缘天空 | Original post link

There are several possible reasons for this:

  1. Frequent data writes: If data write operations on a particular TiKV instance are very frequent, the amount of raftlog writes will also increase accordingly. This may cause the raftlog of that instance to occupy more space.

  2. Uneven data distribution: If the data distribution in the cluster is uneven, a particular TiKV instance may be responsible for handling more data write requests, leading to its raftlog occupying more space.

To address this issue, you can consider the following methods:

  1. Check data distribution: Use TiDB Dashboard or other monitoring tools to check the data distribution in the cluster. Ensure that data is evenly distributed across all TiKV instances so that no single instance is overloaded (a query sketch follows after this list).

  2. Adjust scheduling strategy: If you find that a particular TiKV instance is overloaded, consider adjusting the scheduling strategy to distribute the load more evenly among other instances.

  3. Adjust TiKV configuration: Depending on the situation, you may need to adjust some TiKV configuration parameters, such as raft-log-gc-threshold and raft-log-gc-tick-interval, to control the size and cleanup strategy of the raft log.
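
For point 1, one quick way to see whether Regions are spread evenly across stores is to count Region peers per store from TiDB; a minimal sketch:

```sql
-- A heavily skewed peer count suggests uneven data distribution
SELECT STORE_ID, COUNT(*) AS PEER_COUNT
FROM information_schema.TIKV_REGION_PEERS
GROUP BY STORE_ID
ORDER BY PEER_COUNT DESC;
```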

| username: mono | Original post link

I checked again today, and the logs have already been cleaned up automatically. A few days ago we were updating data in batches, several billion rows in total. But the cluster has 3 replicas, so if the batch writes were the cause, the raft logs should have grown on at least 3 instances, yet it was just this one. Really strange!

| username: Kongdom | Original post link

My guess is that the data all landed on this instance, creating a hotspot.
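
If that is the suspicion, write hotspots can be inspected with pd-ctl; a sketch, with the PD address as a placeholder:

```shell
# List the Regions and stores currently receiving the hottest writes
tiup ctl:v5.4.2 pd -u http://<pd-host>:2379 hot write
```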