Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: 三台TIKV机器存储严重负载不均衡 (Severe storage load imbalance across three TiKV machines)
[TiDB Usage Environment] Production Environment
[TiDB Version] 5.2.3
[Encountered Problem: Phenomenon and Impact]
Severe load imbalance among three TiKV storage machines
Logically, TiKV should balance the load itself. Why is there such a significant difference among my three machines?
The distribution of leaders and regions in the TiKV-Details dashboard is balanced.
However, the problem remains that store-1 holds noticeably less data than the other stores.
There is a large table whose data sits on the store-1 node, and its compression ratio is much higher than that of other tables.
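For reference, one way to check how a particular table's regions (and their approximate size) are distributed across stores is to query information_schema from any TiDB node. This is only a sketch; the host, database, and table names are placeholders you would need to replace:

```shell
# Placeholders: <tidb-host>, <db>, <big_table>. Requires a MySQL client.
mysql -h <tidb-host> -P 4000 -u root -p -e "
  SELECT p.store_id,
         COUNT(*)                AS region_count,
         SUM(s.approximate_size) AS approx_size_mb
  FROM   information_schema.tikv_region_status s
  JOIN   information_schema.tikv_region_peers  p ON s.region_id = p.region_id
  WHERE  s.db_name = '<db>' AND s.table_name = '<big_table>'
  GROUP  BY p.store_id;"
```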

A large table? Doesn’t TiKV keep three replicas of the data? My store-1 has the least disk usage. How do you control the compression ratio, and why is the compression ratio on my store-1 higher?
Take a look at the TiKV region distribution on the tsp-prod-tidb-cluster-Overview monitoring dashboard.
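If Grafana is not handy, roughly the same per-store figures (leader count, region count, capacity, available space) can be pulled with pd-ctl. A quick sketch; the PD address is a placeholder:

```shell
# Assumes tiup is installed on the control machine; replace the PD address.
tiup ctl:v5.2.3 pd -u http://<pd-host>:2379 store
# Each store entry in the JSON output includes leader_count, region_count,
# capacity and available space, which makes per-store comparison easy.
```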
The leader and follower replicas in TiKV use the same compression algorithm, so why is the result different?
For now it looks like the compression ratio of some files on the leader is higher. This depends on how the underlying data is distributed and on RocksDB's behavior. Occasional fluctuations in data size are normal; the underlying storage engine compacts and reorganizes the data as needed.
So my situation is normal? It’s just that the data compression ratio is different.
Is this the one? They all look the same to me.
If the data is balanced, check if there are other files occupying disk space, such as logs.
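A quick way to see what is actually eating the disk on each TiKV host is plain du. The paths below are the tiup deployment defaults and may differ in your environment:

```shell
# Run on each TiKV host; adjust the paths to your deploy/data directories.
du -sh /tidb-deploy/tikv-20160/log /tidb-data/tikv-20160/* 2>/dev/null | sort -h
```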
Your database version (TiDB 5.2.3) is a bit old, so there may be some bugs. We are using TiDB 7.2 and have not run into this issue.
It is most likely still a GC bug. GC cleanup is not working properly on some nodes, so the logical data volume stays consistent while the disk space those nodes occupy remains large.
- Temporary solution: disable `gc.enable-compaction-filter` and restart the cluster (a config sketch follows this list).
- Permanent solution: upgrade the TiDB cluster to a newer version for a permanent fix.
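A minimal sketch of the temporary workaround via tiup, assuming a tiup-managed cluster (the cluster name is a placeholder):

```shell
# 1. Edit the topology and, under server_configs -> tikv, add:
#      gc.enable-compaction-filter: false
tiup cluster edit-config <cluster-name>

# 2. Push the new config and restart the TiKV nodes:
tiup cluster reload <cluster-name> -R tikv
```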
Try upgrading the cluster version.
It seems to be caused by GC; try upgrading the version.
You can check the Region health panel on the PD monitoring page to see whether there are any empty regions.
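Empty regions can also be listed from the command line with pd-ctl, for example (the PD address is a placeholder):

```shell
# List empty regions (regions that hold no data).
tiup ctl:v5.2.3 pd -u http://<pd-host>:2379 region check empty-region
```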
Version 5.2.3 is quite old; I suggest upgrading and trying again. It is also harder to find support for older versions.
There is probably a hotspot table. If it is smaller than 64 MiB, it can be turned into a cached table.
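Note that the cached-table feature referred to here (with its 64 MiB size limit) is `ALTER TABLE ... CACHE`, which is only available from TiDB v6.0 onward, so it would require the upgrade suggested above first. A hypothetical example with a made-up table name:

```shell
# Requires TiDB >= v6.0; 'test.hot_small_table' is a hypothetical table.
mysql -h <tidb-host> -P 4000 -u root -p -e "ALTER TABLE test.hot_small_table CACHE;"
# To undo the caching later:
mysql -h <tidb-host> -P 4000 -u root -p -e "ALTER TABLE test.hot_small_table NOCACHE;"
```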