Severe Load Imbalance in Three TiKV Machines

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 三台TIKV机器存储严重负载不均衡

| username: TiDBer_bOR8eMEn

[TiDB Usage Environment] Production Environment
[TiDB Version] 5.2.3

[Encountered Problem: Phenomenon and Impact]
Severe load imbalance among three TiKV storage machines


Logically, TiKV should balance the load itself. Why is there such a significant difference among my three machines?

| username: DBAER | Original post link

The distribution of leaders and regions in tikv-details is balanced.
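
For reference, a quick way to double-check this from SQL (connection parameters are placeholders for your own cluster; note that REGION_SIZE is a logical estimate, so it can be even across stores while disk usage still differs):

```shell
# Compare leader/region counts and logical region size per TiKV store.
# Connection parameters below are placeholders.
mysql -h 127.0.0.1 -P 4000 -u root -e "
  SELECT STORE_ID, ADDRESS, LEADER_COUNT, REGION_COUNT, REGION_SIZE, CAPACITY, AVAILABLE
  FROM INFORMATION_SCHEMA.TIKV_STORE_STATUS;"
```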

| username: TiDBer_bOR8eMEn | Original post link

However, the problem is precisely that store-1 holds significantly less data than the other two.

| username: chenhanneu | Original post link

A large table has been allocated to the store1 node, and its compression ratio is much higher than that of the other tables.
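For reference, a query like the following shows how one table's regions and data are spread across the stores (db_name, table_name, and the connection parameters are placeholders; APPROXIMATE_SIZE is a pre-compression estimate in MB, which is why it can look balanced even when on-disk usage is not):

```shell
# Per-store region count and approximate data size for a single table.
# db_name / table_name and connection parameters are placeholders.
mysql -h 127.0.0.1 -P 4000 -u root -e "
  SELECT p.STORE_ID,
         COUNT(*)                AS region_peers,
         SUM(s.APPROXIMATE_SIZE) AS approx_size_mb
  FROM INFORMATION_SCHEMA.TIKV_REGION_STATUS s
  JOIN INFORMATION_SCHEMA.TIKV_REGION_PEERS  p ON s.REGION_ID = p.REGION_ID
  WHERE s.DB_NAME = 'db_name' AND s.TABLE_NAME = 'table_name'
  GROUP BY p.STORE_ID;"
```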

| username: TiDBer_bOR8eMEn | Original post link

A large table? Doesn’t TiKV keep three replicas of the data? My store1 has the least disk usage. How do you control the compression ratio, and why would store1’s compression ratio be higher?

| username: 像风一样的男子 | Original post link

Take a look at the TiKV region distribution in the tsp-prod-tidb-cluster-Overview monitoring dashboard.
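
If Grafana is not handy, the same store-level numbers can also be pulled with pd-ctl, for reference (the version tag and PD address are placeholders):

```shell
# Show per-store capacity, available space, leader and region counts from PD.
tiup ctl:v5.2.3 pd -u http://<pd-host>:2379 store
```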

| username: chenhanneu | Original post link

TiKV primary and replicas use the same compression algorithm, so why is the effect different?

Currently it appears that some files on the primary have a higher compression rate. This depends on the distribution of the underlying data and on the RocksDB implementation. Occasional fluctuations in data size are normal, and the underlying storage engine adjusts the data as needed.

| username: TiDBer_bOR8eMEn | Original post link

So my situation is normal? It’s just that the data compression ratio is different.

| username: TiDBer_bOR8eMEn | Original post link

Is this the one? They all look the same to me.

| username: 像风一样的男子 | Original post link

If the data is balanced, check if there are other files occupying disk space, such as logs.
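
For example, something like this on each TiKV node shows what is actually taking the space (the paths are placeholders for a typical tiup deployment layout):

```shell
# Break down disk usage under the TiKV data and deploy directories.
# Adjust the paths to your own data-dir / deploy-dir; these are placeholders.
du -sh /data/tidb-data/tikv-20160/* /data/tidb-deploy/tikv-20160/log | sort -h
```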

| username: TiDBer_bOR8eMEn | Original post link

There are no extra logs. I checked; it’s just that the database sizes are inconsistent.

| username: hacker_77powerful | Original post link

Your database version is a bit old (TiDB 5.2.3), so there might be some bugs. We are using TiDB 7.2 and haven’t encountered this issue.

| username: tidb菜鸟一只 | Original post link

This is probably still the GC bug: GC cleanup has problems on some nodes, so the logical data volume stays consistent while the occupied disk space grows large.

1. Temporary workaround: disable gc.enable-compaction-filter and restart the cluster (see the sketch below).
2. Permanent fix: upgrade the TiDB cluster to a newer version.
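
A minimal sketch of the temporary workaround with tiup (the cluster name is a placeholder; note that the reload restarts the TiKV instances):

```shell
# 1. Open the cluster config for editing (cluster name is a placeholder)
tiup cluster edit-config <cluster-name>
#    and under server_configs -> tikv add:
#      gc.enable-compaction-filter: false
# 2. Push the change out and restart the TiKV nodes
tiup cluster reload <cluster-name> -R tikv
```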

| username: 友利奈绪 | Original post link

Try upgrading the cluster version.

| username: 这里介绍不了我 | Original post link

It seems to be caused by GC; try upgrading the version.

| username: erwadba | Original post link

You can check the Region health in the PD page to see if there are any empty regions.
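
If you’d rather check from the command line, pd-ctl can list them directly, for reference (the version tag and PD address are placeholders); large numbers of empty regions are normally cleaned up by PD’s region merge over time:

```shell
# List empty regions known to PD (PD address is a placeholder).
tiup ctl:v5.2.3 pd -u http://<pd-host>:2379 region check empty-region
```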

| username: 不想干活 | Original post link

Version 5.2.3 is quite old; I suggest upgrading and giving it a try. It’s also a bit difficult to find support for older versions.

| username: TIDB-Learner | Original post link

Check the documentation

| username: yytest | Original post link

There is probably a hotspot table. If it is smaller than 64 MB, it can be turned into a cached table.
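
For reference, this is the cached table feature, which only exists from TiDB 6.0 onward, so it would require upgrading from 5.2.3 first; the table name and connection parameters below are placeholders:

```shell
# Cache a small, hot table in TiDB memory (TiDB >= 6.0,
# table data must stay under 64 MB). Names and connection are placeholders.
mysql -h 127.0.0.1 -P 4000 -u root -e "
  ALTER TABLE db_name.small_hot_table CACHE;"
# To revert later:
# mysql -h 127.0.0.1 -P 4000 -u root -e "ALTER TABLE db_name.small_hot_table NOCACHE;"
```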