Issue of Disk Space Not Being Released After Deleting Large Tables in TiDB

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb删除大表磁盘空间不释放问题

| username: Hacker_cCDQit0H

[TiDB Usage Environment] Production Environment
[TiDB Version] 5.0.2
[Encountered Problem]
The cluster size has grown dramatically. Even after scaling out with several additional TiKV nodes, the available disk space on the TiKV nodes kept shrinking, and truncating the large business tables did not help. After manually restarting TiKV, a large amount of disk space was freed: about 5 TB released across three TiKV nodes. This happens only in this particular cluster. Has anyone encountered a similar case, and what might be causing it?

[Supplement]
The business SQL execution pattern changed from concurrent execution of single-row transactions:

set autocommit=0;
begin;
insert into xxx on duplicate key update;   -- row 1
commit;

to concurrent execution of 50-row transactions:

set autocommit=0;
begin;
insert into xxx on duplicate key update;   -- row 1
insert into xxx on duplicate key update;   -- row 2
...
insert into xxx on duplicate key update;   -- row 50
commit;

I wonder if this change caused the issue.

| username: xfworld | Original post link

What is the GC interval set to?

Have you run any DELETE statements?
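For reference, a minimal sketch of how to check the current GC settings and the GC safe point (assuming TiDB 5.x; the `mysql.tidb` rows and `tidb_gc_*` system variables are the usual places to look):

```sql
-- GC life time, run interval, safe point, and last run time kept by TiDB
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM mysql.tidb
WHERE VARIABLE_NAME LIKE 'tikv_gc%';

-- The same settings exposed as system variables in TiDB 5.0+
SHOW VARIABLES LIKE 'tidb_gc%';
```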

| username: TiDBer_pkQ5q1l0 | Original post link

Is there a timeout error for PD scheduling?

| username: Hacker_cCDQit0H | Original post link

The business side ran a TRUNCATE about ten minutes after the disk space became tight.

| username: xfworld | Original post link

Didn’t you set up alerting? It’s best to set a threshold on disk space…

Reclaiming space after a truncate takes some time; it isn’t instant.
If resources are insufficient, it’s recommended to add more nodes (as an emergency measure).

| username: dbaspace | Original post link

Is it a TiKV flow control issue? You can check the TiKV logs.

| username: Hacker_cCDQit0H | Original post link

We did set up the relevant alerts, and we scaled out TiKV whenever the alert threshold was reached. However, the data grew too fast (almost 1 TB per day) while the logical size of the business tables did not increase much, so the initial troubleshooting direction was unclear. Later, after the truncate, the space was still not reclaimed after 3–4 days, which led us to suspect a GC issue. Restarting the instances released the space. The cluster routinely runs DELETE operations and has never had GC problems before, so I wanted to ask whether anyone has run into something similar.
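For reference, a rough way to compare the logical table sizes TiDB reports with the physical space each TiKV store reports (a sketch using `information_schema`; the sizes in `TABLES` are estimates derived from statistics):

```sql
-- Largest tables by estimated logical size
SELECT TABLE_SCHEMA, TABLE_NAME,
       ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024 / 1024, 2) AS approx_size_gb
FROM information_schema.TABLES
WHERE LOWER(TABLE_SCHEMA) NOT IN ('mysql', 'information_schema', 'performance_schema', 'metrics_schema')
ORDER BY DATA_LENGTH + INDEX_LENGTH DESC
LIMIT 20;

-- Capacity and available space reported by each TiKV store
SELECT STORE_ID, ADDRESS, CAPACITY, AVAILABLE, REGION_COUNT, REGION_SIZE
FROM information_schema.TIKV_STORE_STATUS;
```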

| username: Hacker_cCDQit0H | Original post link

There are no related errors.

| username: Hacker_cCDQit0H | Original post link

Okay, thank you.

| username: xfworld | Original post link

Check the table’s health; too many MVCC versions can cause data to pile up. After GC runs, that space is released fairly quickly, so there is a balance between the two.

So my understanding is: in your scenario, the current node resources can’t keep up with the business write pressure. Is that right?

Another direction is space reclamation: after the truncate, check the table’s regions and the historical version data still held in those regions; a sketch of these checks follows below.
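For reference, a minimal sketch of the checks mentioned above (`db_name`/`table_name` are hypothetical placeholders; `SHOW STATS_HEALTHY`, `SHOW TABLE ... REGIONS`, and `TIKV_REGION_STATUS` are available in TiDB 5.x):

```sql
-- Statistics health of the table (low health usually means many modified rows)
SHOW STATS_HEALTHY WHERE Db_name = 'db_name' AND Table_name = 'table_name';

-- Regions currently serving the (new, post-truncate) table
SHOW TABLE db_name.table_name REGIONS;

-- Region-level view of approximate size and key count; large regions that no
-- longer map to an existing table may indicate data still awaiting GC/compaction
SELECT REGION_ID, DB_NAME, TABLE_NAME, APPROXIMATE_SIZE, APPROXIMATE_KEYS
FROM information_schema.TIKV_REGION_STATUS
ORDER BY APPROXIMATE_SIZE DESC
LIMIT 20;
```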

There is indeed a known bug in GC; you can look into it:

I suggest you upgrade to 5.0.6…

| username: Hacker_cCDQit0H | Original post link

Alright, thank you.

| username: Hacker_cCDQit0H | Original post link

The nodes indeed couldn’t keep up with the business write pressure, so they batched the transactions, haha.

| username: WalterWj | Original post link

If adjusting the GC time and related settings doesn’t resolve it, upgrading is recommended. It might be a bug in the GC-in-compaction-filter feature of the older version.
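For reference, a sketch of how to check whether GC runs in the RocksDB compaction filter on TiKV, and how to fall back to the traditional GC mode online (the config item is `gc.enable-compaction-filter`; verify that your exact version supports changing it online before applying):

```sql
-- Check whether compaction-filter GC is enabled on each TiKV instance
SHOW CONFIG WHERE type = 'tikv' AND name = 'gc.enable-compaction-filter';

-- Switch back to the traditional GC mode (online config change)
SET CONFIG tikv `gc.enable-compaction-filter` = FALSE;
```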

| username: Hacker_cCDQit0H | Original post link

Yes, thank you. I’ve finally decided to upgrade.

| username: 魔礼养羊 | Original post link

If possible, log in to the Dashboard and:

  1. Take screenshots of the CPU/memory/disk utilization of each node.
  2. Take screenshots of the warning and other alert logs.

A sudden surge of TB-level writes is probably not a product bug; it’s better to check the logs first.

| username: 特雷西-迈克-格雷迪 | Original post link

It’s best to upgrade to avoid numerous pitfalls.