Issues Encountered Due to Busy TiKV Garbage Collection

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIKV GC繁忙出见问题

| username: TiDBer_bOR8eMEn

GC is busy and requests to the cluster keep timing out. How can I solve this?

| username: Billmay表妹 | Original post link

Did you change any settings?

| username: TiDBer_bOR8eMEn | Original post link

I didn't change any settings. What should I do in this situation? The cluster keeps timing out.

| username: Billmay表妹 | Original post link

【TiDB Usage Environment】Production Environment / Testing / POC
【TiDB Version】
【Reproduction Path】What operations were performed when the issue occurred
【Encountered Issue: Issue Phenomenon and Impact】
【Resource Configuration】Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
【Attachments: Screenshots / Logs / Monitoring】

Describing the issue with the template above will help us pinpoint the problem faster.

| username: TiDBer_小阿飞 | Original post link

  1. Temporary solution: Disable gc.enable-compaction-filter and restart the cluster.
  2. Permanent solution: Upgrade the TiDB cluster to a version that contains the fix.

False "GC worker busy" state prevents space from being reclaimed after drop/truncate table

Problem Description:

If drop table or truncate table is executed while TiKV GC worker CPU usage is at 100%, TiKV may not reclaim the space after the table is deleted. Even after GC worker CPU usage drops, subsequent drop table or truncate table operations still do not reclaim space.

GitHub issue: False GcWorkerTooBusy caused by incorrect scheduled_tasks (tikv/tikv issue #11903)

Affected Versions:

v5.0.6, v5.1.3, v5.2.3, v5.3.0

Troubleshooting Steps:

  1. In TiDB monitoring, check whether the GC - Delete Range Failure OPM panel shows continuous send failures.
  2. Confirm in the TiDB logs that the reason for the Delete Range errors is “gc worker is too busy”.
  3. Based on how the bug works, check whether the TiKV GC worker CPU has been continuously at 100% at some point.
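
The log check in the troubleshooting steps can be sketched as a simple substring scan. This is a minimal sketch: the function name `count_busy_errors` is hypothetical, the sample lines are made up for illustration (they are not real TiDB log output), and the exact wording or casing of the message in your logs may differ from the text quoted above.

```python
def count_busy_errors(log_lines, needle="gc worker is too busy"):
    """Count lines that mention the busy error, ignoring case.

    `log_lines` can be any iterable of strings, for example an open file.
    """
    needle = needle.lower()
    return sum(1 for line in log_lines if needle in line.lower())


# Made-up lines for illustration only; not real TiDB log output.
sample = [
    "example line A: delete range failed: gc worker is too busy",
    "example line B: unrelated message",
    "example line C: GC worker is too busy",
]
print(count_busy_errors(sample))  # prints 2
```

To scan a real log, pass an open file object instead of the sample list. A count that keeps growing over time matches the symptom described here.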

Cause of the Problem:

The drop table and truncate table commands in TiDB send unsafe destroy range requests to TiKV to delete a range of data.

When the TiKV GC worker is busy, the number of pending tasks for the GC worker may reach its limit. If unsafe destroy range tasks are added at that point, the task counter may be incremented without ever being decremented.

After this happens enough times, the counter permanently exceeds the busy threshold of the GC worker. From then on, TiKV rejects all unsafe destroy range requests, so drop/truncate table operations fail to delete the data.
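
The counter behaviour described above can be illustrated with a toy model. This is a simplified sketch based only on the description in this post, not TiKV's actual code: the class `GcWorkerModel`, the `leak_on_reject` flag, and the limit of 4 are all hypothetical.

```python
class GcWorkerModel:
    """Toy model of a GC worker's pending-task counter (not TiKV code)."""

    def __init__(self, limit, leak_on_reject):
        self.limit = limit                    # hypothetical busy threshold
        self.leak_on_reject = leak_on_reject  # True models the buggy behaviour
        self.scheduled_tasks = 0              # number of pending tasks

    def submit(self):
        """Try to schedule one task; return True if it is accepted."""
        self.scheduled_tasks += 1
        if self.scheduled_tasks > self.limit:
            if not self.leak_on_reject:
                self.scheduled_tasks -= 1     # correct: undo the increment
            return False                      # rejected: worker is too busy
        return True

    def finish(self):
        """Mark one accepted task as completed."""
        self.scheduled_tasks -= 1


def simulate(leak_on_reject, limit=4, rejected_attempts=5):
    worker = GcWorkerModel(limit, leak_on_reject)
    # Busy period: fill the queue up to the limit.
    for _ in range(limit):
        assert worker.submit()
    # Requests that arrive while the worker is busy are all rejected.
    for _ in range(rejected_attempts):
        assert not worker.submit()
    # The worker catches up and finishes every accepted task.
    for _ in range(limit):
        worker.finish()
    # The worker is idle now; check the counter and try one more request.
    leftover = worker.scheduled_tasks
    accepted = worker.submit()
    return leftover, accepted


print(simulate(leak_on_reject=True))   # prints (5, False)
print(simulate(leak_on_reject=False))  # prints (0, True)
```

In the leaking variant the counter stays above the limit even though no task is pending, so every later request is rejected until the process restarts and the counter resets. That matches the restart workaround listed below.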

Workarounds:

  1. If the current TiKV GC worker CPU usage is not high, you can restart the TiKV instance to reset the erroneous counter and temporarily avoid the issue.
  2. Avoid executing drop table/truncate table operations when the TiKV GC worker CPU usage is high.

Fixed Versions:

v5.0.7, v5.1.4, v5.3.1, v5.4.0

Bugfix PR: https://github.com/tikv/tikv/pull/11904

| username: TiDBer_bOR8eMEn | Original post link

I got an error when restarting the cluster.

| username: DBAER | Original post link

Refer to the community article

| username: yytest | Original post link

Is it a production environment? If it reports that the worker is busy, I recommend checking resource usage at the operating system level.

| username: 小于同学 | Original post link

Upgrade it.

| username: TiDBer_bOR8eMEn | Original post link

Is this caused by a bug in my TiDB v5.2.3?

| username: yytest | Original post link

You can try restarting the TiDB server component.