GC is not working properly

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: gc 不能正常工作

| username: Hacker_ynbNppAC

【TiDB Usage Environment】Production Environment
【TiDB Version】4.0
【Reproduction Path】5 TiKV nodes; the data directories of 2 nodes were mistakenly deleted. One node had issues, so it was removed and is now Offline; I want it to reach Tombstone status so it can be deleted. Later, disk space ran low, so I truncated a large table, but GC has not run since.
【Resource Configuration】5 TiKV nodes, 3 PD nodes, 2 TiDB nodes; 2 TB per TiKV node
【Encountered Issues: Symptoms and Impact】GC fails at the resolve-locks step:

[ERROR] [gc_worker.go:787] ["[gc worker] resolve locks failed"] [uuid=5cb549336b40001] [safePoint=417520979457343488] [error="loadRegion from PD failed, key: \"t\x80\x00\x00\x00\x00\x01m\xcb_r\xf8\x00\x00\x00\x01\x8f\xd7;\", err: rpc error: code = Canceled desc = context canceled"] [errorVerbose="loadRegion from PD failed, key: \"t\x80\x00\x00\x00\x00\x01m\xcb_r\xf8\x00\x00\x00\x01\x8f\xd7;\", err: rpc error: code = Canceled desc = context canceled
github.com/pingcap/tidb/store/tikv.(*RegionCache).loadRegion
	github.com/pingcap/tidb@/store/tikv/region_cache.go:621
github.com/pingcap/tidb/store/tikv.(*RegionCache).findRegionByKey
	github.com/pingcap/tidb@/store/tikv/region_cache.go:358
github.com/pingcap/tidb/store/tikv.(*RegionCache).LocateKey
	github.com/pingcap/tidb@/store/tikv/region_cache.go:318
github.com/pingcap/tidb/store/tikv.(
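Since the failure is loadRegion from PD for one specific key, a quick check is whether PD can still locate a Region covering that key. A minimal sketch with pd-ctl, assuming a placeholder PD address; the hex string below is my re-encoding of the key from the log, so double-check the conversion, and if your 4.0 pd-ctl lacks --format=hex, fall back to --format=encode:

```bash
# Assumed PD endpoint; replace with a real PD address.
PD=http://127.0.0.1:2379

# Ask PD which Region covers the key from the GC error above.
# Hex re-encoding of t\x80\x00\x00\x00\x00\x01m\xcb_r\xf8\x00\x00\x00\x01\x8f\xd7;
# (-d runs pd-ctl in single-command mode instead of the interactive shell)
pd-ctl -u $PD -d region key --format=hex 748000000000016dcb5f72f8000000018fd73b
```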

I read an article that suggested modifying the region-cache-ttl parameter.
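For context: region-cache-ttl is a TiDB configuration file item, not a system variable. A sketch of where it would live, assuming a 4.0-era tidb.toml with the [tikv-client] section (verify against your version's configuration reference before changing it):

```toml
# tidb.toml excerpt (assumed layout for TiDB 4.0)
[tikv-client]
# Seconds a Region entry may live in TiDB's region cache before
# it must be reloaded from PD; 600 is the documented default.
region-cache-ttl = 600
```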

| username: Hacker_ynbNppAC | Original post link

The log is as follows

| username: wzf0072 | Original post link

How many replicas does your cluster have? A Region needs enough healthy replicas on other TiKVs for the Raft mechanism to keep serving reads and writes.

| username: wzf0072 | Original post link

The Online Unsafe Recovery feature is not available in version 4.0; there you would have to use the older procedure, "Forcing Region to recover service from a multi-replica failure state" (now deprecated).

| username: Hacker_ynbNppAC | Original post link

I have three replicas, and now the problem is that the GC worker is not functioning properly.

| username: Hacker_ynbNppAC | Original post link

After truncating the table, the data cannot be reclaimed through GC.

| username: 考试没答案 | Original post link

Check what the gc_safe_point was at the time. How long did you set the GC lifetime to? SHOW VARIABLES LIKE '%gc_life_time%';
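Beyond the variable, the GC worker's bookkeeping (safe point, leader, last run time) lives in the mysql.tidb table. A minimal sketch, assuming a TiDB instance on the default host/port:

```bash
# Assumed TiDB endpoint and user; adjust host/port/credentials.
mysql -h 127.0.0.1 -P 4000 -u root -e "
  SHOW VARIABLES LIKE '%gc_life_time%';
  SELECT VARIABLE_NAME, VARIABLE_VALUE
  FROM mysql.tidb
  WHERE VARIABLE_NAME LIKE 'tikv_gc%';"
```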

| username: 考试没答案 | Original post link

After truncating a large table, there will be many empty regions that need to be merged. You can also check the status of the regions using pd-ctl.
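For example (a sketch; the PD address is a placeholder): list the empty Regions left behind by the truncate, and look at the merge-related scheduling settings if merging seems slow.

```bash
# Assumed PD endpoint; replace with a real PD address.
PD=http://127.0.0.1:2379

# Regions that hold no data (candidates for merging after the truncate)
pd-ctl -u $PD -d region check empty-region

# Merge-related scheduling settings (merge-schedule-limit,
# max-merge-region-size, max-merge-region-keys)
pd-ctl -u $PD -d config show | grep -i merge
```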

| username: Hacker_ynbNppAC | Original post link

I currently have quite a few empty regions, and I also have a newly added node, so it has relatively little data.
Regions missing replicas (miss-peer): over 70,000
Regions with extra replicas (extra-peer): 12
Regions with pending replicas (pending-peer): 11
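Counts like these can be rechecked with pd-ctl's region check subcommands (a sketch; jq is only used to pull the count field, assuming the JSON output carries one the way plain region queries do):

```bash
# Assumed PD endpoint; replace with a real PD address.
PD=http://127.0.0.1:2379

pd-ctl -u $PD -d region check miss-peer    | jq '.count'   # missing replicas
pd-ctl -u $PD -d region check extra-peer   | jq '.count'   # extra replicas
pd-ctl -u $PD -d region check pending-peer | jq '.count'   # pending replicas
```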



| username: 考试没答案 | Original post link

Get the cluster status back to normal first; everything else should follow.

Also check whether scheduling is working properly.
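To see what scheduling is doing, pd-ctl can list the enabled schedulers and the operators currently in flight (a sketch; the PD address is a placeholder):

```bash
# Assumed PD endpoint; replace with a real PD address.
PD=http://127.0.0.1:2379

pd-ctl -u $PD -d scheduler show   # schedulers currently enabled
pd-ctl -u $PD -d operator show    # operators queued or executing right now
```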

| username: Hacker_ynbNppAC | Original post link

Are you referring to the region status? The store status is normal.

| username: 考试没答案 | Original post link

Check the status with display.

| username: 考试没答案 | Original post link

Check the status of the mysql.analyze_status table.

| username: Hacker_ynbNppAC | Original post link

There is no such command in pd-ctl.

| username: Hacker_ynbNppAC | Original post link

I couldn’t find this table in version 4.0.

| username: 考试没答案 | Original post link

tiup cluster display cluster_name

| username: Hacker_ynbNppAC | Original post link

I deployed with Kubernetes, and currently all nodes are functioning normally.
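For a cluster managed by TiDB Operator, the rough equivalent of tiup cluster display is to inspect the TidbCluster object and its pods (a sketch; the namespace is a placeholder):

```bash
# Assumed namespace; replace with the namespace your cluster runs in.
NS=tidb-cluster

kubectl get tidbcluster -n $NS    # overall status reported by TiDB Operator
kubectl get pods -n $NS -o wide   # per-component pod health
```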