GC is not working properly

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: gc 不能正常工作

| username: Hacker_ynbNppAC

【TiDB Usage Environment】Production Environment
【TiDB Version】4.0
【Reproduction Path】5 TiKV nodes; the data directories of 2 nodes were mistakenly deleted. One node had issues, so it was removed and is now Offline; I want it to reach Tombstone status so it can be deleted. Later, disk space ran low, so I truncated a large table, but GC has not run since.
【Resource Configuration】5 TiKV nodes, 3 PD nodes, 2 TiDB nodes; 2 TB per TiKV node
【Encountered Issues: Symptoms and Impact】GC fails at the resolve-locks step:

[ERROR] [gc_worker.go:787] ["[gc worker] resolve locks failed"] [uuid=5cb549336b40001] [safePoint=417520979457343488] [error="loadRegion from PD failed, key: \"t\x80\x00\x00\x00\x00\x01m\xcb_r\xf8\x00\x00\x00\x01\x8f\xd7;\", err: rpc error: code = Canceled desc = context canceled"] [errorVerbose="loadRegion from PD failed, key: \"t\x80\x00\x00\x00\x00\x01m\xcb_r\xf8\x00\x00\x00\x01\x8f\xd7;\", err: rpc error: code = Canceled desc = context canceled
github.com/pingcap/tidb/store/tikv.(*RegionCache).loadRegion
	github.com/pingcap/tidb@/store/tikv/region_cache.go:621
github.com/pingcap/tidb/store/tikv.(*RegionCache).findRegionByKey
	github.com/pingcap/tidb@/store/tikv/region_cache.go:358
github.com/pingcap/tidb/store/tikv.(*RegionCache).LocateKey
	github.com/pingcap/tidb@/store/tikv/region_cache.go:318
github.com/pingcap/tidb/store/tikv.(
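Since the failure is loadRegion from PD for one specific key, a quick check is whether PD can still locate a Region covering that key. A minimal sketch with pd-ctl, assuming a placeholder PD address; the hex string below is my re-encoding of the key from the log, so double-check the conversion, and if your 4.0 pd-ctl lacks --format=hex, fall back to --format=encode:

```bash
# Assumed PD endpoint; replace with a real PD address.
PD=http://127.0.0.1:2379

# Ask PD which Region covers the key from the GC error above.
# Hex re-encoding of t\x80\x00\x00\x00\x00\x01m\xcb_r\xf8\x00\x00\x00\x01\x8f\xd7;
# (-d runs pd-ctl in single-command mode instead of the interactive shell)
pd-ctl -u $PD -d region key --format=hex 748000000000016dcb5f72f8000000018fd73b
```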

I read an article that suggested modifying the region-cache-ttl parameter.
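For context: region-cache-ttl is a TiDB configuration file item, not a system variable. A sketch of where it would live, assuming a 4.0-era tidb.toml with the [tikv-client] section (verify against your version's configuration reference before changing it):

```toml
# tidb.toml excerpt (assumed layout for TiDB 4.0)
[tikv-client]
# Seconds a Region entry may live in TiDB's region cache before
# it must be reloaded from PD; 600 is the documented default.
region-cache-ttl = 600
```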

| username: Hacker_ynbNppAC | Original post link

The log is as follows

| username: wzf0072 | Original post link

How many replicas does your cluster have? A Region needs enough healthy replicas on other TiKVs for the Raft mechanism to keep serving reads and writes.

| username: wzf0072 | Original post link

The Online Unsafe Recovery feature is not available in version 4.0; there you would have to use the older procedure, "Forcing Region to recover service from a multi-replica failure state" (now deprecated).

| username: Hacker_ynbNppAC | Original post link

I have three replicas, and now the problem is that the GC worker is not functioning properly.

| username: Hacker_ynbNppAC | Original post link

After truncating the table, the data cannot be reclaimed through GC.

| username: 考试没答案 | Original post link

Check what the gc_safe_point was at the time. How long did you set the GC lifetime to? SHOW VARIABLES LIKE '%gc_life_time%';
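Beyond the variable, the GC worker's bookkeeping (safe point, leader, last run time) lives in the mysql.tidb table. A minimal sketch, assuming a TiDB instance on the default host/port:

```bash
# Assumed TiDB endpoint and user; adjust host/port/credentials.
mysql -h 127.0.0.1 -P 4000 -u root -e "
  SHOW VARIABLES LIKE '%gc_life_time%';
  SELECT VARIABLE_NAME, VARIABLE_VALUE
  FROM mysql.tidb
  WHERE VARIABLE_NAME LIKE 'tikv_gc%';"
```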

| username: 考试没答案 | Original post link

After truncating a large table, there will be many empty regions that need to be merged. You can also check the status of the regions using pd-ctl.
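For example (a sketch; the PD address is a placeholder): list the empty Regions left behind by the truncate, and look at the merge-related scheduling settings if merging seems slow.

```bash
# Assumed PD endpoint; replace with a real PD address.
PD=http://127.0.0.1:2379

# Regions that hold no data (candidates for merging after the truncate)
pd-ctl -u $PD -d region check empty-region

# Merge-related scheduling settings (merge-schedule-limit,
# max-merge-region-size, max-merge-region-keys)
pd-ctl -u $PD -d config show | grep -i merge
```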

| username: Hacker_ynbNppAC | Original post link

I currently have quite a few empty regions, and I also have a newly added node, so it has relatively little data.
Regions missing replicas (miss-peer): over 70,000
Regions with extra replicas (extra-peer): 12
Regions with pending replicas (pending-peer): 11
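Counts like these can be rechecked with pd-ctl's region check subcommands (a sketch; jq is only used to pull the count field, assuming the JSON output carries one the way plain region queries do):

```bash
# Assumed PD endpoint; replace with a real PD address.
PD=http://127.0.0.1:2379

pd-ctl -u $PD -d region check miss-peer    | jq '.count'   # missing replicas
pd-ctl -u $PD -d region check extra-peer   | jq '.count'   # extra replicas
pd-ctl -u $PD -d region check pending-peer | jq '.count'   # pending replicas
```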



| username: 考试没答案 | Original post link

Get the cluster status back to normal first; everything else should follow.

Also check whether scheduling is working properly.
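To see what scheduling is doing, pd-ctl can list the enabled schedulers and the operators currently in flight (a sketch; the PD address is a placeholder):

```bash
# Assumed PD endpoint; replace with a real PD address.
PD=http://127.0.0.1:2379

pd-ctl -u $PD -d scheduler show   # schedulers currently enabled
pd-ctl -u $PD -d operator show    # operators queued or executing right now
```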

| username: Hacker_ynbNppAC | Original post link

Are you referring to the region status? The store status is normal.

| username: 考试没答案 | Original post link

Check the status with display.

| username: 考试没答案 | Original post link

Check the status of the mysql.analyze_status table.

| username: Hacker_ynbNppAC | Original post link

There is no such command in pd-ctl.

| username: Hacker_ynbNppAC | Original post link

I couldn’t find this table in version 4.0.

| username: 考试没答案 | Original post link

tiup cluster display cluster_name

| username: Hacker_ynbNppAC | Original post link

I deployed with Kubernetes, and currently all nodes are functioning normally.
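For a cluster managed by TiDB Operator, the rough equivalent of tiup cluster display is to inspect the TidbCluster object and its pods (a sketch; the namespace is a placeholder):

```bash
# Assumed namespace; replace with the namespace your cluster runs in.
NS=tidb-cluster

kubectl get tidbcluster -n $NS    # overall status reported by TiDB Operator
kubectl get pods -n $NS -o wide   # per-component pod health
```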