Abnormalities After Re-enabling tidb_gc_enable Six Months Later

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb_gc_enable 关了半年之后,重新开启异常

| username: TiDBer_yangxi

【TiDB Usage Environment】Production Environment / Testing / PoC
【TiDB Version】
【Reproduction Path】What operations were performed when the issue occurred
【Encountered Issue: Problem Phenomenon and Impact】
【Resource Configuration】Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
【Attachments: Screenshots/Logs/Monitoring】
tidb_gc_enable has been turned off for half a year…
mysql> show status like '%gc%';
+-----------------------+-----------------------------------------------------------------------------------------------------------+
| Variable_name         | Value                                                                                                     |
+-----------------------+-----------------------------------------------------------------------------------------------------------+
| tidb_gc_last_run_time | 20231026-15:06:47.521 +0800                                                                               |
| tidb_gc_leader_desc   | host:sjpt-dbdc-tidb9.dc.wxxdc, pid:37499, start at 2023-11-16 18:59:16.797709803 +0800 CST m=+1.565898972 |
| tidb_gc_leader_lease  | 20240506-17:23:16.802 +0800                                                                               |
| tidb_gc_leader_uuid   | 62f5f240ab00003                                                                                           |
| tidb_gc_safe_point    | 20231026-14:56:47.521 +0800                                                                               |
+-----------------------+-----------------------------------------------------------------------------------------------------------+
5 rows in set (0.20 sec)

mysql> show variables like '%gc%';
+-------------------------------------+--------+
| Variable_name                       | Value  |
+-------------------------------------+--------+
| tidb_enable_gc_aware_memory_track   | OFF    |
| tidb_enable_gogc_tuner              | ON     |
| tidb_gc_concurrency                 | -1     |
| tidb_gc_enable                      | OFF    |
| tidb_gc_life_time                   | 10m0s  |
| tidb_gc_max_wait_time               | 86400  |
| tidb_gc_run_interval                | 10m0s  |
| tidb_gc_scan_lock_mode              | LEGACY |
| tidb_gogc_tuner_threshold           | 0.6    |
| tidb_server_memory_limit_gc_trigger | 0.7    |
+-------------------------------------+--------+
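For reference, turning GC back on is a single global variable change; a minimal sketch (run from any client session, it takes effect cluster-wide without a restart):

mysql> SET GLOBAL tidb_gc_enable = ON;         -- turn the GC worker back on
mysql> SHOW VARIABLES LIKE 'tidb_gc_enable';   -- should now report ON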
After re-enabling it:
mysql> show status like '%gc%';
+-----------------------+-----------------------------------------------------------------------------------------------------------+
| Variable_name         | Value                                                                                                     |
+-----------------------+-----------------------------------------------------------------------------------------------------------+
| tidb_gc_last_run_time | 20240506-17:31:16.779 +0800                                                                               |
| tidb_gc_leader_desc   | host:sjpt-dbdc-tidb9.dc.wxxdc, pid:37499, start at 2023-11-16 18:59:16.797709803 +0800 CST m=+1.565898972 |
| tidb_gc_leader_lease  | 20240506-18:54:16.802 +0800                                                                               |
| tidb_gc_leader_uuid   | 62f5f240ab00003                                                                                           |
| tidb_gc_safe_point    | 20240506-17:21:16.779 +0800                                                                               |
+-----------------------+-----------------------------------------------------------------------------------------------------------+
tidb@sjpt-dbdc-tidb7:~$ tiup ctl:v7.1.1 pd service-gc-safepoint
Starting component ctl: /home/tidb/.tiup/components/ctl/v7.1.1/ctl pd service-gc-safepoint
{
  "service_gc_safe_points": [
    {
      "service_id": "gc_worker",
      "expired_at": 9223372036854775807,
      "safe_point": 449573624683954176
    }
  ],
  "gc_safe_point": 445200048461185024
}
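The two safe points above are TSOs: the physical timestamp is the TSO shifted right by 18 bits (milliseconds since the Unix epoch). TiDB's TIDB_PARSE_TSO() decodes them directly; a quick check (assuming a +0800 session time zone):

mysql> SELECT TIDB_PARSE_TSO(449573624683954176);  -- 2024-05-06 17:21:16.779, the gc_worker service safe point, already current
mysql> SELECT TIDB_PARSE_TSO(445200048461185024);  -- 2023-10-26 14:56:47.521, the cluster-wide gc_safe_point, still stuck in October

So the GC worker has pushed its own service safe point forward, but the cluster-wide gc_safe_point still lags half a year behind it.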

Running tiup cdc:v7.1.1 cli --pd=http://10.2.***:2379 unsafe reset had no effect, and restarting PD didn't help either.
In the PD logs I see:
["failed to get safe point from pd"] [err_code=KV:Storage:Unknown] [err="Error(Other(\"[src/server/gc_worker/gc_worker.rs:80]: failed to get safe point from PD: Other(\"[components/pd_client/src/util.rs:427

tidb logs:
["[gc worker] delete ranges: got an error while trying to get store list from PD"] [uuid=62f5f240ab00003] [error="rpc error: code = Unavailable desc = not leader"]
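The "not leader" error suggests the GC worker's cached PD client was still pointing at a PD node that had lost leadership. One way to cross-check (a sketch reusing the same ctl invocation as above):

tidb@sjpt-dbdc-tidb7:~$ tiup ctl:v7.1.1 pd member

This prints the PD member list together with the current leader, which can be compared against the node the gc worker is complaining about.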

| username: TiDBer_yangxi | Original post link

After a major GC in the early morning, the safepoint advanced.
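One way to confirm the advance (a sketch; the GC worker persists its state under tikv_gc_* keys in the internal mysql.tidb table):

mysql> SELECT VARIABLE_NAME, VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME LIKE 'tikv_gc%';  -- tikv_gc_safe_point and tikv_gc_last_run_time should now show the new date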

| username: dba-kit | Original post link

Curious how much data finally got cleaned up. And apparently you can leave GC off for half a year without any issues…

| username: TiDBer_yangxi | Original post link

It dropped by about a quarter, roughly 900 GB.

| username: TIDB-Learner | Original post link

OP, please fill in the background and the consequences. Why was GC turned off for half a year, what requirement did that serve, and what was the impact? For example, with that much garbage left uncollected, what happens to data size and performance?

| username: 友利奈绪 | Original post link

Didn’t quite understand.

| username: 呢莫不爱吃鱼 | Original post link

You can actually run it like this?

| username: TiDBer_yangxi | Original post link

I forgot to turn it back on after turning it off during data synchronization.

| username: kkpeter | Original post link

I’m also curious, doesn’t this affect performance over such a long time?

| username: 健康的腰间盘 | Original post link

It’s hopeless. Light a few incense sticks and sincerely worship. Maybe the machine spirit will be pleased and it will come back to life.

| username: TiDBer_rvITcue9 | Original post link

The almighty reboot method

| username: Qiuchi | Original post link

That's impressive. So what made you notice it? Was the cluster getting slower and slower?

| username: 这里介绍不了我 | Original post link

:sweat_smile: Fortunately, GC is working normally.

| username: ziptoam | Original post link

That's quite a stress test for the automatic garbage collection mechanism.

| username: zhaokede | Original post link

The cluster must have had plenty of spare capacity if GC could stay off for six months before being turned back on.

| username: zhaokede | Original post link

A single GC run after that will likely take a very long time, especially with a large data volume and heavy write churn.

| username: 扬仔_tidb | Original post link

Looked at from another angle, TiDB is impressively robust: it took half a year before anyone even noticed a slowdown.

| username: Kongdom | Original post link

:muscle: I have to say TiDB is indeed stable. We had a TiKV node that was restarted and we forgot to turn off the firewall. As a result, even though the port was inaccessible, it ran normally for several months. We only discovered the issue because the GC backlog became too large.