Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: 失败的cdc同步任务阻塞gc (a failed CDC sync task is blocking GC)
[TiDB Usage Environment] Production Environment
[TiDB Version] 5.1.0
[Issue and Impact]
Early on, this cluster used the TiCDC component to replicate data to Kafka. After CDC was retired from use, the component was never scaled in.
Recently we noticed that the cluster's data volume looked abnormal: a backup was only a few GB, yet disk usage was as high as 3 TB.
While looking for the cause, I found that the GC process could not advance the GC safepoint. Checking the cdc changefeed list, I saw that the checkpoint of the stopped changefeed matched the time at which the GC safepoint had stalled.
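For reference, this kind of check can be done roughly as follows (the PD and TiDB addresses are placeholders):

# list the changefeeds TiCDC still knows about, including their checkpoints
tiup cdc:v5.1.0 cli changefeed list --pd=http://<PD_ADDRESS>:2379
# show the GC worker's bookkeeping, e.g. tikv_gc_safe_point and tikv_gc_last_run_time
mysql -h <TIDB_ADDRESS> -P 4000 -u root -p -e "SELECT VARIABLE_NAME, VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME LIKE 'tikv_gc_%';"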
I tried removing the changefeed directly: the command reported a successful deletion, but the changefeed reappeared a few seconds later. I have since scaled in the CDC component and restarted the TiDB node that held the GC leader. Even after the GC leader switched, GC still could not advance; the logs show it is still blocked.
My current plan is to restart the whole TiDB component, and if that does not help, to restart the entire cluster.
However, since the business has high-availability requirements on this cluster, I would first like to ask whether restarting only the TiDB or PD component can solve this problem.
You can check this article: 专栏 - 一场由TiCDC异常引发的GC不干活导致的Tikv硬盘使用问题 | TiDB 社区 ("A TiKV disk usage problem caused by GC stalling due to a TiCDC exception"). In particular, take a look at the GC leader's error log. Does it also contain a message like "[gc worker] there's another service in the cluster requires an earlier safe point. gc will continue with the earlier one"?
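If you are not sure which tidb-server is currently the GC leader, something like this should locate it and the relevant log line (addresses and the log path are placeholders):

# the GC leader is recorded in the GC worker's bookkeeping table
mysql -h <TIDB_ADDRESS> -P 4000 -u root -p -e "SELECT VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME = 'tikv_gc_leader_desc';"
# then, on that node, search the tidb-server log for the blocking message
grep "gc worker" /path/to/tidb/log/tidb.log | grep "earlier safe point"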
The issue is similar, but I still cannot remove this CDC task; even a forced removal doesn't work.
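For context, the removal attempts looked roughly like this (the changefeed ID and PD address are placeholders, and the --force variant is what I mean by "force" above, assuming the v5.1 cli still accepts that flag):

# normal removal: reports success, but the changefeed reappears a few seconds later
tiup cdc:v5.1.0 cli changefeed remove --pd=http://<PD_ADDRESS>:2379 --changefeed-id=<CHANGEFEED_ID>
# forced removal: same result in my case
tiup cdc:v5.1.0 cli changefeed remove --pd=http://<PD_ADDRESS>:2379 --changefeed-id=<CHANGEFEED_ID> --force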
After scaling in CDC this morning, GC still hasn't recovered.
I think restarting the cluster might help, but since the business demands high availability, I want to know if there’s a more reliable way.
PS:
The CDC component has been removed, and the changefeed list can no longer be queried through PD. The only remaining replication component in the cluster is a binlog drainer, whose savepoint is advancing normally.
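The leftover safepoint itself can be inspected directly in PD. Something like the following (PD address is a placeholder) lists every service GC safepoint; a "ticdc" entry pinned at an old TSO is what keeps the gc_worker from advancing:

# list all service GC safepoints registered in PD
tiup ctl:v5.1.0 pd -u http://<PD_ADDRESS>:2379 service-gc-safepoint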
You can try scaling CDC back out and then removing the CDC task.
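A rough sketch of that approach with tiup (cluster name, host, and addresses are placeholders):

# scale-cdc.yaml -- hypothetical topology for a single cdc server:
#   cdc_servers:
#     - host: 10.0.1.10
#       port: 8300
tiup cluster scale-out <CLUSTER_NAME> scale-cdc.yaml
# with a capture running again, retry removing the stuck changefeed
tiup cdc:v5.1.0 cli changefeed remove --pd=http://<PD_ADDRESS>:2379 --changefeed-id=<CHANGEFEED_ID>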
May I ask whether there are any other ways to deal with this stubborn changefeed that the cdc cli cannot delete? I will also take a look at the relevant code.
I suspect that this undeletable changefeed is a bug: after the deletion, the capture owner changes, the new owner keeps reporting gcTTL errors, and the faulty changefeed comes back again. I will open an issue to see whether there are similar cases.
If the TiCDC component has already been removed from the cluster, you can directly use the following command to clear the leftover TiCDC service GC safepoint:
tiup cdc:v5.1.0 cli unsafe reset --pd=<PD_ADDRESS>
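After that, you can check that the leftover entry is gone and that the safepoint starts to move again, for example (addresses are placeholders):

# the ticdc service safepoint should no longer appear here
tiup ctl:v5.1.0 pd -u http://<PD_ADDRESS>:2379 service-gc-safepoint
# and tikv_gc_safe_point should start advancing on the next GC run
mysql -h <TIDB_ADDRESS> -P 4000 -u root -p -e "SELECT VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME = 'tikv_gc_safe_point';"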
Thanks for the reply, I’ll give it a try.
Can this unblock GC? Do I still need to file an issue? Have there been similar cases before?
There have been such issues before, and they have been fixed in newer versions. You can refer to this article: 专栏 - 一场由TiCDC异常引发的GC不干活导致的Tikv硬盘使用问题 | TiDB 社区
Is this “unsafe reset” only for clearing the CDC’s GC information?
This should reinitialize all the CDC metadata, right?
No, it does not only clear the TiCDC GC information; it also removes all other TiCDC tasks (changefeeds).
This topic was automatically closed 1 minute after the last reply. No new replies are allowed.