TiCDC not synchronizing, checkpoint stuck

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: ticdc不同步了,checkpoint卡住不动

| username: TiDBer_pkQ5q1l0

[TiDB Usage Environment] Production Environment
[TiDB Version] 5.2.1
[Reproduction Path]
One of the upstream tidb-server nodes hit an OOM. After the OOM, TiCDC got stuck, apparently because of a bug. I then tried to delete the replication task (changefeed) and recreate it, but got this error: [CDC:ErrStartTsBeforeGC] fail to create changefeed because start-ts 440847256865996803 is earlier than GC safepoint at 440856954345881600. As a result, the task cannot be recreated. However, when I check, the actual gc_safe_point is 440847256865996803. Why can't the changefeed be created?

[Encountered Problem: Problem Phenomenon and Impact]
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]

| username: TiDBer_pkQ5q1l0

Additionally, while trying to fix the old replication task, I restarted CDC and the downstream TiDB server, but that had no effect. I then set gc_life_time to 48h, planning to redo the task the next day.

| username: db_user

The changefeed cannot be created because the start-ts has already fallen behind the GC safepoint. For example, if your original gc_life_time was 1 hour, anything older than 1 hour is already outside the GC window. You can run select * from mysql.tidb where variable_name like '%gc%'; and check whether tikv_gc_safe_point is before or after the start-ts. If it is after the start-ts, creating the task fails with this error.

You can refer to the following for the conversion between TSO and time:

SELECT TIDB_PARSE_TSO(@@tidb_current_ts);
SELECT conv(concat(bin(unix_timestamp('2022-01-06 12:30:59') * 1000),'000000000000000001'),2,10);
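The same conversion can be sketched outside the database. The Python below is a minimal sketch, assuming the usual TSO layout that the SQL above relies on (physical time in milliseconds in the high bits, an 18-bit logical counter in the low bits). The function names are illustrative and not part of any TiDB tool.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the low 18 bits of a TSO are the logical counter,
# matching the 18 binary digits appended in the SQL above.
LOGICAL_BITS = 18
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def split_tso(tso):
    """Return (physical_ms, logical) for a TSO value."""
    return tso >> LOGICAL_BITS, tso & ((1 << LOGICAL_BITS) - 1)


def tso_to_datetime(tso):
    """Return the physical part of a TSO as a UTC datetime."""
    physical_ms, _ = split_tso(tso)
    return EPOCH + timedelta(milliseconds=physical_ms)


def compose_tso(physical_ms, logical=0):
    """Build a TSO from a millisecond Unix timestamp and a logical counter."""
    return (physical_ms << LOGICAL_BITS) | logical


if __name__ == "__main__":
    start_ts = 440847256865996803  # start-ts from the error message in this thread
    print(split_tso(start_ts))
    print(tso_to_datetime(start_ts).isoformat())
```

For the start-ts in this thread, the physical part decodes to 2023-04-17 02:33:45 UTC, which is 10:33:45 in UTC+8.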

As for why the checkpoint stopped advancing, you would need to provide the relevant logs and monitoring for CDC, PD, TiKV, and TiDB before the cause can be determined.

| username: tidb菜鸟一只

By the time you modified gc_life_time, the gc_safe_point that CDC was holding had probably already expired. Your earliest gc_safe_point is now 440856954345881600, but CDC recorded 440847256865996803. The data in between has been lost and cannot be recovered.
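To put a number on that gap, here is a minimal Python sketch that compares the two values from the error message. It assumes the usual TSO layout (physical milliseconds above an 18-bit logical counter) and assumes that creation is rejected whenever start-ts is strictly earlier than the safepoint, which is inferred from the wording of the error rather than from the TiCDC source. The helper names are hypothetical.

```python
# Assumption: the low 18 bits of a TSO are the logical counter.
LOGICAL_BITS = 18


def physical_ms(tso):
    """Physical (wall-clock) part of a TSO, in milliseconds since the Unix epoch."""
    return tso >> LOGICAL_BITS


def start_ts_is_safe(start_ts, gc_safepoint):
    """Assumed rule: a changefeed can start only if start-ts is not earlier than the safepoint."""
    return start_ts >= gc_safepoint


def gap_ms(start_ts, gc_safepoint):
    """How far the safepoint has moved past start-ts, in milliseconds."""
    return physical_ms(gc_safepoint) - physical_ms(start_ts)


if __name__ == "__main__":
    start_ts = 440847256865996803      # start-ts from the error message
    gc_safepoint = 440856954345881600  # GC safepoint from the error message
    print(start_ts_is_safe(start_ts, gc_safepoint))
    print(gap_ms(start_ts, gc_safepoint) / 3_600_000, "hours")
```

With the values from this thread, the safepoint in the error is a little over 10 hours later than the start-ts, so the check fails.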

| username: TiDBer_pkQ5q1l0

The thing is, the gc_safe_point shown in my system is 20230417-10:33:45, which is exactly the moment CDC got stuck. In theory, if I convert that time into a TSO, replication should be able to pick up from there.

| username: TiDBer_pkQ5q1l0

Shouldn't a stuck CDC hold GC back for 24 hours?

| username: TiDBer_pkQ5q1l0

[Image from the original post; not available]

| username: db_user

For GC-related topics, you can check out these two articles:

| username: TiDBer_pkQ5q1l0

I couldn’t figure out the reason, so I just redid the recovery.