v6.5.0 TiCDC occasionally experiences long replication delays; it returns to normal immediately after restarting TiCDC

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: v6.5.0 ticdc同步偶尔存在长时间延迟。重启ticdc之后,立即恢复正常。

| username: erwadba

[TiDB Usage Environment] Production Environment
[TiDB Version] v6.5.0
[Reproduction Path] What operations were performed when the issue occurred
Not sure how to reproduce.
[Encountered Issue: Phenomenon and Impact]
Phenomenon: Normally there is almost no delay, with the longest delay being about 1 minute. However, TiCDC occasionally falls behind for a long time, with checkpointLag continuously increasing. After restarting TiCDC, the delay disappears immediately.

[2023/02/15 20:31:30.208 +08:00] [INFO] [replication_manager.go:551] ["schedulerv3: slow table"] [namespace=default] [changefeed=repl-to-tenxunguangzhou] [tableID=2873] [tableStatus="region_count:10 current_ts:439475054874722304 stage_checkpoints:<key:\"puller-egress\" value:<checkpoint_ts:439474753250787343 resolved_ts:439474716537520134 > > stage_checkpoints:<key:\"puller-ingress\" value:<checkpoint_ts:439474753250787343 resolved_ts:439475054834876421 > > stage_checkpoints:<key:\"sink\" value:<checkpoint_ts:439474716537520134 resolved_ts:439474716537520134 > > stage_checkpoints:<key:\"sorter-egress\" value:<checkpoint_ts:439474716537520134 resolved_ts:439474716537520134 > > stage_checkpoints:<key:\"sorter-ingress\" value:<checkpoint_ts:439474753250787343 resolved_ts:439474716537520134 > > barrier_ts:439475054834876418 "] [checkpointTs=439474716537520134] [resolvedTs=439474716537520134] [checkpointLag=21m30.706968307s]
[2023/02/15 20:31:30.209 +08:00] [INFO] [replication_manager.go:551] ["schedulerv3: slow table"] [namespace=default] [changefeed=repl-to-zp] [tableID=3082] [tableStatus="region_count:1 current_ts:439475054887567360 stage_checkpoints:<key:\"puller-egress\" value:<checkpoint_ts:439474977803337743 resolved_ts:439474977803337743 > > stage_checkpoints:<key:\"puller-ingress\" value:<checkpoint_ts:439474977803337743 resolved_ts:439474977803337743 > > stage_checkpoints:<key:\"sink\" value:<checkpoint_ts:439474977737801745 resolved_ts:439474977737801745 > > stage_checkpoints:<key:\"sorter-egress\" value:<checkpoint_ts:439474977737801745 resolved_ts:439474977737801745 > > stage_checkpoints:<key:\"sorter-ingress\" value:<checkpoint_ts:439474977737801745 resolved_ts:439474977737801745 > > barrier_ts:439475054834876418 "] [checkpointTs=439474977737801745] [resolvedTs=439474977803337743] [checkpointLag=4m54.307444085s]
[2023/02/15 20:31:30.209 +08:00] [INFO] [replication_manager.go:551] ["schedulerv3: slow table"] [namespace=default] [changefeed=repl-to-zp] [tableID=3040] [tableStatus="region_count:1 current_ts:439475054888091648 stage_checkpoints:<key:\"puller-egress\" value:<checkpoint_ts:439474977803337743 resolved_ts:439474977803337743 > > stage_checkpoints:<key:\"puller-ingress\" value:<checkpoint_ts:439474977803337743 resolved_ts:439474977803337743 > > stage_checkpoints:<key:\"sink\" value:<checkpoint_ts:439474977737801745 resolved_ts:439474977737801745 > > stage_checkpoints:<key:\"sorter-egress\" value:<checkpoint_ts:439474977737801745 resolved_ts:439474977737801745 > > stage_checkpoints:<key:\"sorter-ingress\" value:<checkpoint_ts:439474977737801745 resolved_ts:439474977737801745 > > barrier_ts:439475054834876418 "] [checkpointTs=439474977737801745] [resolvedTs=439474977803337743] [checkpointLag=4m54.307444085s]
[2023/02/15 20:31:30.209 +08:00] [INFO] [replication_manager.go:551] ["schedulerv3: slow table"] [namespace=default] [changefeed=repl-to-zp] [tableID=2887] [tableStatus="region_count:5 current_ts:439475054887567360 stage_checkpoints:<key:\"puller-egress\" value:<checkpoint_ts:439474753250787343 resolved_ts:439474732082659333 > > stage_checkpoints:<key:\"puller-ingress\" value:<checkpoint_ts:439474753250787343 resolved_ts:439475054873673746 > > stage_checkpoints:<key:\"sink\" value:<checkpoint_ts:439474732082659333 resolved_ts:439474732082659333 > > stage_checkpoints:<key:\"sorter-egress\" value:<checkpoint_ts:439474732082659333 resolved_ts:439474732082659333 > > stage_checkpoints:<key:\"sorter-ingress\" value:<checkpoint_ts:439474753250787343 resolved_ts:439474732082659333 > > barrier_ts:439475054834876418 "] [checkpointTs=439474732082659333] [resolvedTs=439474732082659333] [checkpointLag=20m31.407444085s]
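
For reference, the checkpointTs and resolvedTs values in these log lines are TiDB TSO timestamps. A minimal sketch of converting one to wall-clock time, assuming the standard TSO layout (Unix milliseconds in the upper bits, an 18-bit logical counter in the lower bits):

```python
from datetime import datetime, timezone, timedelta

def tso_to_datetime(tso: int) -> datetime:
    """Convert a TiDB TSO into a wall-clock timestamp (UTC+8 to match the log lines)."""
    physical_ms = tso >> 18  # the upper bits of a TSO are a Unix timestamp in milliseconds
    return datetime.fromtimestamp(physical_ms / 1000, tz=timezone(timedelta(hours=8)))

# checkpointTs of the first slow table in the logs above:
print(tso_to_datetime(439474716537520134))
# -> roughly 2023-02-15 20:09:59 +08:00, i.e. about 21.5 minutes behind the 20:31:30
#    log line, which matches checkpointLag=21m30.7s
```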

[Resource Configuration]
cdc.zip (612.6 KB)

[Attachments: Screenshots/Logs/Monitoring]
The attached log is the cdc owner log from 2023/02/15 20:00 to 21:20.

| username: weiyang | Original post link

Write a row of data upstream, then check downstream whether it really takes that long for the data to be replicated.
First, rule out any issue with the monitoring display.
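
A minimal sketch of such a probe, assuming both the upstream TiDB and the downstream are reachable over the MySQL protocol and that a throwaway table with a primary key (hypothetically `cdc_probe.heartbeat`) is already covered by the changefeed:

```python
import time
import pymysql  # pip install pymysql

# Hypothetical connection settings -- adjust to your environment.
UPSTREAM   = dict(host="upstream-tidb", port=4000, user="root", password="", database="cdc_probe")
DOWNSTREAM = dict(host="downstream-db", port=3306, user="root", password="", database="cdc_probe")

marker = int(time.time() * 1000)

# Write a marker row upstream.
up = pymysql.connect(**UPSTREAM)
with up.cursor() as cur:
    cur.execute("INSERT INTO heartbeat (id, ts) VALUES (%s, NOW())", (marker,))
up.commit()
up.close()

# Poll downstream until the marker row arrives, and report how long it took.
start = time.time()
while True:
    down = pymysql.connect(**DOWNSTREAM)
    with down.cursor() as cur:
        cur.execute("SELECT 1 FROM heartbeat WHERE id = %s", (marker,))
        found = cur.fetchone()
    down.close()
    if found:
        print(f"marker row replicated after {time.time() - start:.1f}s")
        break
    time.sleep(5)
```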

| username: erwadba | Original post link

Replication is currently normal with no delay. The large delay only happens occasionally; I'll run this test the next time it occurs.

| username: ealam_小羽 | Original post link

I'd like to ask: after TiCDC is restarted following a long delay, is there any data loss in the replicated data itself?

| username: erwadba | Original post link

I have verified several times using sync-diff-inspector. The results are consistent, so there should be no data loss.

| username: ealam_小羽 | Original post link

It may indeed be necessary to wait until the issue recurs to investigate further:
(1) As mentioned above, check whether there is an actual data delay; based on your description, that doesn't seem likely.
(2) Query the actual changefeed status on the machine where TiCDC is running; there might be a problem with how the status is reported (see the sketch after this list).
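
For (2), `cdc cli changefeed query` on the TiCDC node gives the current status. As a rough sketch of the same check in Python, assuming the TiCDC open API is enabled on its default port 8300 (the `/api/v1/changefeeds` path and the response field names are assumptions, so the code falls back to printing the raw entry):

```python
import json
import time
import urllib.request

# Assumption: the TiCDC open API is reachable locally on the default port.
CDC_API = "http://127.0.0.1:8300/api/v1/changefeeds"

with urllib.request.urlopen(CDC_API, timeout=5) as resp:
    changefeeds = json.load(resp)

now_ms = time.time() * 1000
for cf in changefeeds:
    ckpt_tso = cf.get("checkpoint_tso")  # field name is an assumption
    if ckpt_tso:
        lag_s = (now_ms - (ckpt_tso >> 18)) / 1000  # TSO upper bits = Unix ms
        print(f'{cf.get("id")}: state={cf.get("state")} checkpoint lag ~ {lag_s:.0f}s')
    else:
        print(json.dumps(cf, indent=2))
```

Comparing this lag with what Grafana shows would tell you whether the delay is real or only a reporting problem.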

| username: neilshen | Original post link

Hello, could you upload the complete TiCDC monitoring and logs of all TiCDC nodes?

| username: erwadba | Original post link

The TiCDC monitoring data for 2023/02/15 is no longer available; Prometheus did not retain data that far back. On 2023/02/23, between 19:00 and 20:00, there was another delay of about 30 minutes. I created a test table and inserted some data, which immediately cleared the delay without restarting TiCDC. I am uploading yesterday's logs and monitoring information.

cdc.zip (1.3 MB)

The attachment also contains the JSON file for TiCDC's Grafana dashboard, exported according to this document:
[FAQ] Export and Import of Grafana Metrics Page - Operations Guide / TiDB Common FAQ - TiDB Q&A Community (asktug.com)

There were indeed large transactions in the database at that time. I will first deal with the large transactions and then observe whether occasional large delays still occur (one way to spot such transactions is sketched below).
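
One possible way to spot long-running or large transactions while the lag is building, assuming the upstream TiDB is reachable over the MySQL protocol and that `INFORMATION_SCHEMA.CLUSTER_TIDB_TRX` is available in your version (it should be in v6.5); a sketch, not a definitive check:

```python
import pymysql  # pip install pymysql

# Hypothetical upstream TiDB connection -- adjust to your environment.
conn = pymysql.connect(host="upstream-tidb", port=4000, user="root", password="")
with conn.cursor(pymysql.cursors.DictCursor) as cur:
    # List currently open transactions, oldest first; long-running or very large
    # ones are the usual suspects for a growing TiCDC checkpoint lag.
    cur.execute(
        "SELECT * FROM information_schema.cluster_tidb_trx "
        "ORDER BY start_time LIMIT 20"
    )
    for row in cur.fetchall():
        print(row)
conn.close()
```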

| username: Minorli-PingCAP | Original post link

TiCDC latency is generally caused by large transactions. The immediate recovery after a restart that you describe sounds like the data had already been replicated, but there was a problem with the status that CDC reports back to the monitoring system.

| username: sdojjy | Original post link

No: the replication progress is stored in etcd, and after a restart TiCDC pulls data again starting from that checkpoint.