Note:
This topic has been translated from a Chinese forum by GPT and might contain errors. Original topic: ticdc执行极其缓慢,几个小时了才同步几分钟的数据 (TiCDC runs extremely slowly: after several hours it has replicated only a few minutes of data)

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version]
Upstream TiDB version v6.1.1, TiCDC also v6.1.1
Downstream TiDB version v7.1.2
[Reproduction Path] What operations were performed to encounter the issue
Both TiDB clusters run on physical machines with SSDs. The downstream TiDB serves no requests, while the upstream serves business traffic. The ping latency between the two clusters is 50 ms.
We want to migrate all upstream data to the downstream TiDB cluster. The full import has already been completed with Dumpling + TiDB Lightning. Since the full import finished, about one day of incremental data has accumulated, and we are replicating it with TiCDC.
The data to replicate spans 3 databases and is handled by 2 TiCDC tasks, with the one large database assigned its own TiCDC task.
[Encountered Issue: Problem Phenomenon and Impact]
The TiCDC task for the large database is replicating very slowly and cannot keep up with the incremental data produced by the upstream business. Its Changefeed checkpoint lag keeps growing.
- Changefeed checkpoint lag: This metric represents the data replication delay between the upstream TiDB cluster and the downstream system, measured in time units. This metric reflects whether the overall data synchronization status of the Changefeed is healthy. Generally, the smaller the lag, the better the synchronization task status. When the lag increases, it usually indicates that the synchronization capability of the Changefeed or the consumption capability of the downstream system cannot match the write speed of the upstream.
- Changefeed resolved ts lag: This metric represents the data delay between the upstream TiDB cluster and the TiCDC node, measured in time units. This metric can reflect the ability of the Changefeed to pull data changes from the upstream. When the lag increases, it indicates that the Changefeed cannot pull the data changes generated by the upstream in a timely manner.
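Both lags above can also be checked by hand from the raw timestamps. The following is a minimal sketch, assuming you have copied a checkpoint or resolved TSO value from TiCDC (for example from `cdc cli changefeed query` output). It relies on the TiDB TSO layout, where the low 18 bits are a logical counter and the remaining high bits are a physical Unix timestamp in milliseconds. The function names and the sample values are illustrative, not part of any TiCDC API.

```python
from datetime import datetime, timezone

# TiDB TSO layout: high bits = physical Unix time in milliseconds,
# low 18 bits = logical counter.
LOGICAL_BITS = 18


def compose_tso(physical_ms: int, logical: int = 0) -> int:
    """Build a TSO from a physical millisecond timestamp (sample data only)."""
    return (physical_ms << LOGICAL_BITS) | logical


def tso_to_datetime(tso: int) -> datetime:
    """Return the wall-clock time encoded in a TSO, in UTC."""
    physical_ms = tso >> LOGICAL_BITS
    return datetime.fromtimestamp(physical_ms / 1000, tz=timezone.utc)


def lag_seconds(tso: int, now_ms: int) -> float:
    """Lag in seconds between a checkpoint/resolved TSO and the time now_ms."""
    return (now_ms - (tso >> LOGICAL_BITS)) / 1000


if __name__ == "__main__":
    # Hypothetical values: "now" and a checkpoint that is two hours behind.
    now_ms = 1_700_000_000_000
    checkpoint_tso = compose_tso(now_ms - 2 * 60 * 60 * 1000)
    print(tso_to_datetime(checkpoint_tso).isoformat())
    print(f"checkpoint lag: {lag_seconds(checkpoint_tso, now_ms):.0f} s")
```

If the resolved ts lag computed this way stays small while the checkpoint lag keeps growing, the bottleneck is more likely on the path to the downstream than in pulling changes from the upstream.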
Another observation: within the TiCDC task for the large database, one of the TiCDC nodes is keeping up, with its checkpoint lag staying at 2 seconds, while the other node needs several hours to replicate a few minutes of data.
Currently, there are no abnormalities in the logs, and the load on the downstream cluster is very low.
The TiCDC configuration is as follows, with other settings being default:
force-replicate = true
[mounter]
worker-num = 16
I would like to know what is causing this and whether anything can be optimized.
Is it related to the 50 ms latency between the two clusters?
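One way to reason about the latency question is a back-of-the-envelope ceiling. The sketch below assumes each downstream writer applies transactions one at a time and each transaction costs at least one network round trip. The writer count (16), the round trips per transaction, and the upstream rate are hypothetical inputs, not values taken from this cluster, and the model is a simplification rather than a description of how TiCDC actually schedules writes.

```python
def max_txn_per_second(rtt_ms: float, workers: int, round_trips_per_txn: int = 1) -> float:
    """Upper bound on downstream transactions/s when every transaction
    waits for round_trips_per_txn sequential network round trips."""
    if rtt_ms <= 0 or workers <= 0 or round_trips_per_txn <= 0:
        raise ValueError("all inputs must be positive")
    per_worker = 1000.0 / (rtt_ms * round_trips_per_txn)
    return workers * per_worker


def catch_up_hours(backlog_hours: float, upstream_tps: float, ceiling_tps: float) -> float:
    """Hours needed to drain a backlog while the upstream keeps writing.
    Returns infinity when the ceiling does not exceed the upstream rate."""
    if ceiling_tps <= upstream_tps:
        return float("inf")
    return backlog_hours * upstream_tps / (ceiling_tps - upstream_tps)


if __name__ == "__main__":
    rtt_ms = 50  # ping latency reported in this post
    for trips in (1, 2, 3):
        ceiling = max_txn_per_second(rtt_ms, workers=16, round_trips_per_txn=trips)
        print(f"{trips} round trip(s)/txn: at most {ceiling:.0f} txn/s")
    # Hypothetical: one day of backlog, upstream at 200 txn/s, ceiling 320 txn/s.
    print(catch_up_hours(24, 200, 320), "hours to catch up")
```

Under these assumptions, a 50 ms round trip caps each serial writer at 20 transactions per second, so the ceiling scales with the number of concurrent writers and with how many rows are batched into each downstream transaction. If the upstream write rate on the lagging tables exceeds that ceiling, the lag grows even though the downstream cluster itself looks idle.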
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]