TiCDC Execution Extremely Slow, Only a Few Minutes of Data Synchronized After Several Hours

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: ticdc执行极其缓慢,几个小时了才同步几分钟的数据

| username: TiDBer_Q6zIfbhF

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version]
Upstream TiDB version 6.1.1, TiCDC also 6.1.1
Downstream TiDB version v7.1.2

[Reproduction Path] What operations were performed to encounter the issue
Both TiDB clusters run on physical machines with SSDs. The downstream TiDB receives no requests, while the upstream serves business traffic, and the ping latency between the two clusters is 50 ms.
We want to migrate all of the upstream data to the downstream TiDB. We have already done the full migration with Dumpling + TiDB Lightning; after the full migration completed, there is about one day of incremental data that we are replicating with TiCDC.
The data being replicated consists of 3 databases, split across 2 TiCDC changefeeds, with the large database handled by a TiCDC changefeed of its own.
[Encountered Issue: Problem Phenomenon and Impact]
The current issue is that the TiCDC process for the large database is synchronizing very slowly and cannot keep up with the incremental data from the upstream business. The Changefeed checkpoint lag is getting larger and larger.

  • Changefeed checkpoint lag: This metric represents the data replication delay between the upstream TiDB cluster and the downstream system, measured in time units. This metric reflects whether the overall data synchronization status of the Changefeed is healthy. Generally, the smaller the lag, the better the synchronization task status. When the lag increases, it usually indicates that the synchronization capability of the Changefeed or the consumption capability of the downstream system cannot match the write speed of the upstream.
  • Changefeed resolved ts lag: This metric represents the data delay between the upstream TiDB cluster and the TiCDC node, measured in time units. This metric can reflect the ability of the Changefeed to pull data changes from the upstream. When the lag increases, it indicates that the Changefeed cannot pull the data changes generated by the upstream in a timely manner.
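
For reference, one way to put concrete numbers on these two lags is to query the changefeed from the command line; the PD address and changefeed ID below are placeholders:

    # List all changefeeds together with their current checkpoints
    cdc cli changefeed list --pd=http://<pd-host>:2379

    # Query one changefeed in detail; comparing checkpoint-time with the
    # current time gives the checkpoint lag described above
    cdc cli changefeed query --pd=http://<pd-host>:2379 --changefeed-id=<changefeed-id>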

Another point is that in the changefeed for the large database, one of the TiCDC nodes is keeping up, with its checkpoint lag holding at around 2 seconds, while the other node takes several hours to replicate only a few minutes' worth of data.

Currently, there are no abnormalities in the logs, and the load on the downstream cluster is very low.
The TiCDC configuration is as follows, with other settings being default:
force-replicate = true
[mounter]
worker-num = 16

I want to know what the reason is and if there are any optimizations that can be made.
Is it related to the 50ms delay between the two clusters?
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]

| username: TiDBer_Q6zIfbhF | Original post link

At present, it seems that the TiCDC process is not completely stuck, just extremely slow. It has been syncing for more than 20 hours, but only about 3 hours' worth of data has been processed, and the lag keeps growing.

| username: TiDBer_Q6zIfbhF | Original post link

The default value of tidb_enable_clustered_index is INT_ONLY in version 5.0.0-5.1.0, and ON in version 5.2.0 and later.
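
To see what the upstream and downstream clusters are actually using, the setting and the resulting table layout can be checked directly (the database and table names below are placeholders):

    -- Check the current setting on each cluster
    SHOW VARIABLES LIKE 'tidb_enable_clustered_index';

    -- Check whether a given table was actually created with a clustered index
    SHOW CREATE TABLE <db>.<table>;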

| username: TiDBer_Q6zIfbhF | Original post link

  • Sink flush rows/s: The number of data changes output to the downstream per second by the Sink module in the TiCDC node. This metric reflects the synchronization rate of the sync task to the downstream. When Sink flush rows/s is less than Puller output events/s, the synchronization delay may increase.

| username: Fly-bird | Original post link

How is the write efficiency downstream?

| username: TiDBer_Q6zIfbhF | Original post link

The QPS is very low, only a few hundred, and there are no slow SQL queries. The machine’s resources also seem to be idle, and there are no errors in the logs…

| username: dba远航 | Original post link

Manually transfer a larger file from the upstream server to the downstream server to check whether the transfer speed is normal; that will help you tell whether the bottleneck is the network or the TiDB configuration.
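
A quick way to measure the raw link throughput, assuming iperf3 is installed on both hosts (copying a large file over scp also works, but is additionally limited by SSH encryption overhead):

    # On the downstream server: start an iperf3 server
    iperf3 -s

    # On the upstream server: run a 10-second throughput test against it
    iperf3 -c <downstream-ip> -t 10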

| username: TiDBer_Q6zIfbhF | Original post link

The speed is around 9 MB/s. Could it be related to this?

The configuration of TiDB is all default settings.

| username: TiDBer_Q6zIfbhF | Original post link

Strange, there is a sudden drop here, and then it starts to rise again.

| username: tidb菜鸟一只 | Original post link

The file transfer speed is only 9 MB/s. Are the machines in the two clusters not in the same place? Why is it so slow?

| username: andone | Original post link

Is a large transaction being executed?

| username: zhaokede | Original post link

The network speed between the two clusters is slow; you could also check the I/O situation on the servers in both clusters.
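
For example, disk and network health can be sampled on the TiCDC and downstream TiDB hosts with standard Linux tools (nothing TiDB-specific here):

    # Per-device disk utilization and latency, sampled every second, 5 times
    iostat -x 1 5

    # Round-trip latency between the two clusters
    ping -c 20 <downstream-host>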

| username: TiDBer_Q6zIfbhF | Original post link

They are not co-located; they are very far apart, with about 50 ms of ping latency between them.
We have now found that once the upstream goes above 1000 TPS, replication becomes particularly slow.

Is there any way or configuration to speed it up? Otherwise, it can’t keep up.

| username: TiDBer_Q6zIfbhF | Original post link

The IO is relatively low. It has been found that TiCDC’s synchronization is completely unable to keep up with the upstream writes, which are over 1000 TPS.

| username: tidb菜鸟一只 | Original post link

The version mismatch makes this a bit awkward. If the two clusters were the same version, I would suggest deploying TiCDC on the downstream side; with different versions, deploying it downstream might cause issues.

| username: Jellybean | Original post link

We ran into a similar issue not long ago; the replication throughput was only around 5k QPS at the time. After optimization it can reach 50k.

By increasing the per-table-memory-quota parameter of the TiCDC process and raising the worker-count concurrency of the Sink that TiCDC uses to write to the downstream TiDB cluster, we got a significant speedup.
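
Roughly what that looks like, as a sketch only: exact parameter names and defaults differ between TiCDC versions (per-table-memory-quota lives in the TiCDC server config in older releases, while worker-count and max-txn-row are query parameters on the MySQL/TiDB sink URI), and the values below are examples to tune, not recommendations:

    # In the TiCDC server config (e.g. ticdc.toml), value in bytes; restart the cdc servers afterwards
    per-table-memory-quota = 104857600   # example: ~100 MiB buffer per table

    # Pause the changefeed, raise the sink concurrency via the sink URI, then resume
    cdc cli changefeed pause  --pd=http://<pd-host>:2379 --changefeed-id=<changefeed-id>
    cdc cli changefeed update --pd=http://<pd-host>:2379 --changefeed-id=<changefeed-id> \
        --sink-uri="mysql://<user>:<password>@<downstream-host>:4000/?worker-count=64&max-txn-row=1024"
    cdc cli changefeed resume --pd=http://<pd-host>:2379 --changefeed-id=<changefeed-id>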

| username: Jellybean | Original post link

You can refer to my example and apply the same optimization directly.

| username: TiDBer_Q6zIfbhF | Original post link

Which document contains per-table-memory-quota? The downstream is TiDB, and I can’t find this parameter.

Is my version too low? ticdc v5.1.2

| username: Jellybean | Original post link

It wasn't documented for the lower versions; we found it by reading the source code. The documentation was only added in the later versions.