TiCDC Synchronization Lag Increasing

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: ticdc同步差距越来越大 (the TiCDC replication gap keeps growing)

| username: TiDBer_38aUQ8Ol

[TiDB Usage Environment] Production Environment
[TiDB Version] 6.3.0
[Reproduction Path] Export a backup from the primary cluster with Dumpling, import it into the secondary cluster with TiDB Lightning, then configure TiCDC for ongoing replication (a sketch of this workflow follows below).
[Encountered Problem: Phenomenon and Impact] The replication lag keeps increasing. When I created the changefeed with an explicit start TSO, there was a nearly 8-hour gap between the backup TSO and the latest TSO. If no TSO is specified, the gap shown in the changefeed's TSO also keeps growing.
[Attachment: Screenshot/Logs/Monitoring]
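
For reference, a minimal sketch of the workflow described above; host names, ports, credentials, and the changefeed ID are placeholders:

```shell
# 1. Export from the primary cluster with Dumpling; the output
#    directory's "metadata" file records the snapshot TSO ("Pos" field).
tiup dumpling -h <primary-tidb-host> -P 4000 -u root -p '<password>' -o /data/backup
grep "Pos" /data/backup/metadata   # this TSO becomes --start-ts below

# 2. Import into the secondary cluster with TiDB Lightning.
tiup tidb-lightning -config tidb-lightning.toml

# 3. Create the changefeed starting from the backup TSO so that writes
#    made between export and changefeed creation are not lost.
#    Note: the start TSO must still be inside the upstream GC window
#    (tidb_gc_life_time), otherwise changefeed creation fails.
tiup cdc cli changefeed create \
  --server=http://<cdc-host>:8300 \
  --sink-uri="mysql://<user>:<password>@<secondary-tidb-host>:4000/" \
  --changefeed-id="primary-to-secondary" \
  --start-ts=<TSO-from-metadata>
```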

| username: TiDBer_38aUQ8Ol | Original post link

Because the import involves about 170 GB of data, exporting from the primary and then importing into the new replica will inevitably produce a TSO gap. In a new test I ignored the TSO gap and created the task directly with the default TSO, and the phenomenon is the same. As shown in the image below, the second task is normal while the first task has issues.
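
To compare the two tasks, each changefeed's state and checkpoint can be inspected from the command line (the address and ID below are placeholders):

```shell
# List all changefeeds with their state and checkpoint time.
tiup cdc cli changefeed list --server=http://<cdc-host>:8300

# Query one changefeed in detail; on a healthy task the
# checkpoint time stays close to the current time.
tiup cdc cli changefeed query --server=http://<cdc-host>:8300 -c <changefeed-id>
```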

| username: Daniel-W | Original post link

If no TSO is specified, doesn't the changefeed just start from the current TSO?

| username: TiDBer_38aUQ8Ol | Original post link

Yes. Exporting the data from the primary and importing it into the secondary takes three hours (using TiDB Lightning). I then started the CDC task with a manually specified TSO (the TSO recorded at export time) and found that the replication lag kept increasing. So I ran a test without specifying a TSO, figuring that losing some data would be acceptable as long as it could keep up with the current TSO in real time, but the result was not as expected: the lag still kept growing.
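
For context on how lag is measured here: a TSO embeds a physical timestamp in its upper bits, so the gap between a changefeed's checkpoint TSO and the current TSO translates directly into wall-clock lag. A quick way to decode a TSO (the TSO value and connection parameters are placeholders):

```shell
# TIDB_PARSE_TSO() converts a TSO into a readable timestamp; the
# physical part of a TSO is (tso >> 18) milliseconds since the epoch,
# so the checkpoint-to-current gap maps directly to wall-clock lag.
mysql -h <tidb-host> -P 4000 -u root -e "SELECT TIDB_PARSE_TSO(<checkpoint-tso>);"
```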

| username: zhanggame1 | Original post link

So the latency keeps increasing regardless of how you export the data, right?

| username: TiDBer_38aUQ8Ol | Original post link

What I mean is: could the time gap between export and import be what makes the CDC lag grow? After testing, though, that does not seem to be the issue. Now I'm not sure what is causing the lag to grow, and it keeps getting worse.

| username: Daniel-W | Original post link

Your status is normal. Based on your description, when the gap keeps growing after starting the task with a specified TSO, is the checkpoint TSO advancing slowly, or has it stopped completely?

We need to check the Grafana changefeed monitoring, TiKV logs, and CDC logs for any relevant records to analyze the specific reason.
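
As a starting point, a simple way to scan those logs for relevant records (the log paths below are typical tiup-deployed defaults and may differ in your deployment):

```shell
# Scan the TiCDC log for errors or warnings after the task was created.
grep -iE "error|warn" /tidb-deploy/cdc-8300/log/cdc.log | tail -n 50

# Scan the upstream TiKV logs for CDC-related errors.
grep -i "cdc" /tidb-deploy/tikv-20160/log/tikv.log | grep -iE "error|warn" | tail -n 50
```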

| username: TiDBer_38aUQ8Ol | Original post link

The status is all normal, but the TSO advances very slowly. The figure below is a screenshot of the changefeed monitoring after creating a task with the default TSO: the lag was small shortly after the task was created, and the red box on the right shows the latest lag has already reached 14 hours. The task was created on October 10th, and the lag has been growing ever since.

I checked the TiKV logs (these should be the upstream cluster's TiKV logs, right? This is from the upstream cluster). After creating the task with the default TSO, this is the only error log output.

The TiCDC service logs (upstream cluster) have no error output after the task was created with the default TSO.

| username: Daniel-W | Original post link

Check the logs of the upstream TiKV cluster, and also take a look at the Sink Write Duration monitoring panel.

| username: Fly-bird | Original post link

How is the performance of your replica database?

| username: TiDBer_38aUQ8Ol | Original post link

Indeed, I checked the upstream cluster's TiKV logs and found no error logs. The Sink Write Duration panel is empty.

| username: TiDBer_38aUQ8Ol | Original post link

The replica's spec is half the primary's: the primary's KV and DB nodes are all 16c32g, while the replica uses 8c16g. Does the spec matter much? From the replica cluster's monitoring, CPU and memory usage are not high.

| username: 路在何chu | Original post link

Run SHOW PROCESSLIST and check whether the replica is executing the operations coming from the primary.
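
For example, against the downstream cluster (connection parameters are placeholders):

```shell
# Watch what the replica is executing; TiCDC's writes appear as
# ordinary sessions from the sink's MySQL user.
mysql -h <secondary-tidb-host> -P 4000 -u root -e "SHOW FULL PROCESSLIST;"
```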

| username: 路在何chu | Original post link

Are there any abnormal error logs?

| username: TiDBer_38aUQ8Ol | Original post link

The replica did execute some of the operations, but not all of them, which is very strange. There are no errors in the logs so far.

| username: 像风一样的男子 | Original post link

Are all the CDC task logs normal?

| username: TiDBer_38aUQ8Ol | Original post link

When I checked it earlier, the task started on the 10th was still normal. Yesterday its lag exceeded 24 hours and the CDC task had reported an error, so I deleted it manually.
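
(For reference, a failed changefeed can be removed by ID with the CLI; the address and ID are placeholders:)

```shell
tiup cdc cli changefeed remove --server=http://<cdc-host>:8300 -c <changefeed-id>
```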

| username: xmlianfeng | Original post link

I encountered a similar issue: when replicating tables through CDC between TiDB clusters in the same network segment, the lag keeps increasing. I found a temporary workaround: for tasks whose lag exceeds 30 minutes, pause them and then resume them. See if this works for you (a rough automation sketch follows below).
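
A rough sketch of automating that workaround, assuming GNU date and jq are available (the CDC address is a placeholder): it pauses and resumes every changefeed whose checkpoint lags by more than 30 minutes.

```shell
#!/usr/bin/env bash
# Sketch: pause and resume any changefeed whose checkpoint lags more
# than 30 minutes. Assumes GNU date and jq; the address is a placeholder.
SERVER="http://<cdc-host>:8300"
THRESHOLD=1800  # seconds

tiup cdc cli changefeed list --server="$SERVER" |
  jq -r '.[] | "\(.id) \(.summary.checkpoint)"' |
  while read -r id ckpt_date ckpt_time; do
    ckpt=$(date -d "$ckpt_date $ckpt_time" +%s)
    now=$(date +%s)
    if (( now - ckpt > THRESHOLD )); then
      echo "changefeed $id lags $((now - ckpt))s; pausing and resuming"
      tiup cdc cli changefeed pause  --server="$SERVER" -c "$id"
      tiup cdc cli changefeed resume --server="$SERVER" -c "$id"
    fi
  done
```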

| username: TiDBer_38aUQ8Ol | Original post link

So this is how it will have to be operated from now on? Pause and resume every time the lag exceeds 30 minutes? This replication is a long-term task, and doing that every time is quite a headache :joy:

| username: 像风一样的男子 | Original post link

Are there enough CDC resources?
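
(One way to check: list the TiCDC captures and their status; the address is a placeholder:)

```shell
# Each capture is one TiCDC process; check how many are online and
# which one is the owner.
tiup cdc cli capture list --server=http://<cdc-host>:8300
```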