TiCDC Synchronizing Data to MySQL: Unexplained Increasing Latency

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiCDC同步数据到MySQL不知道什么原因延迟的不断增大

| username: seiang

【TiDB Usage Environment】Production
【TiDB Version】
V5.0.3
【Encountered Problem】
The TiCDC task replicating data to MySQL has been running for over a month, but this morning the replication lag started to grow continuously. However, the write volume of the upstream TiDB cluster has not changed, and nothing has changed on the downstream MySQL either.


CDC node logs are as follows:
cdc.log.tar.gz (5.8 MB)



| username: polars | Original post link

We encountered a similar issue with version 5.3. Suddenly, the latency increased significantly, and then it stopped syncing. The memory usage was also particularly high, reaching up to 50GB.

| username: seiang | Original post link

Is it also synchronized to MySQL through CDC? Did you identify the specific cause? How did you finally resolve it?

| username: dba-kit | Original post link

I previously encountered a situation where executing a large transaction on TiDB (backing up a table using INSERT INTO ... SELECT) caused TiCDC to lag continuously. Looking at your monitoring, Unified Sort on disk keeps increasing, which also seems to point to a large transaction.
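
For reference, one way to check whether a long-running statement or large transaction is still active on the upstream cluster is to query the cluster-wide processlist. This is only a minimal sketch; the host, port, and the 60-second threshold are placeholders, not values from the original thread:

# Sessions that have been running a single statement for more than 60 seconds
mysql -h <tidb-host> -P 4000 -u root -p -e "
  SELECT instance, id, user, db, time, LEFT(info, 100) AS stmt
  FROM information_schema.cluster_processlist
  WHERE command != 'Sleep' AND time > 60
  ORDER BY time DESC;"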

| username: dba-kit | Original post link

At that time, I first identified which table was causing the lag, then modified the changefeed to temporarily stop syncing that table and prioritized restoring replication for the other tables. I used Dumpling to export that table and created a separate changefeed to sync it, thereby skipping the large transaction. Once the business data caught up, I planned to find a time to merge the two changefeeds.
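
For reference, a minimal sketch of that kind of split, assuming TiCDC v5.0.3 driven through tiup ctl as elsewhere in this thread. The PD address, changefeed IDs, database and table names, sink URI, passwords, and file paths below are placeholders, not values from the original thread:

# Pause the main changefeed before changing its configuration
tiup ctl:v5.0.3 cdc changefeed pause --pd=http://<pd-host>:2379 --changefeed-id=<main-task>

# changefeed.toml (sketch): exclude the problem table with table-filter rules
#   [filter]
#   rules = ['mydb.*', '!mydb.big_table']
tiup ctl:v5.0.3 cdc changefeed update --pd=http://<pd-host>:2379 --changefeed-id=<main-task> --config=changefeed.toml
tiup ctl:v5.0.3 cdc changefeed resume --pd=http://<pd-host>:2379 --changefeed-id=<main-task>

# Export only the problem table with Dumpling and import it into the downstream MySQL
tiup dumpling -h <tidb-host> -P 4000 -u root -p '<password>' -T mydb.big_table -o /tmp/big_table_dump

# Then create a second changefeed for just that table, starting from the dump's TSO
# (taken from the metadata file Dumpling writes into the output directory)
tiup ctl:v5.0.3 cdc changefeed create --pd=http://<pd-host>:2379 \
  --changefeed-id=<big-table-task> \
  --sink-uri="mysql://<user>:<password>@<mysql-host>:3306/" \
  --start-ts=<pos-from-dumpling-metadata> \
  --config=big_table_only.toml

The second changefeed's config would contain only a rule for the exported table; once it has caught up, the two changefeeds can be merged again by restoring the original filter rules.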

| username: seiang | Original post link

It shouldn't be a transaction issue. I only replicate one table, and that table only receives inserts. Still, Unified Sort on disk keeps growing, and I can't pinpoint the reason for it.

| username: 代晓磊_Mars | Original post link

Check whether any heavy DDL operations (such as adding an index) have been run on the replicated table in the upstream TiDB, or whether a DDL was executed just before the lag started.
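
For reference, a quick way to check recent DDL activity on the upstream cluster (a sketch; the host and port are placeholders):

# List currently running and recently finished DDL jobs on the upstream TiDB
mysql -h <tidb-host> -P 4000 -u root -p -e "ADMIN SHOW DDL JOBS;"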

| username: 代晓磊_Mars | Original post link

Execute the following command to check whether the current replication task is reporting any errors:

cdc cli changefeed query -s --pd=http://10.0.10.25:2379 --changefeed-id=simple-replication-task

| username: tidb狂热爱好者 | Original post link

It should be a large transaction, and TiCDC is blocked by it. TiDB's schema changes are online DDL and complete very quickly, unlike MySQL's. With this kind of replication, once the data volume grows, MySQL can't keep up. This just goes to show how good TiDB is.

| username: ShawnYan | Original post link

Starting from 6:34, a large number of these logs appeared. Did something happen to TiKV at this time?

[2022/06/28 06:34:00.826 +08:00] [WARN] [client.go:1543] ["region not receiving resolved event from tikv or resolved ts is not pushing for too long time, try to resolve lock"] [regionID=4149003] [span="[7480000000000004ff415f728000000009ff1900410000000000fa, 7480000000000004ff415f728000000009ff1c9a5f0000000000fa)"] [duration=21.35s] [lastEvent=132.861559ms] [resolvedTs=434207252666122299]
[2022/06/28 06:34:00.830 +08:00] [WARN] [client.go:1543] ["region not receiving resolved event from tikv or resolved ts is not pushing for too long time, try to resolve lock"] [regionID=4107597] [span="[7480000000000004ffbf5f728000000001ff5cdfe10000000000fa, 7480000000000004ffbf5f728000000001ff607c870000000000fa)"] [duration=21.35s] [lastEvent=136.812613ms] [resolvedTs=434207252666122299]
[2022/06/28 06:34:00.836 +08:00] [WARN] [client.go:1543] ["region not receiving resolved event from tikv or resolved ts is not pushing for too long time, try to resolve lock"] [regionID=4044439] [span="[7480000000000004ffbb5f728000000003ff3564880000000000fa, 7480000000000004ffbb5f728000000003ff3902770000000000fa)"] [duration=21.35s] [lastEvent=141.137059ms] [resolvedTs=434207252666122299]

| username: seiang | Original post link

No DDL operations were performed on the replicated table today, and the replication task status is also normal:
$ tiup ctl:v5.0.3 cdc changefeed query -s --pd=http://10.30.xx.xx:2379 --changefeed-id=xxxs-sync-task
Starting component ctl: /home/tidb/.tiup/components/ctl/v5.0.3/ctl cdc changefeed query -s --pd=http://10.30.xx.xx:2379 --changefeed-id=xxxs-sync-task
{
  "state": "normal",
  "tso": 434211597270384728,
  "checkpoint": "2022-06-28 11:09:52.810",
  "error": null
}

| username: seiang | Original post link

At 6:34, there were no anomalies with TiKV. According to the monitoring, the TiCDC synchronization delay started to gradually increase around 08:20 in the morning.

Moreover, the delay is getting larger and larger now.

| username: dba-kit | Original post link

Is there MySQL monitoring? Is it possible that the downstream MySQL performance is insufficient?
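
For reference, a few quick checks on the downstream MySQL if no monitoring is in place (a sketch; the host and credentials are placeholders):

# Are sessions piling up or blocking each other on the downstream MySQL?
mysql -h <mysql-host> -P 3306 -u <user> -p -e "SHOW FULL PROCESSLIST;"

# Active threads and InnoDB pressure (locks, IO, buffer pool)
mysql -h <mysql-host> -P 3306 -u <user> -p -e "SHOW GLOBAL STATUS LIKE 'Threads_running';"
mysql -h <mysql-host> -P 3306 -u <user> -p -e "SHOW ENGINE INNODB STATUS\G"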

| username: db_user | Original post link

I suggest checking the MySQL general log, binary log, or doing a tcpdump on port 3306. I previously encountered a situation where a developer was performing a large number of deletes, which blocked other operations. This was detected through tcpdump monitoring.
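
For reference, a minimal sketch of that kind of capture; the network interface, hosts, and file paths are placeholders, and the general log adds overhead, so it should be switched off again afterwards:

# Capture traffic on the MySQL port for offline analysis (e.g. with Wireshark or pt-query-digest)
tcpdump -i <interface> -s 0 port 3306 -w /tmp/mysql_3306.pcap

# Or temporarily enable the general log on the downstream MySQL
mysql -h <mysql-host> -u <user> -p -e "SET GLOBAL general_log_file = '/tmp/mysql_general.log'; SET GLOBAL general_log = ON;"
# ... observe for a while ...
mysql -h <mysql-host> -u <user> -p -e "SET GLOBAL general_log = OFF;"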

| username: ShawnYan | Original post link

However, from your log, the QPS was 0 the day before yesterday and yesterday, and data only started coming in from early this morning.

[2022/06/27 23:31:07.247 +08:00] ... [changefeed=xxx-task]  [count=0] [qps=0]
[2022/06/27 23:41:07.247 +08:00] ... [changefeed=xxx-task]  [count=0] [qps=0]
[2022/06/27 23:51:07.248 +08:00] ... [changefeed=xxx-task]  [count=0] [qps=0]
[2022/06/28 00:01:07.248 +08:00] ... [changefeed=xxx-task]  [count=119204] [qps=198]
[2022/06/28 00:11:07.249 +08:00] ... [changefeed=xxx-task]  [count=1008319] [qps=1680]
[2022/06/28 00:21:07.272 +08:00] ... [changefeed=xxx-task]  [count=932367] [qps=1553]
| username: seiang | Original post link

Data has been replicating the whole time, because there are two TiCDC nodes. In the log of the other TiCDC node you can see the earlier replication, and the QPS is not zero.

[2022/06/27 02:21:00.785 +08:00] [INFO] [statistics.go:126] ["sink replication status"] [name=mysql] [changefeed=xxxx-task] [capture=xx.xx.xx.108:8300] [count=429212] [qps=715]
[2022/06/27 02:31:00.789 +08:00] [INFO] [statistics.go:126] ["sink replication status"] [name=mysql] [changefeed=xxxx-task] [capture=xx.xx.xx.108:8300] [count=404536] [qps=674]
[2022/06/27 02:41:00.791 +08:00] [INFO] [statistics.go:126] ["sink replication status"] [name=mysql] [changefeed=xxxx-task] [capture=xx.xx.xx.108:8300] [count=384137] [qps=640]
[2022/06/27 02:51:00.797 +08:00] [INFO] [statistics.go:126] ["sink replication status"] [name=mysql] [changefeed=xxxx-task] [capture=xx.xx.xx.108:8300] [count=365038] [qps=608]
[2022/06/27 03:01:00.797 +08:00] [INFO] [statistics.go:126] ["sink replication status"] [name=mysql] [changefeed=xxxx-task] [capture=xx.xx.xx.108:8300] [count=350421] [qps=584]
[2022/06/27 03:11:00.837 +08:00] [INFO] [statistics.go:126] ["sink replication status"] [name=mysql] [changefeed=xxxx-task] [capture=xx.xx.xx.108:8300] [count=334392] [qps=557]
[2022/06/27 03:21:00.848 +08:00] [INFO] [statistics.go:126] ["sink replication status"] [name=mysql] [changefeed=xxxx-task] [capture=xx.xx.xx.108:8300] [count=325291] [qps=542]
[2022/06/27 03:31:00.904 +08:00] [INFO] [statistics.go:126] ["sink replication status"] [name=mysql] [changefeed=xxxx-task] [capture=xx.xx.xx.108:8300] [count=312800] [qps=521]
[2022/06/27 03:41:00.904 +08:00] [INFO] [statistics.go:126] ["sink replication status"] [name=mysql] [changefeed=xxxx-task] [capture=xx.xx.xx.108:8300] [count=302907] [qps=504]
[2022/06/27 03:51:00.909 +08:00] [INFO] [statistics.go:126] ["sink replication status"] [name=mysql] [changefeed=xxxx-task] [capture=xx.xx.xx.108:8300] [count=293315] [qps=488]
[2022/06/27 04:01:00.909 +08:00] [INFO] [statistics.go:126] ["sink replication status"] [name=mysql] [changefeed=xxxx-task] [capture=xx.xx.xx.108:8300] [count=288179] [qps=480]
[2022/06/27 04:11:00.910 +08:00] [INFO] [statistics.go:126] ["sink replication status"] [name=mysql] [changefeed=xxxx-task] [capture=xx.xx.xx.108:8300] [count=279882] [qps=466]
[2022/06/27 04:21:00.910 +08:00] [INFO] [statistics.go:126] ["sink replication status"] [name=mysql] [changefeed=xxxx-task] [capture=xx.xx.xx.108:8300] [count=278376] [qps=463]
[2022/06/27 04:31:00.921 +08:00] [INFO] [statistics.go:126] ["sink replication status"] [name=mysql] [changefeed=xxxx-task] [capture=xx.xx.xx.108:8300] [count=275506] [qps=459]
[2022/06/27 04:41:00.921 +08:00] [INFO] [statistics.go:126] ["sink replication status"] [name=mysql] [changefeed=xxxx-task] [capture=xx.xx.xx.108:8300] [count=277555] [qps=462]
[2022/06/27 04:51:00.922 +08:00] [INFO] [statistics.go:126] ["sink replication status"] [name=mysql] [changefeed=xxxx-task] [capture=xx.xx.xx.108:8300] [count=279637] [qps=466]

Below is the log of the other CDC node:
cdc.log.tar.gz (15.8 MB)

| username: seiang | Original post link

The issue has been resolved; it was caused by a performance problem on the downstream MySQL.

Thanks to @dba-kit @代晓磊_Mars @ShawnYan @蒋明tidb狂热爱好者 for providing the troubleshooting ideas.

| username: BraveChen | Original post link

I am thinking…

| username: BraveChen | Original post link

Can you post what the issue with the downstream MySQL turned out to be?

| username: dba-kit | Original post link

Hahaha, indeed.
If it's confirmed on the TiDB side that there are no changes and no large transactions, then change direction and investigate the performance of the downstream MySQL. Is it possible that its write speed slowed down and that is holding TiCDC back?