TiCDC latency is very high, then suddenly returns to normal at a certain point in time

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiCDC 延迟很高,在某一时刻突然恢复正常 (TiCDC latency is very high, then suddenly returns to normal at a certain moment)

| username: TiDBer_gLV5ml22

[TiDB Usage Environment] Production Environment / Testing
[TiDB Version] 5.4.0, CDC version is 5.4.3
[Reproduction Path] Reproduced under stress testing
[Encountered Problem: Phenomenon and Impact]
TiCDC latency is very high, and at a certain moment it suddenly drops and returns to normal. The write rate has not met expectations, and the logs indicate that flushing is too slow, but the downstream MQ load is not high (a diagnostic sketch follows below).
[Resource Configuration] Enter TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]
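
To quantify the lag from the command line, a minimal diagnostic sketch (the PD endpoint and changefeed ID are placeholders):

```bash
# List changefeeds; the "checkpoint" field shows how far replication has
# progressed, and a checkpoint far behind the current time means high lag.
cdc cli changefeed list --pd=http://127.0.0.1:2379

# Inspect a single changefeed in detail (the ID is a placeholder).
cdc cli changefeed query --pd=http://127.0.0.1:2379 --changefeed-id="stress-test-task"
```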


| username: xfworld | Original post link

Are the cluster resource load and the load on the TiCDC nodes both normal?

| username: yiduoyunQ | Original post link

Search for BIG_TXN in TiDB logs.
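
For example, a minimal sketch (the log path assumes a default tiup deployment layout and should be adjusted to your environment):

```bash
# Search the TiDB server log for big-transaction warnings.
# The path below is an assumption based on a default tiup deployment.
grep -n "BIG_TXN" /tidb-deploy/tidb-4000/log/tidb.log
```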

| username: zhaokede | Original post link

Is it periodic?
Is there any change in business during this time?

| username: TiDBer_gLV5ml22 | Original post link

The cluster load is not high, and the stress test was stopped yesterday. This is the CDC's monitoring.

| username: TiDBer_gLV5ml22 | Original post link

It was discovered during the stress test. There were no actions taken on the business side. The stress test was started yesterday, and after noticing high latency, it was stopped. This morning, the latency suddenly decreased without any operations being performed in the meantime.

| username: TiDBer_gLV5ml22 | Original post link

I communicated with the business about this issue and advised them not to insert too much data at once during stress testing. There were no BIG_TXN occurrences in yesterday’s stress test, but the problem still exists. There are many errors like this in the TiDB logs. Is it related to this?

[ERROR] [terror.go:307] ["encountered error"] [error="read tcp 10.142.28.29:5000->10.149.44.1:40913: read: connection reset by peer"] [stack="github.com/pingcap/tidb/parser/terror.Log\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/parser/terror/terror.go:307\ngithub.com/pingcap/tidb/server.(*Server).onConn\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/server.go:516"]

| username: 这里介绍不了我 | Original post link

Why not consider upgrading? TiCDC performance in v6 is very robust.

| username: WinterLiu | Original post link

It feels like there are large transactions on the source end.

| username: TiDBer_gLV5ml22 | Original post link

Is version 6 compatible with TiDB 5.4.0? Which version do you recommend, 6.0 or 6.6?

| username: TiDBer_gLV5ml22 | Original post link

Can inserts also cause large transactions?

| username: WinterLiu | Original post link

If you have a large number of inserts during stress testing, there will definitely be I/O pressure, which leads to CDC replication delays.
Additionally, if the insert statements in the stress test commit many rows at once, the transactions will be numerous and large, and delays will naturally occur (see the sketch below).
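
As a hedged illustration (host, port, database, and the batch files are all made up), keeping each commit to a modest number of rows avoids creating large transactions during a load test:

```bash
# Hypothetical sketch: load data as many small transactions instead of one
# huge INSERT. Host, port, database, and the batch files are placeholders.
for f in batch_*.sql; do
  # each batch_*.sql holds one multi-row INSERT of a modest size (e.g. ~1000 rows)
  mysql -h tidb-host -P 4000 -u root testdb < "$f"
done
```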

| username: TiDBer_gLV5ml22 | Original post link

The main issue is that even after stopping the stress test, the synchronization is still very slow. The various metrics in dataflow are quite low, and the downstream load is not high either. I checked the logs and there are no BIG_TXN.
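
One way to double-check the downstream side, as a sketch assuming a Kafka sink (the broker address and consumer group name are placeholders):

```bash
# If the sink were the bottleneck, the consumer group would show large,
# growing LAG values here; low lag supports the "downstream is fine" reading.
kafka-consumer-groups.sh --bootstrap-server kafka-host:9092 \
  --describe --group ticdc-test-group
```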

| username: 这里介绍不了我 | Original post link

For cluster versions 6.5 and above, TiCDC is recommended; for versions below 6.5, TiDB Binlog is recommended.

| username: dba-kit | Original post link

Stress testing? Are you still in the POC stage? I suggest testing with the latest 7.1 release instead of 5.x; TiCDC performance improved a lot in 6.x.

| username: TIDB-Learner | Original post link

Older versions of TiCDC are not very good. I used it before, and issues would occasionally arise; synchronization timeliness was poor. We have now upgraded to 6.5.x and will need TiCDC in the next phase. I hope, as everyone says, the performance will be high, high, high.

| username: TiDBer_gLV5ml22 | Original post link

Originally, we were running version 5.4.0 online and encountered a deadlock issue. We planned to upgrade to 5.4.3 to resolve it, but the problem still persists. Tomorrow, we plan to test version 6.5.9. We are somewhat concerned about version 7.1 because upgrading across two major versions might lead to compatibility issues.

| username: 健康的腰间盘 | Original post link

Use third-party synchronization tools.

| username: yytest | Original post link

  • Optimize TiKV configuration:
    • Adjust TiKV configuration parameters based on monitoring data, such as increasing raft-base-tick-interval, raft-log-gc-tick-count, etc. (see the sketch after this list).
  • Optimize TiCDC configuration:
    • Adjust TiCDC's concurrency, and appropriately increase parameters like max-message-bytes and batch-size (see the sketch after this list).
    • If using MQ, consider increasing the number of MQ consumers to improve consumption speed.
  • Adjust the write rate:
    • If the write rate does not meet expectations, try reducing write pressure or optimizing the write pattern, for example batched writes instead of single-row writes.
  • Upgrade hardware:
    • If the resource configuration is insufficient, consider upgrading hardware, especially disk I/O and network bandwidth.
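
A hedged sketch of how the TiKV and TiCDC items above might be applied (cluster name, PD endpoint, sink URI, topic, and all values are placeholders, not tuned recommendations):

```bash
# TiKV side: adjust raftstore parameters through tiup, then reload.
# The keys are real raftstore settings; the values are illustrative only.
tiup cluster edit-config my-cluster
#   server_configs:
#     tikv:
#       raftstore.raft-base-tick-interval: "2s"
#       raftstore.raft-log-gc-tick-count: 10
tiup cluster reload my-cluster -R tikv

# TiCDC side: for a Kafka sink, max-message-bytes is passed in the sink URI
# when the changefeed is created (endpoint, topic, and sizes are placeholders).
cdc cli changefeed create --pd=http://127.0.0.1:2379 \
  --changefeed-id="stress-test-task" \
  --sink-uri="kafka://kafka-host:9092/ticdc-topic?protocol=canal-json&partition-num=6&max-message-bytes=67108864&replication-factor=1"
```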

| username: juecong | Original post link

In addition to transactions, also check whether TiCDC's thread configuration and memory usage are reasonable.
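
For instance, a hedged sketch (PD endpoint and changefeed ID are placeholders); in 5.x the changefeed's config section should show the mounter worker count:

```bash
# Inspect the changefeed's effective config (e.g. the mounter worker-num
# setting in 5.x) to judge whether the concurrency is reasonable.
cdc cli changefeed query --pd=http://127.0.0.1:2379 --changefeed-id="stress-test-task"

# Rough memory check on the TiCDC node (the process-name match is an assumption).
top -b -n 1 | grep -i cdc
```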