TiCDC: How to Determine Whether Pushing to Kafka Is Slow When No Large Transactions Are Found on the Source Side

translator_bot · June 23, 2024, 12:11am

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 源端没有发现大事务，ticdc如何分析是否是推送到kafka慢

| username: wluckdog

[TiDB Usage Environment] Production Environment
[TiDB Version] v6.1.0
[Reproduction Path] Operations performed that led to the issue
[Encountered Issue: Issue Phenomenon and Impact]
Alert Details: TiCDC heap memory usage is over 10 GB
[Resource Configuration]
The CDC host is configured with 256 GB of memory, but only about 10 GB is actually used, and TiCDC still uses host disk space. How can I adjust it to use more memory? The CDC task is also lagging. No large transactions were found on the source side, so how can I determine whether the push to Kafka is the slow part?
        total   used   free   shared   buff/cache   available
Mem:      251      9    232        0            9         241
Swap:      31      0     31
[Attachments: Screenshots/Logs/Monitoring]

| username: xfworld

Consider setting up TiCDC monitoring first, so that you have more metrics to refer to.

You can also refer to the official troubleshooting method.

| username: wluckdog

I mainly want to know where the bottleneck behind the CDC delay is. Judging by transaction execution times over the past 24 hours, there should be no large transactions.

| username: xfworld

Common troubleshooting steps for latency issues:

1. Upstream
   a. Large transaction commits that delay processing
   b. Resource bottlenecks that slow down event processing
   c. Network congestion
2. CDC itself
   Refer to the monitoring metrics, which are readily available; no further description is needed here.
3. Downstream
   a. Slow reception: downstream congestion causes a backlog and delayed processing
   b. Network congestion
   c. Insufficient resources that slow down processing

You will need to investigate each item specifically, which can be quite troublesome. This is for your reference.
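
One way to narrow down which of the three stages above is lagging is to compare how far behind TiCDC is in receiving changes (resolved-ts lag) with how far behind the downstream is (checkpoint lag). The sketch below is only a rough heuristic and not an official procedure: the `LagSample` type, the `locate_bottleneck` function, the 60-second threshold, and the result labels are all hypothetical names chosen for illustration, and the lag values are assumed to be read by hand from the TiCDC monitoring dashboard.

```python
from dataclasses import dataclass


@dataclass
class LagSample:
    """Lag values in seconds for one changefeed (hypothetical input shape)."""
    resolved_ts_lag_s: float    # how far behind TiCDC is in receiving upstream changes
    checkpoint_ts_lag_s: float  # how far behind the data written to Kafka is


def locate_bottleneck(sample: LagSample, threshold_s: float = 60.0) -> str:
    """Rough triage of where replication lag accumulates.

    threshold_s is an arbitrary illustrative value, not a TiCDC default.
    """
    if sample.checkpoint_ts_lag_s < threshold_s:
        # Downstream is close to real time, so nothing is lagging badly.
        return "healthy"
    if sample.resolved_ts_lag_s >= threshold_s:
        # Changes are already late when they reach TiCDC: look upstream
        # (large transactions, TiKV resources, network between TiKV and CDC).
        return "upstream-or-cdc-input"
    # Changes arrive on time but are written out late: look at CDC itself
    # or the downstream (Kafka throughput, network, broker resources).
    return "cdc-or-downstream"


if __name__ == "__main__":
    print(locate_bottleneck(LagSample(resolved_ts_lag_s=3, checkpoint_ts_lag_s=600)))
```

If the resolved-ts lag stays small while the checkpoint lag keeps growing, the upstream items are less likely to be the cause, and the Kafka side is the next thing to check.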

| username: wluckdog

When TiCDC is deployed on its own hosts, does the TiCDC heap memory usage refer to memory on the TiKV host or on the CDC host? I am still somewhat unclear on how this works.

| username: Min_Chen

Hello, please provide the monitoring data. You can export it with Clinic by following the method in "Using PingCAP Clinic to Diagnose the Cluster".

| username: wluckdog

The collection gathers too many files, the files are too large, and it is too slow, so the collection gets interrupted every time. Is there any other diagnostic method?

| username: Min_Chen

If the Clinic method is not convenient to use, you can export the monitoring data using the method described in PingCAP MetricsTool.
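
If only a few metrics are needed, another option is to query the cluster's Prometheus directly through its standard HTTP API (`/api/v1/query_range`). This is not the MetricsTool procedure, just a minimal sketch. The Prometheus address, the `job="ticdc"` label, the time range, and the helper name `build_query_range_url` are assumptions for illustration and should be adjusted to the actual deployment.

```python
from urllib.parse import urlencode


def build_query_range_url(base_url: str, promql: str,
                          start: int, end: int, step: str = "60s") -> str:
    """Build a Prometheus range-query URL (start/end are Unix timestamps)."""
    params = urlencode({"query": promql, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


url = build_query_range_url(
    "http://127.0.0.1:9090",                      # assumed Prometheus address
    'go_memstats_heap_alloc_bytes{job="ticdc"}',  # Go heap metric; job label is an assumption
    start=1668750000,                             # arbitrary example time range
    end=1668753600,
)
print(url)
```

Fetching the printed URL (for example with `curl`) returns the time series as JSON, which is much smaller than a full diagnostic package.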

| username: wluckdog

calctidb-cluster-TiCDC_2022-11-18T07_33_28.835Z.rar (1.8 MB)