CDC TSO Not Advancing (Some Tables Stuck)

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: cdc tso不推进(有部分表卡住)

| username: porpoiselxj

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] V7.1.1
[Reproduction Path]
Created a relatively large composite index (the table has about 400 million rows), which caused CDC to stall with the error [CDC:ErrPDEtcdAPIError] etcd API call error: context deadline exceeded. Paused the changefeed and, after the index finished building, attempted to resume it.
After resuming, data started appearing in Kafka again, but the changefeed's TSO did not advance. Inspecting the messages in Kafka showed that some tables had caught up to the latest data while others were stuck on incremental data (presumably because the TSO was not advancing). However, no specific error appeared in the logs, only the usual background noise such as: fail to load safepoint from PD / requested PD is not the leader of the cluster.
[Encountered Issues: Problem Phenomenon and Impact]
There are two main issues:

  1. Creating an index causes CDC to stall.
  2. After resuming the changefeed, the TSO cannot advance because of a few stuck tables, and there is no visibility into the reason.
    Once the data volume reaches a certain size, the changefeed can only tolerate very short interruptions; if the outage lasts even slightly longer, it generally cannot catch up, and the changefeed has to be rebuilt, resulting in data loss.
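To quantify how far behind a stuck changefeed is, one approach (a sketch; the PD address, changefeed ID, and the example TSO value are illustrative) is to query the checkpoint TSO with `cdc cli` and convert it to wall-clock time. A TiDB TSO stores the physical timestamp in milliseconds in its upper bits, shifted left by 18 logical bits:

```shell
# Query the changefeed state (illustrative PD address and changefeed ID):
#   cdc cli changefeed query --pd=http://<pd_address>:2379 --changefeed-id=<id>
# The JSON output contains "checkpoint-tso"; convert it to Unix milliseconds
# by dropping the 18 logical bits:
checkpoint_tso=445644800000000000   # example value, not from a real cluster
physical_ms=$((checkpoint_tso >> 18))
echo "checkpoint time (unix ms): $physical_ms"
```

Comparing this value with the current time shows the replication lag; if it stays constant while Kafka keeps receiving data for some tables, only part of the changefeed is stuck.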

[Resource Configuration]
PD / TiDB-server / CDC are co-located on 3 servers; each has 256 GB memory / 48 cores / SSD. Hardware resources are not the bottleneck, and TiKV is deployed on separate nodes.
While the issue was occurring, the relevant CDC monitoring panels showed significant fluctuations.

| username: Billmay表妹 | Original post link

Mixed deployment makes resource contention more likely. It is fine to deploy PD and TiDB together, but TiCDC should be deployed on its own nodes.

| username: Billmay表妹 | Original post link

Troubleshooting steps:

  1. Check the TiCDC logs: Review the TiCDC logs, especially around the time the issue occurred, looking for errors or anomalies related to the index creation. You can use tail -n 1000 <cdc_log_file> to view recent entries.
  2. Check the CDC monitoring metrics: Use the TiCDC dashboards to check its operational status, paying particular attention to replication latency, sync throughput, and error counts. These metrics can be viewed in Prometheus or Grafana.
  3. Check the TiCDC configuration: Review the changefeed and TiCDC server configuration to ensure the parameters are set correctly, especially performance-related settings such as sink and memory limits. Refer to the official TiCDC documentation for the meaning and recommended values of these parameters.
  4. Check the TiCDC version: Make sure you are running a recent TiCDC release; newer versions fix known issues and include performance improvements.
  5. Check the TiKV status: Use TiUP or pd-ctl to check the state of the TiKV cluster, making sure all nodes are running normally and there are no abnormal Region or Leader distributions. You can run tiup ctl:v7.1.1 pd -u <pd_address> store and tiup ctl:v7.1.1 pd -u <pd_address> region (use the ctl version matching your cluster).
  6. Check the TiKV monitoring metrics: Check TiKV's operational status, paying particular attention to CPU usage, memory usage, and disk IO. These metrics can be viewed in Prometheus or Grafana.
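Steps 1 and 5 above can be sketched as shell commands (the log path, PD address, and the sample log lines are illustrative; the ctl version should match the cluster, e.g. v7.1.1 here):

```shell
# Step 1: scan the tail of the TiCDC log for errors around the incident.
# /tmp/cdc_sample.log stands in for the real log file on a cdc node.
cat > /tmp/cdc_sample.log <<'EOF'
[2023/11/20 10:00:01.000 +08:00] [ERROR] [client.go] ["[CDC:ErrPDEtcdAPIError] etcd API call error: context deadline exceeded"]
[2023/11/20 10:00:02.000 +08:00] [INFO] [processor.go] ["table resolved ts advanced"]
EOF
tail -n 1000 /tmp/cdc_sample.log | grep -c 'ERROR'   # count recent error lines

# Step 5: check store/region health (requires a live cluster; shown for reference):
#   tiup ctl:v7.1.1 pd -u http://<pd_address>:2379 store
#   tiup ctl:v7.1.1 pd -u http://<pd_address>:2379 region
```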
| username: 芮芮是产品 | Original post link

All the hard times come from mixed deployment.

| username: dba远航 | Original post link

Caused by resource contention

| username: xfworld | Original post link

Check the metrics related to the CDC TSO; the chart currently in the post is not enough to make a judgment.

Creating an index by itself won't get CDC stuck, but the data backfill consumes additional IO, and if physical IO is insufficient it can produce exactly this kind of stall.
This is especially true for large tables with lots of data: after the table structure changes, the data has to be backfilled…
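Since backfill IO is the suspect, one mitigation (a sketch; the values are illustrative, though `tidb_ddl_reorg_worker_cnt` and `tidb_ddl_reorg_batch_size` are standard TiDB system variables) is to throttle the index backfill before creating the index:

```shell
# Write the throttle statements, then apply them with the mysql client, e.g.:
#   mysql -h <tidb_host> -P 4000 -u root < throttle_backfill.sql
cat > throttle_backfill.sql <<'EOF'
-- Fewer backfill workers and smaller batches: the DDL runs longer,
-- but competes less with TiCDC for disk IO.
SET GLOBAL tidb_ddl_reorg_worker_cnt = 2;
SET GLOBAL tidb_ddl_reorg_batch_size = 256;
EOF
grep -c 'SET GLOBAL' throttle_backfill.sql
```

The trade-off is a slower `ADD INDEX`, which may be acceptable if it keeps the changefeed healthy.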

| username: TiDBer_小阿飞 | Original post link

Check hardware resource usage with the top command to see what is consuming the resources.

| username: porpoiselxj | Original post link

Could you please provide the recommended configuration for independent CDC deployment?

| username: Billmay表妹 | Original post link

Give each instance its own dedicated resources; do not deploy them on the same nodes.

| username: porpoiselxj | Original post link

After upgrading TiDB to v7.1.3 and moving the CDC component from the previous mixed deployment with TiDB & PD to dedicated nodes, there have basically been no further issues.
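For reference, moving TiCDC onto its own nodes can be done with a TiUP scale-out (a sketch; the host IPs and cluster name are illustrative, while `cdc_servers` is the standard TiUP topology key):

```shell
# Topology for the new dedicated TiCDC nodes. After scaling out,
# the old co-located cdc instances can be removed with
# `tiup cluster scale-in`.
cat > scale-out-cdc.yaml <<'EOF'
cdc_servers:
  - host: 10.0.1.101   # illustrative IPs of the CDC-only machines
  - host: 10.0.1.102
  - host: 10.0.1.103
EOF
# Apply against a live cluster (shown for reference):
#   tiup cluster scale-out <cluster-name> scale-out-cdc.yaml
grep -c 'host:' scale-out-cdc.yaml
```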

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.