TiCDC Stuck for a Long Time and Not Synchronizing

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiCDC长时间卡住不同步

| username: Jellybean

[TiDB Usage Environment]
Production / Testing / PoC

[TiDB Version]

  1. TiDB cluster version v5.4.0
  2. TiCDC version v5.3.0 (v5.4.0 does not support writing to S3; the older v5.3.0 does)
    /cdc version
    Release Version: v5.3.0
    Git Commit Hash: 8b816c9717a6d18430f93a6b212f3685c88d4607
    Git Branch: HEAD
    UTC Build Time: 2022-08-10 04:27:19
    Go Version: go version go1.16.13 linux/amd64
    Failpoint Build: false

[Reproduction Path] What operations were performed to encounter the issue

In practice, TiCDC v5.4.0 does not support writing to S3, while the older v5.3.0 does, so v5.3.0 is used to sync to S3.

  1. The cluster’s baseline data is at the TB level, with over 100,000 Region Leaders, and the entire cluster’s write QPS is around 2k~3k, which is not large.

  2. After TiCDC synchronization starts, the puller stage keeps pulling data and the sorter keeps receiving it, but the sorter never outputs anything, so no data reaches the sink; the observed QPS of writes to S3 is 0. The task has been running for nearly 10 hours. The memory used by the sorter grows gradually until it plateaus at 16 GB, at which point data starts spilling to disk; disk usage keeps climbing and is currently close to 100 GB and still rising.

  3. From the Grafana panel, the events/s at each stage of the dataflow are:
    puller output (8300-kv: 3k, 8300-resolved: 250k) -> sorter output (0 events/s) -> mounter output (0 events/s) -> sink output (0 events/s)

  4. The checkpoint of the TiCDC changefeed has been stuck at the sync task's start time and has not progressed, and TiCDC sync-task delay alerts keep firing (a CLI sketch for checking this is included after the configuration below).

  5. During the operation, the cluster has no memory, CPU, or IO performance bottlenecks, and there are sufficient remaining resources.

  6. The cluster has two CDC nodes:
    One node continuously logs: [WARN] [schema_storage.go:733] ["GetSnapshot is taking too long, DDL puller stuck?"] [ts=449931778037383875] [duration=5h50m15.195304586s].
    The other node continuously creates temporary sorter files on disk, with names like sort-20197-33303.tmp; about 35,000 temporary files have been created since the task started, and more are still being created.

  7. Current configuration:

  1. The cdc-server parameter per-table-memory-quota is set to 800 MB; 80% of the upstream cluster's write traffic is concentrated on one table.
  2. Current sync task configuration:
{
  "info": {
    "sink-uri": "s3://tidb-backup/ixxxxx?endpoint=xxxx",
    "opts": {},
    "create-time": "2024-05-22T09:39:17.871399198+08:00",
    "start-ts": 449928746197843972,
    "target-ts": 0,
    "admin-job-type": 0,
    "sort-engine": "unified",
    "sort-dir": "",
    "config": {
      "case-sensitive": true,
      "enable-old-value": true,
      "force-replicate": false,
      "check-gc-safe-point": true,
      "filter": {
        "rules": [
          "*.*"
        ],
        "ignore-txn-start-ts": null
      },
      "mounter": {
        "worker-num": 16
      },
      "sink": {
        "dispatchers": null,
        "protocol": "default"
      },
      "cyclic-replication": {
        "enable": false,
        "replica-id": 0,
        "filter-replica-ids": null,
        "id-buckets": 0,
        "sync-ddl": false
      },
      "scheduler": {
        "type": "table-number",
        "polling-time": -1
      },
      "consistent": {
        "level": "none",
        "max-log-size": 64,
        "flush-interval": 1000,
        "storage": ""
      }
    },
    "state": "normal",
    "history": null,
    "error": null,
    "sync-point-enabled": false,
    "sync-point-interval": 600000000000,
    "creator-version": "v5.3.0"
  },
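
For reference, a minimal command-line sketch for confirming whether the checkpoint/resolved-ts is advancing (the PD address and changefeed ID are placeholders); comparing two queries taken a minute or two apart shows whether the TSOs move:

# List all changefeeds and their current checkpoints
cdc cli changefeed list --pd=http://<pd-host>:2379

# Query one changefeed and note the checkpoint-ts / resolved-ts fields in the status section,
# then run it again a minute later and compare
cdc cli changefeed query --pd=http://<pd-host>:2379 --changefeed-id=<changefeed-id>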

Previous optimization experience with this version of CDC in another cluster: TiCDC双Sink同步增量备份到S3速度慢 (TiCDC dual-sink incremental backup to S3 is slow) - TiDB 的问答社区

Given the current cluster's scenario, I suspect there is a bug. Ideally, data should be pulled and filtered, with the filtered results passed directly to the sorter and downstream components for processing; even with a large data volume, the checkpoint should at least advance slowly.

Has anyone encountered this situation before, and how did you handle it?

| username: yytest | Original post link

Possible Causes and Solutions for TiCDC Being Stuck and Not Synchronizing for a Long Time

TiCDC being stuck and not synchronizing for a long time can have many causes. Here are some possible causes and the corresponding solutions:

  1. Uncommitted Transactions Upstream: If there are long-running uncommitted transactions upstream, TiCDC waits for these transactions to commit before continuing to synchronize data, which may cause synchronization delays (see the SQL sketch after this list).
  2. Insufficient Internal Processing Capacity: When TiCDC’s internal processing capacity is insufficient, synchronization tasks may report errors such as ErrBufferReachLimit. In this case, you can try adjusting TiCDC’s configuration to improve its processing capacity.
  3. Out of Memory (OOM): If TiCDC’s internal processing capacity is insufficient or the downstream throughput capacity is insufficient, it may cause an out-of-memory (OOM) issue. This can be resolved by increasing resources or optimizing application logic.
  4. Transaction Timeout: TiDB has a transaction timeout mechanism. When a transaction runs beyond max-txn-ttl, it will be forcibly rolled back by TiDB. TiCDC will wait for uncommitted transactions to commit before continuing to synchronize their data, resulting in synchronization delays.
  5. Physical Import Mode and BR Data Recovery: Using TiDB Lightning physical import mode and BR data recovery on tables synchronized by TiCDC may cause TiCDC synchronization to stall or get stuck. In this case, avoid using these operations on tables synchronized by TiCDC, or reconfigure the TiCDC synchronization task after using them.
  6. DDL Operations: If the upstream executes a lossy DDL operation, TiCDC synchronizes a DML event whose old and new values are identical to the downstream. Starting from v7.1.0, TiCDC filters out these useless DML events and no longer synchronizes them to the downstream.
  7. Synchronization Task Restart: If the synchronization task stops and restarts, synchronization delays may occur. This is because TiCDC needs to scan the historical versions of data in TiKV. Once the scan is complete, it can continue the replication process, which may take several minutes to tens of minutes.
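
As a rough way to check cause 1, one could look for long-open transactions on the upstream TiDB. A sketch only: information_schema.cluster_tidb_trx is available from TiDB v5.1, and the 10-minute threshold is arbitrary.

# Show transactions that have been open for more than 10 minutes
mysql -h <tidb-host> -P 4000 -u root -p -e "
SELECT instance, session_id, start_time, state, db
FROM information_schema.cluster_tidb_trx
WHERE start_time < NOW() - INTERVAL 10 MINUTE
ORDER BY start_time;"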

Solution Steps

  1. Monitoring and Diagnosis: First, monitor the status of TiCDC, especially the comparison between checkpoint time and current time, to detect issues in time.
  2. Restart TiCDC: If it is confirmed that TiCDC is stuck, try restarting the TiCDC service to clear possible temporary faults or state anomalies.
  3. Adjust Configuration: Adjust TiCDC’s configuration according to the specific situation, such as increasing resources and optimizing processing capacity to improve synchronization performance.
  4. Incremental Recovery: If synchronization is interrupted by a large transaction, record the checkpoint-ts of the changefeed that was terminated, use that TSO as the --lastbackupts for a BR incremental backup, and run the backup. After it completes, find the BackupTS in the BR log output, perform an incremental restore, then create a new changefeed starting from BackupTS and delete the old changefeed (see the sketch after this list).
  5. Avoid Incompatible Operations: Avoid using TiDB Lightning physical import mode and BR data recovery on tables synchronized by TiCDC to prevent unknown errors.
  6. Check Data Consistency: After restoring the synchronization task, check the consistency of data between upstream and downstream to ensure data correctness.
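
A rough sketch of step 4 (all addresses, storage URIs, and TSOs are placeholders; exact flags may vary slightly between BR/TiCDC versions):

# 1. Incremental BR backup starting from the old changefeed's checkpoint-ts
br backup full --pd "<pd-host>:2379" --storage "s3://<bucket>/incr" --lastbackupts <checkpoint-ts>

# 2. Restore the incremental backup on the downstream cluster
br restore full --pd "<downstream-pd-host>:2379" --storage "s3://<bucket>/incr"

# 3. Create a new changefeed starting from the BackupTS reported in the BR output,
#    then remove the old changefeed
cdc cli changefeed create --pd=http://<pd-host>:2379 --sink-uri="<sink-uri>" --start-ts=<BackupTS>
cdc cli changefeed remove --pd=http://<pd-host>:2379 --changefeed-id=<old-changefeed-id>
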
| username: Jellybean | Original post link

I suspect it is related to this bug: TiCDC changefeed's resolved-ts is stuck #10157

| username: CharlesCheung96 | Original post link

Could you please check the resolved-ts panel to see whether it is stuck?

| username: Jellybean | Original post link

Yes, the resolved-ts is stuck and hasn't progressed; the timestamp has remained at the moment the synchronization task was created and started.

| username: zhaokede | Original post link

What were the logs before it got stuck?

| username: CharlesCheung96 | Original post link

In 5.x versions of TiCDC, fast and slow tables can affect each other. It is recommended to split the changefeed and try the following (see the sketch below):

  1. Place the table that carries 80% of the traffic in a separate changefeed.
  2. Place the remaining tables in one or more other changefeeds.

Additionally, could you confirm the total number of tables?
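
A minimal sketch of the split (database/table names, the PD address, and the sink URIs are placeholders; the [filter] rules use the standard table-filter syntax):

# Changefeed 1: only the hot table that carries ~80% of the write traffic
cat > hot-table.toml <<'EOF'
[filter]
rules = ['<db>.<hot_table>']
EOF
cdc cli changefeed create --pd=http://<pd-host>:2379 \
  --sink-uri="s3://<bucket>/hot?endpoint=<endpoint>" \
  --changefeed-id="hot-table" --config hot-table.toml

# Changefeed 2: everything else, excluding the hot table
cat > other-tables.toml <<'EOF'
[filter]
rules = ['*.*', '!<db>.<hot_table>']
EOF
cdc cli changefeed create --pd=http://<pd-host>:2379 \
  --sink-uri="s3://<bucket>/other?endpoint=<endpoint>" \
  --changefeed-id="other-tables" --config other-tables.toml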

| username: 像风一样的男子 | Original post link

You're right. I'm also using version 5.4; previously, putting everything into one changefeed caused it to get stuck and stop syncing, and splitting it into multiple changefeeds solved the problem.

| username: Jellybean | Original post link

There are a total of 19 tables.

| username: Jellybean | Original post link

Did it start to get stuck right after launching, or did it get stuck halfway through synchronization?

| username: 像风一样的男子 | Original post link

Initially, there were no issues with synchronization when the data volume was small. However, as the data volume increased, it got stuck. Restarting or deleting and recreating the task would also result in it getting stuck and not synchronizing. Later, splitting it into multiple tasks resolved the issue.

| username: Jellybean | Original post link

I see. With a large data volume it gets stuck right at startup, which looks like the same phenomenon.

I'll try splitting the synchronization tasks first.

| username: TIDB-Learner | Original post link

Was early TiCDC really that bad?

| username: porpoiselxj | Original post link

Stop struggling. If you want to use TiCDC well, I suggest upgrading to at least v7.1.3. Check out my earlier post to see how frustrating it was; once I upgraded to 7.1.3, everything got much better.

| username: Jellybean | Original post link

I want to point out that in version 7.1.3, TiCDC's filtering feature has a pitfall: if you filter DDL (drop/truncate/rename, etc.) on a table, the subsequent DML operations on that table are also filtered out, effectively stopping the synchronization of that table's data.
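
For context, the kind of filter configuration involved would presumably look like the sketch below (event filters were introduced in TiCDC v6.2; database and table names are placeholders):

# Changefeed config with an event filter that ignores certain DDL types on one table
cat > filter-ddl.toml <<'EOF'
[filter]
rules = ['<db>.*']

[[filter.event-filters]]
matcher = ["<db>.<table>"]
ignore-event = ["drop table", "truncate table", "rename table"]
EOF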

This issue is fixed in v7.1.4 and later, so if you are on an affected version you should consider upgrading.

The good news is that the official team will keep focusing on optimizing and improving the CDC feature, so it will become more polished and easier to use.