What pitfalls have you encountered when using TiCDC, and what [best practice] recommendations do you have for online use?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 你在使用TiCDC时踩过什么坑,线上使用有何[最佳实践]建议?

| username: Jellybean

As the title suggests: our online business cluster carries a write load of 10,000+ QPS (insert + delete + update + replace). What pitfalls should be noted in advance when using TiCDC to replicate to S3 or to a downstream TiDB cluster?
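For context, a minimal sketch of what such changefeeds might look like (addresses, credentials, and changefeed IDs below are placeholders; the `--server` flag and the storage sink to S3 require newer TiCDC versions, while older CLIs use `--pd` instead):

```shell
# Changefeed replicating to a downstream TiDB cluster (all connection details are placeholders).
tiup cdc cli changefeed create \
  --server=http://127.0.0.1:8300 \
  --sink-uri="mysql://cdc_user:cdc_pass@downstream-tidb:4000/" \
  --changefeed-id="to-downstream-tidb"

# Changefeed writing row changes to S3 via the storage sink (newer versions only;
# exact URI parameters vary by release, check the docs for your version).
tiup cdc cli changefeed create \
  --server=http://127.0.0.1:8300 \
  --sink-uri="s3://my-bucket/cdc-data?protocol=canal-json" \
  --changefeed-id="to-s3"
```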

| username: 人如其名 | Original post link

If you are replicating to a downstream TiDB cluster (using the downstream cluster as a remote high-availability standby), then from the perspective of data safety (not replication efficiency), I feel there are a few points to note:

  1. For regular data comparison, it is recommended that all tables be clustered-index tables, to avoid cluster jitter when sync_diff_inspector performs the comparison. Refer to the AskTUG thread "sync_diff_inspector may cause cluster performance jitter when comparing upstream and downstream data".
  2. Do not import data with Lightning's local (physical import) mode, to avoid data inconsistency between the clusters, since such writes are not captured by TiCDC. For special batch tables, you can exclude them from replication and import into both clusters separately with local mode, which also reduces TiCDC bandwidth.
  3. TiCDC does not replicate users, user privileges, or passwords; these need to be maintained manually.
  4. TiCDC does not replicate sequence objects or views, so regular checks are needed. Sequence values are not replicated either, so you need a way to periodically sync (and over-allocate) the sequence's current maximum value on the standby; otherwise, after a switchover the sequence may generate values smaller than those already in the table and cause business failures.
  5. SQL plan bindings created in the primary cluster are not replicated to the standby cluster and need to be synced regularly.

For the above points 3, 4, and 5, you can use the something_for_tidb project on Gitee.com for simple comparisons during regular checks.
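As a rough illustration of points 1 and 4, a periodic check along these lines can be scheduled (a sketch only: the host, user, and exact information_schema columns are assumptions to verify against your TiDB version):

```shell
# Hypothetical periodic check against the upstream TiDB (connection details are placeholders).
# 1) List tables that are NOT clustered-index tables.
mysql -h upstream-tidb -P 4000 -u checker -p -e "
  SELECT table_schema, table_name
  FROM information_schema.tables
  WHERE tidb_pk_type = 'NONCLUSTERED'
    AND table_schema NOT IN ('mysql', 'INFORMATION_SCHEMA', 'PERFORMANCE_SCHEMA', 'METRICS_SCHEMA');"

# 2) List sequence definitions so they can be compared between clusters; remember that the
#    sequence *values* still have to be advanced manually on the standby before a switchover.
mysql -h upstream-tidb -P 4000 -u checker -p -e "
  SELECT sequence_schema, sequence_name FROM information_schema.sequences;"
```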

| username: 像风一样的男子 | Original post link

I once hit a problem in production: a cluster's CDC was replicating around 3000+ TPS (mainly inserts) to downstream Kafka. The disk I/O of the downstream Kafka was maxed out, which caused the CDC process to get stuck; it neither replicated nor reported an error. Fortunately, I had written a monitoring script that regularly checks the replication time lag. Even after the downstream Kafka was fixed, the CDC task remained stuck, and it only returned to normal after restarting the CDC service.
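A minimal sketch of such a lag check (the `--server` flag and the JSON field names are assumptions that vary by TiCDC version; older CLIs use `--pd` and a slightly different output format):

```shell
#!/usr/bin/env bash
# Hypothetical lag monitor: list changefeeds and alert when a checkpoint lags too far behind.
THRESHOLD_SECONDS=300
NOW=$(date +%s)

tiup cdc cli changefeed list --server=http://127.0.0.1:8300 2>/dev/null \
  | jq -r '.[] | "\(.id) \(.summary.checkpoint)"' \
  | while read -r id checkpoint; do
      CKPT=$(date -d "$checkpoint" +%s 2>/dev/null) || continue
      LAG=$((NOW - CKPT))
      if [ "$LAG" -gt "$THRESHOLD_SECONDS" ]; then
        echo "ALERT: changefeed $id checkpoint lags ${LAG}s"   # hook your alerting in here
      fi
    done
```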

| username: 我是人间不清醒 | Original post link

Previously there were many issues when using TiCDC to replicate data to MySQL, so we didn’t dare use it in production. Later we hit a case where replicating to TiDB with TiCDC failed because a newly created table on the source side had no primary key. This was resolved by introducing an auditing tool to block such unreasonable operations. It has been running for half a year without any further failures and has been very stable.

| username: dockerfile | Original post link

Similar to my thought process.

Solution:

  1. Monitor and compare the checkpoint time with the current time.
  2. Restart with the command `restart -R cdc`.
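Assuming that command refers to tiup's role filter, the restart would look something like this (the cluster name is a placeholder):

```shell
# Restart only the CDC components of the cluster (-R filters by role).
tiup cluster restart my-cluster -R cdc
```
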
| username: 小龙虾爱大龙虾 | Original post link

  1. By default, TiCDC replicates a single table on only one node. With such a large write volume, one node may not be able to keep up; you can enable the `enable-table-across-nodes` parameter to split a single table across nodes by Region (see the config sketch after this list).
  2. Pay attention to hotspot issues, both upstream and downstream.
  3. Use a higher version of CDC; the higher, the better. :joy:
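A sketch of what point 1 might look like in a changefeed configuration (this scheduler option only exists in newer TiCDC versions, roughly v7.0 and later; the threshold and connection details are placeholders to verify against your release's docs):

```shell
# Hypothetical changefeed config enabling cross-node scheduling for very large tables.
cat > changefeed.toml <<'EOF'
[scheduler]
# Split a single table across multiple TiCDC nodes once it exceeds this many Regions.
enable-table-across-nodes = true
region-threshold = 100000
EOF

tiup cdc cli changefeed create \
  --server=http://127.0.0.1:8300 \
  --sink-uri="mysql://cdc_user:cdc_pass@downstream-tidb:4000/" \
  --config=changefeed.toml
```
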
| username: Billmay表妹 | Original post link

At that time, we will invite the head of TiCDC development to respond to everyone’s needs and questions one by one.

| username: mono | Original post link

I haven’t used it yet. DM is quite handy to use.

| username: 随便改个用户名 | Original post link

Once, a large table in TiDB had a column's type modified, which caused the entire table's data to be rewritten. As a result, the downstream Kafka was completely filled up. :face_with_peeking_eye:

| username: Jellybean | Original post link

This is indeed a bit unexpected. TiCDC replicates by capturing row changes from TiKV. When the upstream modifies a column's type, it triggers a data reorganization, which likely rewrites the KV data for the entire table, so the whole table ends up being replicated again. This issue should be fixed in newer versions.

| username: Raymond | Original post link

In certain scenarios, TiCDC may receive DDL statements out of order, which causes errors in version 6.5.

| username: kkpeter | Original post link

  1. If a changefeed is paused for a long time and a large amount of data is updated in the meantime, CDC consumes a lot of memory pulling the backlog and cannot catch the data up in time. I think pump/drainer is more reliable in this regard.

  2. CDC's safe-mode mechanism is also not as well thought out as pump/drainer's.

| username: kkpeter | Original post link

DM and CDC are not the same thing.

| username: Jellybean | Original post link

In practical use we have also encountered this issue. If a TiCDC changefeed is lagging and a restart occurs (whether from manually adjusting parameters or from the task restarting automatically), there will be a large instantaneous CPU spike. For example, in one earlier case, after the changefeed had lagged for over 20 hours, manually restarting it caused an instantaneous CPU consumption of 6000% (out of a total of 8000%). This can bring additional problems, especially when TiCDC is co-located with other nodes, where it might impact the whole cluster.

This also shows that in extreme scenarios, especially when the delay approaches 24 hours (the GC safe point TTL), restarting the changefeed causes a significant instantaneous CPU spike, because after the restart TiCDC has to re-fetch and reprocess all the changes that accumulated during the lag.

The longer the delay, the more resources the node consumes when the changefeed restarts. Therefore, until this problem is thoroughly addressed by the official team, it is crucial to deal with TiCDC replication lag promptly.
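Since the 24-hour figure corresponds to TiCDC's default `gc-ttl` (86400 seconds), it also helps to keep an eye on the cluster's GC state while a changefeed is lagging; a rough sketch (connection details are placeholders):

```shell
# Check the GC settings/safe point that bound how long a lagging changefeed can survive.
mysql -h upstream-tidb -P 4000 -u checker -p -e "
  SELECT variable_name, variable_value
  FROM mysql.tidb
  WHERE variable_name IN ('tikv_gc_safe_point', 'tikv_gc_life_time');"
```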

| username: Jellybean | Original post link

Lobster Master, may I ask: if the cluster is v5.4.0, can we use a higher version of the TiCDC component, for example replacing it with v7.5.x cdc through a patch?

| username: 小龙虾爱大龙虾 | Original post link

Skipping major versions is not recommended, but patching across minor versions within the same series is fine. For a TiDB v5.4.x cluster, you can individually patch cdc to the latest v5.4.x release.
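For reference, such a patch might look like the following (the cluster name and package path are placeholders; the package must contain a cdc binary of the matching series):

```shell
# Hot-patch only the cdc components of the cluster with a v5.4.x build.
tiup cluster patch my-cluster ./cdc-v5.4.3-linux-amd64.tar.gz -R cdc --overwrite
```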

| username: hzc989 | Original post link

We ran into quite a few issues with DM, whereas CDC hasn't given us many major problems. However, there is one particular need I'd like to mention :rofl:

Consider adding native support for traffic compression. In our scenario we mainly use TiCDC for cross-region disaster-recovery replication, and the traffic cost is high; currently we put a ProxySQL in front of it to handle traffic compression.

| username: TIDB-Learner | Original post link

We will be using TiCDC in our project in the near future. Check out the TiCDC source code analysis series on Bilibili.