Is there a possibility of data loss in TiCDC?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: ticdc是否有丢数据的可能性

| username: daijunjian

[TiDB Usage Environment] Production Environment

[TiDB Version] 5.1.4

[Encountered Problem: Phenomenon and Impact]

  1. Received the following alert, which lasted for a few minutes:
    ticdc_memory_abnormal
    Status: firing, Level: warning
    TiCDC heap memory usage is over 10 GB
    cluster: tidb-prod, instance: 127.0.0.1:8300, values: 1.6023378784e+10, first alert: 2023-09-02T14:03:58

  2. The target end is Kafka

  3. I would like to ask whether there is any possibility of data loss in the TiCDC-to-Kafka link. What are the common causes?

| username: 像风一样的男子 | Original post link

Nothing in this world is absolute.

| username: Fly-bird | Original post link

This possibility is relatively high.

| username: daijunjian | Original post link

Let me add the points I am suspicious about:

  1. When the data volume is large, will TiCDC be unable to handle it and lose data?
  2. When TiCDC writes to Kafka, is the successful write determined by receiving a Kafka ack?
| username: TiDBer_小阿飞 | Original post link

Pfft… Telling the blunt truth :smile:

| username: RenlySir | Original post link

  1. When TiCDC cannot keep up with the load, it only results in slower writes or errors; it will not lose data.
  2. TiCDC v7 supports single-row data validation.
| username: daijunjian | Original post link

Thank you, thank you!

  1. If TiCDC crashes, it won’t lose data, right?
  2. The version I am currently using is V5.
  3. When TiCDC writes to Kafka, is the success determined by receiving a Kafka ack? Or can the “write success determination” strategy be configured?
| username: RenlySir | Original post link

  1. TiCDC can resume replication from its checkpoint after it crashes and recovers (see the sketch after this list).
  2. You can check the screenshot below for information about the ACK setting.
  3. If your requirements for TiCDC are high, it is recommended to upgrade to TiDB v7.1.x; both TiDB and TiCDC will offer a noticeably different experience.
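
For reference, the checkpoint that a changefeed would resume from can be inspected with the cdc cli tool. This is only a minimal sketch; the PD address and changefeed ID below are placeholders, and the exact output fields vary slightly between versions:

```shell
# Query the changefeed status. checkpoint-ts / checkpoint-time indicate the
# point up to which data has been handed off to the downstream (Kafka here);
# after a TiCDC restart, replication resumes from this checkpoint.
cdc cli changefeed query \
  --pd=http://<pd-host>:2379 \
  --changefeed-id=<changefeed-id>
```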
| username: 像风一样的男子 | Original post link

Are you using a single-node CDC? It’s recommended to have three nodes for high availability.
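
To check how many TiCDC nodes are currently serving the cluster, the capture list can be queried; a minimal sketch with a placeholder PD address:

```shell
# List the TiCDC capture nodes registered with PD. A single entry means there
# is no node to fail over to if that instance goes down.
cdc cli capture list --pd=http://<pd-host>:2379
```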

| username: ti-tiger | Original post link

  • There are several common reasons why data loss might occur in the TiCDC-to-Kafka link:
    • Insufficient memory or OOM on the TiCDC side, causing the TiCDC process to exit abnormally or restart. In that case, TiCDC tries to continue syncing from the last checkpoint, but if the checkpoint interval is too long, or the checkpoint information is lost or corrupted, data loss may occur.
    • Improper configuration or abnormal behavior on the Kafka side, preventing TiCDC from writing data normally. In that case, TiCDC decides whether to resend the data or exit with an error based on Kafka’s required-acks parameter and its retry strategy. If required-acks is set to 0 (not waiting for any response), or if the retries are exhausted and still fail, data loss may occur (see the sketch after this list).
    • Network failures or delays, causing communication interruptions or timeouts between TiCDC and Kafka, with consequences similar to the scenarios above.
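
For illustration, here is a hedged sketch of creating a Kafka changefeed. The PD address, broker, topic, and changefeed ID are placeholders, and the required-acks query parameter is assumed to exist only in newer TiCDC versions (as discussed below, v5.x does not appear to expose it):

```shell
# Sketch: create a changefeed replicating to Kafka.
# required-acks=-1 asks Kafka to acknowledge only after all in-sync replicas
# have the message (strongest durability); on v5.x this parameter is assumed
# to be unavailable, so the producer falls back to the built-in default.
cdc cli changefeed create \
  --pd=http://<pd-host>:2379 \
  --changefeed-id=kafka-task \
  --sink-uri="kafka://<kafka-host>:9092/<topic>?protocol=canal-json&kafka-version=2.4.0&partition-num=3&replication-factor=2&max-message-bytes=1048576&required-acks=-1"
```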
| username: daijunjian | Original post link

Is the required-acks parameter only available in newer versions of TiCDC? It seems that v5.2 does not have it. Could you help confirm what the ack strategy is in v5.2?

| username: 路在何chu | Original post link

Losing data? That’s unlikely, unless it’s due to your own operational issues or a single-node CDC failure. As long as TSO is fine, data won’t be lost.

| username: RenlySir | Original post link

The default value of acks is 1.
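
For context (this is general Kafka producer semantics, not something specific to this TiCDC version): with acks=1 the partition leader acknowledges the write before followers have replicated it, so the producer can be told “success” and the message can still be lost if that leader fails right afterwards, whereas acks=-1/all waits for all in-sync replicas. How much protection the Kafka side provides therefore also depends on the topic’s replication settings, which can be checked with the standard Kafka tools (hosts and topic names below are placeholders):

```shell
# Show the topic's replication factor and current in-sync replicas.
kafka-topics.sh --describe --topic <topic> --bootstrap-server <kafka-host>:9092
# Show topic-level overrides such as min.insync.replicas.
kafka-configs.sh --describe --entity-type topics --entity-name <topic> --bootstrap-server <kafka-host>:9092
```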