Data/Event Loss Caused by Changefeed Restart During TiCDC Sink to Kafka

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiCDC Sink时Changefeed重启,导致Data/Event 丢失

| username: TiDBer_PYmtzTZg

【TiDB Usage Environment】Production Environment
【TiDB Version】5.1.4
【TiCDC Version】5.1.4
【Reproduction Path】Occasionally occurs in the production environment, unable to reproduce
【Encountered Problem: Phenomenon and Impact】

  • Phenomenon and Attempts
    The TiCDC cluster writes to Kafka in canal-json format; users reported data loss.

The data loss is concentrated in a few specific clusters. Each incident is accompanied by a changefeed “restart” (inferred from the metrics being reset), by replication latency (ranging from a few minutes to several hours, usually recovering on its own eventually), and, with high probability, by region scanning (although in the example below no region scanning was observed).


During the time period shown in the figure, 74 rows were successfully delivered, while the puller and sink actually received 76. The downstream never received the missing 2 rows, and they were not present in the Kafka topic.

Looking only at the logs around the latency, there are a large number of “synchronize is taking too long, report a bug” messages, but no other special error logs.
The remaining logs look very similar to this issue: TiCDC replication stuck, reporting: GetSnapshot is taking too long, DDL puller stuck? - TiDB Q&A community
However, upgrading only the secondary TiKV cluster (the one TiCDC connects to) to 5.4.1 did not improve the latency.

Based on the metrics of TiCDC’s internal puller and sink, plus counting statistics I added myself, the data reached the puller and sink but was never successfully delivered to Kafka (and no sarama errors were observed either). As a result, after the changefeed crashed and restarted, it did not resend that data, so at-least-once delivery was not preserved. The proportion of lost rows is small, and they are concentrated within a particular latency period.
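
For reference, a minimal sketch of the kind of ad-hoc counting added here — not TiCDC’s real code; the metric names and call sites are hypothetical — registering Prometheus counters at the puller and at the sink so the two stages can be compared per changefeed and table:

```go
// Hypothetical debug counters for comparing event flow between puller and sink.
package debugcount

import "github.com/prometheus/client_golang/prometheus"

var (
	pullerRows = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "ticdc_debug_puller_rows_total",
		Help: "Rows observed at the puller, per changefeed and table.",
	}, []string{"changefeed", "table"})

	sinkRows = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "ticdc_debug_sink_rows_total",
		Help: "Rows handed to the MQ sink, per changefeed and table.",
	}, []string{"changefeed", "table"})
)

func init() {
	prometheus.MustRegister(pullerRows, sinkRows)
}

// CountPullerRow is called where the puller emits a row event.
func CountPullerRow(changefeed, table string) {
	pullerRows.WithLabelValues(changefeed, table).Inc()
}

// CountSinkRow is called where the sink accepts a row event for delivery.
func CountSinkRow(changefeed, table string) {
	sinkRows.WithLabelValues(changefeed, table).Inc()
}
```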

  • TiCDC is deployed on Kubernetes, and the problematic clusters have adequate resources.
  • Proactively restarting the cluster does not seem to cause data loss.
  • No resource pressure or error logs were observed on the TiKV side, and most of the time there were no significant region changes.

【Resource Configuration】Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
【Attachments: Screenshots/Logs/Monitoring】

| username: WalterWj | Original post link

Restarting the task should not cause data loss. A TiCDC task keeps checkpoint records, and logically a restart should resume from the checkpoint. It might re-deliver some data, with correctness ultimately relying on downstream idempotency, but actual loss would be a bit strange :thinking:.

Is it just a delay and not data loss?
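
To illustrate the idempotency point above, a hedged sketch (not TiCDC or Hudi code; all types and names here are made up): if the downstream applies events keyed by primary key and skips anything whose commit ts is not newer than what it already holds, events re-delivered after a checkpoint-based restart are harmless:

```go
// Minimal illustration of an idempotent downstream apply step.
package apply

type RowEvent struct {
	Key      string // primary key of the row
	CommitTs uint64 // TiDB commit timestamp carried by the event
	Value    []byte // encoded row, e.g. a canal-json payload
}

type Store interface {
	// Get returns the commit ts of the currently stored version, or 0 if absent.
	Get(key string) uint64
	Put(key string, commitTs uint64, value []byte)
}

// Apply is safe to call repeatedly with the same event: re-delivered
// (duplicate) events are no-ops, so at-least-once delivery from a restarted
// changefeed does not corrupt the downstream table.
func Apply(s Store, e RowEvent) {
	if e.CommitTs <= s.Get(e.Key) {
		return // already applied, or a newer version exists
	}
	s.Put(e.Key, e.CommitTs, e.Value)
}
```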

| username: TiDBer_PYmtzTZg | Original post link

Yes, this is very confusing. It is not a latency issue; sometimes users report that data changed a month earlier is missing (for example, problems discovered during end-of-month reporting and settlement).

| username: WalterWj | Original post link

If that’s the case, it will be more difficult to troubleshoot. Could it be a bug? Are you sure no one downstream is connected and executing deletions?

| username: xfworld | Original post link

You can check if the bug is caused by DDL.
[screenshot attachment from the original post]

| username: TiDBer_PYmtzTZg | Original post link

The downstream is Hive on Hudi, which cannot be changed.
After resetting the task, that is, redoing the historical data synchronization (bootstrap) with TiSpark, the data was repaired.

| username: TiDBer_PYmtzTZg | Original post link

Thank you, noted. The reason this fix was not merged before is that it seemed to address a different issue: we thought the problem we observed was events being lost, whereas the issue it fixes is updates being interpreted as deletes, so that downstream data updates made the data merely “appear” lost.

Now it seems the events may not actually have been lost after all; the symptoms do look quite similar.

| username: xfworld | Original post link

:rofl: This kind of issue isn’t very fun, but Kafka events can be replayed as long as the data hasn’t been cleaned up. You can verify it.
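
A minimal sketch of such a verification, assuming sarama (the same client TiCDC’s MQ sink uses); the broker address, topic name, partition, and the payload fragment being searched for are placeholders:

```go
// Replay a Kafka topic from the oldest retained offset and look for a row.
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/Shopify/sarama"
)

func main() {
	consumer, err := sarama.NewConsumer([]string{"kafka:9092"}, sarama.NewConfig())
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()

	// Single-partition example; iterate over consumer.Partitions(topic) for more.
	pc, err := consumer.ConsumePartition("ticdc-canal-json-topic", 0, sarama.OffsetOldest)
	if err != nil {
		log.Fatal(err)
	}
	defer pc.Close()

	const needle = `"id":12345` // fragment of the canal-json payload we expect to find
	for msg := range pc.Messages() {
		if strings.Contains(string(msg.Value), needle) {
			fmt.Printf("found at offset %d, produced at %s\n", msg.Offset, msg.Timestamp)
		}
		if msg.Offset >= pc.HighWaterMarkOffset()-1 {
			break // reached the end of what is currently retained in the partition
		}
	}
}
```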

| username: TiDBer_PYmtzTZg | Original post link

The root cause was a change in our internal build that broke things. To work around a sink performance problem, the code that makes the checkpoint loop wait for data-delivery confirmation at cdc/sink/mq.go:194 had been removed. As a result, the checkpointTs synced to PD could become larger than the resolvedTs actually flushed by the sink. After the task was interrupted and restarted, the data had in reality only been sent up to ts1, while checkpointTs = startTs > ts1, so the data between ts1 and startTs was lost.
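
For clarity, a simplified sketch of the guarantee that was removed — this is an illustration, not the actual code at cdc/sink/mq.go:194: the sink’s flush call must block until the producer has confirmed delivery of everything up to resolvedTs before that ts may be reported as the checkpoint.

```go
// Illustration of "checkpoint waits for confirmed delivery" in an MQ sink.
package sink

import (
	"context"
	"sync/atomic"
)

type mqSink struct {
	// highest commit ts confirmed as delivered (acked) by the Kafka producer
	flushedResolvedTs uint64
	// signaled whenever flushedResolvedTs advances
	flushedNotifier chan struct{}
}

// FlushRowChangedEvents returns the ts that is safe to record as checkpointTs.
// The wait loop below is the part whose removal lets checkpointTs run ahead
// of the data actually written to Kafka.
func (s *mqSink) FlushRowChangedEvents(ctx context.Context, resolvedTs uint64) (uint64, error) {
	for atomic.LoadUint64(&s.flushedResolvedTs) < resolvedTs {
		select {
		case <-ctx.Done():
			return atomic.LoadUint64(&s.flushedResolvedTs), ctx.Err()
		case <-s.flushedNotifier:
			// the producer acked more messages; re-check the flushed ts
		}
	}
	// Only now is it safe to persist resolvedTs as the changefeed checkpoint.
	return resolvedTs, nil
}
```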

There are generally a few approaches to this type of problem:

  1. Check whether downstream delivery is confirmed, for example via Kafka’s ack mechanism (see the sketch after this list).
  2. Check whether the event counts in the internal puller and sink are consistent.
  3. When a cdc task restarts abnormally, verify whether the checkpoint value before the restart (checkpointTs) and after it (used as start-ts) truly reflect how far the data was actually delivered.
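
For point 1, a minimal sarama producer sketch showing the confirmation side (broker address, topic, and payload are placeholders): with RequiredAcks = WaitForAll and a SyncProducer, each send either returns an error or is known to have been accepted by Kafka, so “handed to the sink” and “delivered to Kafka” cannot silently diverge.

```go
// Producer configuration that makes delivery confirmation explicit.
package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Producer.RequiredAcks = sarama.WaitForAll // wait for all in-sync replicas
	cfg.Producer.Return.Successes = true          // required by SyncProducer
	cfg.Producer.Retry.Max = 3

	producer, err := sarama.NewSyncProducer([]string{"kafka:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	partition, offset, err := producer.SendMessage(&sarama.ProducerMessage{
		Topic: "ticdc-canal-json-topic",
		Value: sarama.StringEncoder(`{"type":"INSERT","data":[{"id":"1"}]}`),
	})
	if err != nil {
		// delivery not confirmed; do not advance any checkpoint past this row
		log.Fatal(err)
	}
	log.Printf("confirmed at partition %d offset %d", partition, offset)
}
```
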
| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.