Does TiCDC Update Behavior Affect Canal-JSON?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiCDC Update 行为变更是否影响 Canal-JSON

| username: ealam_小羽

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] 7.5.1
[Reproduction Path] None
[Encountered Problem: Problem Phenomenon and Impact]
[Resource Configuration]
[Attachment: Screenshot/Log/Monitoring]
Question: According to the official documentation, in some newer versions TiCDC splits certain UPDATE events into DELETE + INSERT events. Does this affect the Canal-JSON protocol?
Background: We are planning to upgrade TiDB from 5.x to 7.5.1.
The business uses Canal-JSON to obtain the before and after images of changed data (recording a change log for certain fields of a row). If v7.5.1 splits one UPDATE into a DELETE + INSERT pair in Canal-JSON, the business would mistakenly conclude that some fields were newly inserted rather than changed.
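
To make the concern concrete, here is a minimal sketch of the same row change delivered as a single Canal-JSON UPDATE versus a split DELETE + INSERT pair. The payloads are hypothetical and abbreviated; the field names (`type`, `data`, `old`) follow the Canal-JSON format.

```python
import json

# One UPDATE event: "data" holds the new row, "old" holds the previous
# values of the changed columns only (Canal-JSON format, values made up).
update_event = json.loads("""
{"database": "shop", "table": "orders", "type": "UPDATE", "isDdl": false,
 "data": [{"id": "1", "status": "shipped"}],
 "old":  [{"status": "pending"}]}
""")

# The same change after splitting: two events, each without "old".
delete_event = json.loads("""
{"database": "shop", "table": "orders", "type": "DELETE", "isDdl": false,
 "data": [{"id": "1", "status": "pending"}], "old": null}
""")
insert_event = json.loads("""
{"database": "shop", "table": "orders", "type": "INSERT", "isDdl": false,
 "data": [{"id": "1", "status": "shipped"}], "old": null}
""")

def changed_fields(event):
    """Field-level change log, computed from a single event."""
    if event["type"] != "UPDATE" or not event.get("old"):
        return None  # no before image in this event
    return {col: (old, event["data"][0][col])
            for old_row in event["old"]
            for col, old in old_row.items()}

print(changed_fields(update_event))  # {'status': ('pending', 'shipped')}
print(changed_fields(insert_event))  # None: looks like a brand-new row
```

The last line is exactly the failure mode described above: the insert half of a split update is indistinguishable from a genuine insert.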

| username: TIDB-Learner | Original post link

Splitting updates into deletes and inserts could have an impact if you use the CSV or Avro protocols, which only output new values. My understanding is that the split exists precisely because those protocols cannot carry both the before and after images of an update in a single event.
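
If only new values are available, one workaround on the consumer side is to keep a last-seen row image per primary key and diff against that, instead of relying on an `old` field in the event. A minimal sketch (all names are hypothetical):

```python
# Cache of last-seen row images: (database, table, pk) -> row dict.
last_seen = {}

def diff_against_cache(db, table, pk, new_row):
    """Compute changed fields from our own history of new values."""
    key = (db, table, pk)
    before = last_seen.get(key)
    last_seen[key] = new_row
    if before is None:
        return None  # first sighting: indistinguishable from a true insert
    return {col: (before.get(col), val)
            for col, val in new_row.items()
            if before.get(col) != val}
```

The cache has to survive restarts and see every event in order, which is why a protocol that carries the before image in the event itself is much simpler to consume.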

| username: zhh_912 | Original post link

Changes in TiCDC’s Update behavior may affect Canal-JSON output. TiCDC (TiDB Change Data Capture) is TiDB’s change data capture tool, while Canal-JSON is the JSON message format of Canal, a change capture tool for MySQL that relies on MySQL’s binlog to track data changes; TiCDC can emit its events in this format.

If the change in TiCDC’s Update behavior means the emitted events no longer follow the message shape that Canal-JSON consumers expect, those consumers may not parse and recognize the changes correctly. This could cause downstream consumers to malfunction or fail to capture TiDB’s data changes accurately.

Solutions:

  1. Check TiCDC’s release notes and changelogs to understand the specific behavior change.
  2. If TiCDC has changed its output format, check whether your Canal-JSON consumers can parse the new format.
  3. If the consumers do not support the new format, consider upgrading them or finding an alternative tool that can consume TiCDC’s change events.
  4. If possible, configure TiCDC to use an output format compatible with your existing Canal/MySQL tooling to ensure compatibility with the Canal-JSON consumers.
  5. If none of the above is feasible, consider writing a small tool yourself that parses TiCDC’s output and converts it into the shape your Canal-JSON consumers expect (a sketch of such a converter follows this list).
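
For option 5, one possible shape is a small converter that buffers a DELETE and, if the next row event is an INSERT for the same table and primary key, re-emits the pair as a synthetic UPDATE with an `old` field. This is a hedged sketch, not TiCDC code; it assumes the split pair arrives adjacently and in order (which holds when both halves share the same unchanged primary key and therefore the same Kafka partition):

```python
def merge_split_updates(events, pk_col="id"):
    """Merge a DELETE immediately followed by an INSERT on the same
    (database, table, primary key) back into one UPDATE event."""
    pending = None  # (key, delete_event) waiting for a matching INSERT
    for ev in events:
        if not ev.get("data"):        # DDL and other non-row events
            if pending:
                yield pending[1]
                pending = None
            yield ev
            continue
        row = ev["data"][0]
        key = (ev["database"], ev["table"], row[pk_col])
        if ev["type"] == "DELETE":
            if pending:               # the previous DELETE was a real delete
                yield pending[1]
            pending = (key, ev)
            continue
        if ev["type"] == "INSERT" and pending and pending[0] == key:
            old_row = pending[1]["data"][0]
            pending = None
            yield dict(ev, type="UPDATE",
                       old=[{c: v for c, v in old_row.items()
                             if row.get(c) != v}])
            continue
        if pending:                   # anything else flushes the buffer
            yield pending[1]
            pending = None
        yield ev
    if pending:
        yield pending[1]
```

This single-slot buffer assumes the two halves of a split are adjacent; if other rows’ events can interleave, you would need a per-key buffer and an explicit flush policy.
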
| username: ealam_小羽 | Original post link

We only use the canal-json protocol. In theory, the decision could be made at a finer granularity, based on the protocol rather than the sink type (MySQL vs. MQ). Looking at the code now, though, the judgment seems to be made only by sink type.
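
Expressed as a sketch, the suggestion looks like this (illustrative pseudologic in Python, not TiCDC’s actual Go code; the set of protocols that carry old values is an assumption):

```python
# Assumption: these protocols can carry a before image in a single event.
PROTOCOLS_WITH_OLD_VALUE = {"canal-json", "open-protocol", "maxwell"}

def should_split(sink_type, pk_changed, uk_changed):
    # Current behavior as read from the code: for MQ sinks, an update that
    # modifies the primary key or a unique index value is split.
    return sink_type == "mq" and (pk_changed or uk_changed)

def should_split_proposed(sink_type, protocol, pk_changed, uk_changed):
    # Finer-grained variant: also skip the split when the protocol can
    # express before-and-after images in one event.
    return (sink_type == "mq"
            and (pk_changed or uk_changed)
            and protocol not in PROTOCOLS_WITH_OLD_VALUE)
```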

| username: ealam_小羽 | Original post link

Originally, I thought there was a priority rule: if the table has a primary key, an update that only changes a unique index would not be split. But looking at the code, it seems that updates changing either the primary key or a unique index are both split.

Kafka’s splitting can be seen as addressing a second problem: when the message key changes, the old and new values of a row would land in two different partitions. However, the key defaults to the primary key, so if a primary key exists and does not change, the events for that row should still go to one partition rather than being spread across two.
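
The partition argument can be illustrated with a sketch (the hash function and partition count here are arbitrary stand-ins; real Kafka clients use murmur2 and TiCDC has its own dispatchers):

```python
import hashlib

NUM_PARTITIONS = 6  # arbitrary, for illustration

def partition_for(key: str) -> int:
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# Message key = primary key: as long as id stays 1, every event for the
# row maps to the same partition, split or not.
print(partition_for("orders:1") == partition_for("orders:1"))  # True

# Only a change of the key itself can move events to another partition,
# which is the ordering problem the split is meant to address.
print(partition_for("orders:1"), partition_for("orders:2"))
```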

Therefore, in theory, for an MQ sink on a table with a primary key, an update that only changes a unique index could be left unsplit.

In normal business scenarios, primary keys are usually self-generated distributed IDs or TiDB’s AUTO_RANDOM IDs; unique index values may change, but primary key values would not.