Update operations are parsed as delete + insert in CDC (upgraded to v7.1.3)

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 更新操作在cdc中被解析为删除+新增(升级v7.1.3)

| username: porpoiselxj

Bug Report
Clearly and accurately describe the issue you found. Providing any steps to reproduce the problem can help the development team address the issue promptly.

[TiDB Version] v7.1.3
[Impact of the Bug]
The downstream data derived from the CDC incremental logs shows a large number of delete-then-insert operations, causing the downstream business update volume to grow rapidly. It is currently unclear whether this is an issue with CDC incremental log capture or whether the database is actually performing delete-then-insert operations. At least for CDC to Kafka and CDC to storage, it has been confirmed that some update operations are indeed broken down into delete + insert.

Correction: the issue can also be reproduced with manual updates. Updating certain fields reproduces it, while updating others does not.
The final test result: updating unique-key fields triggers the issue.

[Possible Steps to Reproduce the Issue]
Upgrade from version v7.1.1 to v7.1.3

[Observed Unexpected Behavior]
Immediately after the upgrade, many table update operations were pushed to Kafka by CDC as delete + insert in the incremental logs. Not all records exhibit this phenomenon, and for some records that do, updating them directly with SQL statements does not reproduce the issue.

[Expected Behavior]
Normal update operations should appear as a single update in the CDC incremental logs.

[Related Components and Specific Versions]
Just upgraded to v7.1.3

[Other Background Information or Screenshots]
None

| username: Jellybean | Original post link

First of all, this is the default behavior; it is not a defect, but a matter of how the feature is used.

To ensure that downstream writes are idempotent, the SQL written to the downstream is rewritten.

You can achieve what you want by disabling the safe mode parameter.

| username: Jellybean | Original post link

During previous tests, it was found that the TiCDC tool in version v5.3.0 would convert upstream Replace operations into downstream Delete + Replace operations, leading to write amplification. In the current version, after observing the upstream DML through the TiCDC tool, the following conversions occur:

  • 1 delete → 1 delete
  • 1 insert → 1 delete + 1 replace
  • 1 update → 1 delete + 1 replace
  • 1 replace → 1 delete + 1 replace

TiCDC’s MySQL sink can use safe_mode, which follows the conversion rules above. This affects performance but ensures that replication is re-entrant. safe_mode can be turned off in the MySQL sink configuration (it has defaulted to off since TiCDC v6.1.3).

To ensure the idempotency of business data synchronization, it is recommended to keep the default safe_mode configuration. The original poster can adjust it as needed.
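To make the conversions above concrete, here is a rough sketch of how an upstream update might be replayed by a safe-mode MySQL/TiDB sink (illustrative only; table t is hypothetical and the exact statements TiCDC generates may differ):

-- upstream statement on a table t(a int primary key, b int):
update t set b = 3 where a = 1;
-- roughly what a safe-mode sink writes downstream, following the conversion list above:
delete from t where a = 1;
replace into t (a, b) values (1, 3);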

| username: porpoiselxj | Original post link

The current phenomenon is that if the primary key or a unique-key field is updated, the update operation is split into delete + insert. However, I see in the documentation that safe mode is only effective when the downstream is MySQL or TiDB, and my downstream is Kafka.

| username: 有猫万事足 | Original post link

:joy:

Your explanation is a bit convoluted.
Can I understand it as: for some tables, what CDC sends downstream is an update, while for others it is a delete + insert?
And for the tables that show delete + insert, updating them directly can still produce a plain update. Is that what you mean?

I think you can take a look at this.

The Avro and CSV protocols do not emit the old value of an update event to the downstream, so if the primary key is updated, the old row cannot be deleted from the downstream data system.

By splitting the update event into a delete event and an insert event, the old data can be deleted first and the new data inserted afterwards.

It looks like an enhancement made to be compatible with certain protocols.
I have to admit that I am not familiar with TiCDC. This is a possibility.

| username: porpoiselxj | Original post link

That looks fairly close, but I’m using the open-protocol format.

| username: 小龙虾爱大龙虾 | Original post link

When the primary key is updated upstream, it should indeed become delete + insert downstream. I think that’s quite reasonable :flushed:

| username: porpoiselxj | Original post link

To some extent it is reasonable, but the default behavior should not change during an upgrade without being mentioned in the upgrade documentation. Alternatively, providing a switch to turn it off would be fine; otherwise it impacts existing systems.

| username: 有猫万事足 | Original post link

I think we still need to confirm whether this is related to the primary key being updated.

If it is indeed primary-key update → delete + insert, while a non-primary-key update → update, then that is quite normal.
The concern is if a non-primary-key update is also turned into delete + insert.

As long as it’s not a bug, we can discuss other issues after the new year.
At this point, let’s avoid unnecessary stress. :joy:

| username: porpoiselxj | Original post link

After repeated testing, it is indeed related to updating unique-key columns. Updating other columns works normally.
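For reference, a minimal reproduction sketch along these lines (table and column names are hypothetical; the expected results simply restate the behavior observed in this thread):

create table t1 (id int primary key, uk varchar(20) unique, val int);
insert into t1 values (1, 'a', 10);
-- updating a non-indexed column: expected to arrive downstream as a single update event
update t1 set val = 11 where id = 1;
-- updating the unique-key column: expected to arrive downstream as a delete event followed by an insert event
update t1 set uk = 'b' where id = 1;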

| username: TIDB-Learner | Original post link

First delete, then replace? I’d like to know why it’s done this way.

| username: zhanggame1 | Original post link

Deleting and then inserting is converted into a replace statement, right?

| username: porpoiselxj | Original post link

I have the Kafka messages here; the update became two messages: first a delete, then an insert.

| username: kkpeter | Original post link

CDC has safe-mode enabled by default; you can turn it off.

| username: porpoiselxj | Original post link

Could you please tell me where the safe-mode for cdc to kafka is configured? I have been looking through the documentation for a long time and haven’t found it.

| username: LingJin | Original post link

Hello everyone, I am the developer of this feature, and I am very pleased to see the discussion on this issue.

  • The CDC to Kafka link does not have a safe-mode parameter.
  • When using the Kafka sink, the split logic for update events is: for a single update event, if the primary key or a unique key is modified, the event is split into a delete + insert pair.

Regarding the splitting of update events, what users observe is that the output content has changed. On the surface, this behavior change is for compatibility with certain protocols, such as Avro and CSV, but it also involves the correctness of other features, such as the index-value dispatcher.

Consider the following SQL statements:

create table t (a int primary key, b int);
insert into t values (1, 2);
update t set a = 2 where a = 1;

With the index-value dispatcher, the changes produced by the two DML statements above are distributed across a topic with multiple partitions.
The first insert message is routed based on primary key = 1; assume it is sent to partition-1.
The update statement, if not split, would be routed based on the new primary key = 2 and sent to partition-2.

A possible scenario is that the consumer first consumes the data from partition-2, at which point there may not be a row with primary key = 1 in the downstream system, which could lead to an error.

After splitting, the update statement becomes delete 1, insert 2.
The data flow of partition-1 has insert-1, delete-1 in sequence.
The data of partition-2 has insert-2.

At this point, regardless of the consumption progress of the consumers in the consumer group, the events corresponding to each key can be consumed in the correct order. This is the core motivation for making the above changes.

Regarding the splitting of key-update events, we referred to the Debezium connector for MySQL documentation: Debezium connector for MySQL :: Debezium Documentation

As you can see, they explicitly distinguish update events.

| username: porpoiselxj | Original post link

Hello!

We have no objection to the feature itself and can understand the reasons for it. However, changing the behavior of an existing version without adding any parameter to control it, and without any relevant explanation in the release notes, is not good practice.

Every feature has its use case. For CDC to Kafka, our use case is to ensure that a single table is guaranteed to be in a single partition to strictly maintain the data order of the single table. For our application scenario, there is no need to split an update into delete+insert in special cases.

If this was the design from the beginning, we could have adapted. But suddenly changing an existing system without any explanation makes our downstream applications very passive.

Currently, all the technical leaders in our company have been convened to assess the impact of this incident and to determine whether to roll back the TiDB version. Of course, we also have our own issues, such as failing to detect this change in the development and testing stages.

| username: Jellybean | Original post link

Thank you to the development experts for coming out to answer questions and clear up confusion.

| username: LingJin | Original post link

This PR involves a relatively large number of changes at once, and the release note only covers part of the content.
We will update the documentation later to explain the situation. We will also pay attention to such changes in behavior in the future.

With single-table-to-single-partition dispatching, one issue is that CDC throughput is limited and the consumption rate of the Kafka consumers cannot be scaled horizontally. If the traffic of a single table is very large, the index-value dispatcher is a better choice and can significantly improve throughput.
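For reference, the dispatching strategy is chosen in the changefeed configuration file; a sketch along these lines (the table names below are placeholders, and the exact syntax should be checked against the TiCDC documentation for your version):

[sink]
dispatchers = [
    # hypothetical rule: route a very large table by index value so its events spread across partitions
    {matcher = ['test.big_table'], partition = "index-value"},
    # hypothetical rule: keep these tables on one partition per table to preserve per-table ordering
    {matcher = ['test.orders', 'test.users'], partition = "table"},
]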

For your scenario, not splitting will not cause problems. However, for more general scenarios, it is necessary.

We apologize for the inconvenience caused.

| username: kkpeter | Original post link

cli changefeed create --pd=http://%s:%d --sink-uri="mysql://%s:%s@%s:%d/?safe-mode=%s"
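For example, with placeholder PD address and MySQL credentials (and note that, as the developer explained above, safe-mode is a parameter of the MySQL/TiDB sink; the Kafka sink does not have it):

cdc cli changefeed create --pd=http://127.0.0.1:2379 --sink-uri="mysql://root:password@127.0.0.1:3306/?safe-mode=false"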