Is there an order for synchronization between different tables in TiCDC? Is it necessary to use transactions to ensure it?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiCDC 不同表之间同步是有顺序的吗?一定要用事务才能保证吗?

| username: breakyang

I would like to ask everyone a question:
update table1 set status = 'start'
start Spark to insert a large amount of data, something like insert xx into table2
update table1 set status = 'end'

While TiCDC is replicating data from the primary TiDB cluster to the secondary TiDB cluster, if an exception occurs, is it possible that table1's status = 'end' has already been replicated to the secondary cluster while table2's data has not been fully transferred?

| username: redgame | Original post link

Yes, it is possible for the data to be only partially replicated. The exact outcome depends on when the exception occurs.

| username: breakyang | Original post link

If they are placed in a single transaction, is consistency guaranteed?

| username: tidb菜鸟一只 | Original post link

Data Synchronization Consistency

  • MySQL sink
    • TiCDC does not split single-table transactions, so the atomicity of single-table transactions is guaranteed.
    • TiCDC does not guarantee that the execution order of downstream transactions is exactly the same as the upstream order.
    • TiCDC splits cross-table transactions on a per-table basis and does not guarantee the atomicity of cross-table transactions (a sketch of this split follows the list).
    • TiCDC guarantees that the update order of a single row is consistent with the upstream update order.
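
To make the cross-table split concrete, here is a minimal sketch using the table1/table2 names from the question. The table2 columns are made up for the illustration, and the downstream statements only illustrate the per-table split, not TiCDC's exact output:

-- Upstream: one cross-table transaction with a single commit-ts
begin;
insert into table2 (id, val) values (1, 'row-a');  -- hypothetical columns
update table1 set status = 'end';
commit;

-- Downstream (MySQL sink): the changes are split by table and may be
-- applied as two separate transactions, in either order, for example:
begin;
update table1 set status = 'end';
commit;
begin;
insert into table2 (id, val) values (1, 'row-a');
commit;
-- So there can be a window in which table1 already shows status = 'end'
-- while the table2 rows have not arrived yet.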

| username: breakyang | Original post link

Does that mean I can’t guarantee it even if I put it in a transaction?

| username: tidb菜鸟一只 | Original post link

Actually, not putting them in a single transaction is what gives you the guarantee. TiCDC replicates data row by row based on the commit timestamp (commit-ts).

For example, suppose you execute the following three operations, each committed separately:

update table1 set status = 'start';
commit;  -- committed at 14:51:33 on July 26, 2023
insert xx into table2;  -- the large bulk insert performed by Spark
commit;  -- committed at 14:51:47 on July 26, 2023
update table1 set status = 'end';
commit;  -- committed at 14:52:07 on July 26, 2023

Then TiCDC generates three batches of data:

1. At 14:51:33 on July 26, 2023: 500 rows of table1 with status = 'start' (assuming table1 has 500 rows).
2. At 14:51:47 on July 26, 2023: 200 rows inserted into table2 (assuming 200 rows were inserted).
3. At 14:52:07 on July 26, 2023: 500 rows of table1 with status = 'end'.

On the other hand, if you put it in a single transaction:

update table1 set status = 'start';
insert xx into table2;  -- the large bulk insert performed by Spark
update table1 set status = 'end';
commit;  -- committed at 14:57:56 on July 26, 2023

This generates a single batch of data with one commit timestamp, 14:57:56 on July 26, 2023: the 200 rows inserted into table2 (assuming 200 rows were inserted) plus the 500 rows of table1 with status = 'end'.

In this case, the situation you mentioned can occur: because TiCDC splits the cross-table transaction by table, the downstream may have already applied the 500 table1 rows with status = 'end' while the 200 table2 rows have not yet been applied, if the changefeed hits an exception at that moment.
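
As a rough sketch of how a consumer on the secondary cluster could take advantage of the separate-commit pattern above (the queries are only illustrative, and this assumes, as described, that the 'end' flag is committed last and replicated after the table2 rows):

-- On the secondary cluster: read table2 only after table1 reports 'end'
select count(*) as finished from table1 where status = 'end';
-- If finished > 0, the status = 'end' update has been replicated; with
-- the separate commits above, the table2 rows carry an earlier
-- commit-ts and should already be present.
select * from table2 limit 10;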

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.