Not putting it all in a single transaction actually gives you better consistency. TiCDC replicates data row by row, ordered by the commit timestamp.
For example, if you execute the following as three separate transactions:
update table1 set status = 'start';
commit; ------- committed at 14:51:33 on July 26, 2023
Spark starts inserting a large amount of data, similar to insert xx into table2;
commit; ------- committed at 14:51:47 on July 26, 2023
update table1 set status = 'end';
commit; ------- committed at 14:52:07 on July 26, 2023
Then in CDC, three batches of data will be generated:
First, 500 rows of table1 with status = 'start' at 14:51:33 on July 26, 2023 (assuming table1 has 500 rows);
then 200 rows in table2 at 14:51:47 on July 26, 2023 (assuming 200 rows were inserted);
finally, 500 rows of table1 with status = 'end' at 14:52:07 on July 26, 2023.
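For reference, here is a minimal sketch of that split version with explicit transaction boundaries; the INSERT column list and values are illustrative placeholders, not from the original job:

-- step 1: its own transaction, so it gets its own commit timestamp
BEGIN;
UPDATE table1 SET status = 'start';
COMMIT;

-- step 2: the Spark load commits separately (illustrative columns/values)
BEGIN;
INSERT INTO table2 (id, val) VALUES (1, 'a'), (2, 'b');
COMMIT;

-- step 3: a third transaction, and a third commit timestamp
BEGIN;
UPDATE table1 SET status = 'end';
COMMIT;

Because each step commits on its own, TiCDC sees three distinct commit timestamps and emits three batches in that order, as described above.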
On the other hand, if you put it in a single transaction:
update table1 set status = 'start';
Spark starts inserting a large amount of data, similar to insert xx into table2;
update table1 set status = 'end';
commit; ------- committed at 14:57:56 on July 26, 2023
This generates a single batch of data all at once: 200 rows in table2 (assuming 200 rows were inserted) plus 500 rows of table1 with status = 'end', all at 14:57:56 on July 26, 2023.
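Restated with explicit transaction boundaries (again with an illustrative column list), every row change below shares the one commit timestamp of the final COMMIT:

BEGIN;
UPDATE table1 SET status = 'start';                      -- overwritten below; downstream only ever sees 'end'
INSERT INTO table2 (id, val) VALUES (1, 'a'), (2, 'b');  -- illustrative columns/values
UPDATE table1 SET status = 'end';
COMMIT;  -- one commit timestamp for everything above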
In this case, the situation you mentioned can occur: downstream, the 500 table1 rows with status = 'end' may already have been applied while the 200 table2 rows have not, which is why the CDC replication looks abnormal.
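If you want to observe that window, a simple check against the downstream database (assuming you know how many rows the Spark job committed upstream) is something like:

-- run downstream; if status is already 'end' but the table2 count is still
-- below what was committed upstream, you are inside the inconsistent window
SELECT
    (SELECT status FROM table1 LIMIT 1) AS table1_status,
    (SELECT COUNT(*) FROM table2)       AS table2_row_count;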