Frequent Primary Key Conflict Errors in DM Synchronization, but No Actual Conflict

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: DM同步频繁报错主键冲突问题,但实际没有冲突

| username: 舞动梦灵

When synchronizing MySQL to TiDB with an online DDL, since TiDB has already added the index in advance, it skips the error, or (as previously tested) it will automatically skip. However, for this task, after manually skipping, it frequently reports primary key conflicts, but in reality, there are no conflicts. Manually resuming the task makes it synchronize normally. But after ten minutes to half an hour, it will again indicate a conflict.

| username: Jellybean | Original post link

When manually resuming the task, check if the binlog position has rolled back. It is possible that after rolling back, the same error is triggered again when resynchronizing to the problematic point.

Additionally, you can start a new synchronization task and synchronize to a downstream table without a pre-created primary key to see if this is the cause of the issue.

| username: 有猫万事足 | Original post link

After DM resumes the data synchronization task from the checkpoint, it may re-execute certain binlog events, leading to the following issues:

  1. During incremental synchronization, the execution of DML operations and the writing of checkpoints are not synchronized; the writing of checkpoints and the writing of downstream data also cannot guarantee atomicity. Therefore, when DM exits abnormally, the checkpoint may only record a recovery point before the exit time.
  2. When DM restarts the synchronization task and resumes incremental data synchronization from the checkpoint, some data after the checkpoint may have already been processed before the abnormal exit, leading to some SQL statements being executed repeatedly.
  3. If the INSERT operation is executed repeatedly, it will cause primary key or unique index conflicts, interrupting the synchronization; if the UPDATE operation is executed repeatedly, it will cause the previous corresponding update record to not be found based on the filter conditions.

This issue is discussed in the safe mode section.
Safe mode can solve this problem.

Moreover, safe mode has an automatic activation mechanism.

When DM resumes the incremental synchronization task from the checkpoint (e.g., worker restart, network disconnection and reconnection, etc.), it will automatically enable safe mode for a period of time.

I speculate that every time you resume and it works again, it might be because this automatically enabled safe mode temporarily rewrote the conflicting SQL. After running for a while, safe mode exits, and then when it encounters the next conflicting SQL, it reports an error and stops again.

| username: dba-kit | Original post link

It is highly likely to be this issue. It is recommended to use the sync-diff tool to compare the table with the error and check whether the upstream and downstream data are consistent.

| username: 舞动梦灵 | Original post link

He has been syncing continuously. It syncs to a certain point, then prompts a conflict, but the conflicting data is not found in TiDB. Then, after manually resuming, it continues to sync, and based on the sync progress, it seems to have rolled back. After syncing for about ten minutes, the same issue occurs again, requiring another manual resume. Unless it is real-time syncing, the problem persists. If there is a delay due to a large transaction, the error will frequently occur when catching up with the logs.

| username: 舞动梦灵 | Original post link

This is an incremental synchronization filtering delete. The target TiDB requires full data, but the source only has data from the past few months. This is already a newly initiated task.

| username: 有猫万事足 | Original post link

Could it be that the conflicting data is within a transaction, and if the transaction fails, everything gets rolled back?

Then this might be the reason for the conflict within a transaction. It could be that a large transaction is initiated, inserting->deleting->re-inserting a piece of data. Because you filtered out the delete event, the same data in the downstream will be inserted twice within the same transaction.

In this case, the best solution might be to enable safe mode in DM. Since the table data in the upstream and downstream are not synchronized, sync-diff cannot compare consistency either.

| username: 双开门变频冰箱 | Original post link

Compare the data.

| username: 舞动梦灵 | Original post link

It is impossible for the upstream to manually delete data. These are all order data, and the conflicts do not always occur, only when the delay is more than 1 hour. When it starts catching up, it prompts a conflict after about 10 minutes, and then clicking resume continues the synchronization without this error.

During the Spring Festival, I added indexes and performed other large transactions on the order database. So there were two or three times when it was delayed for several hours. As long as the delay is a few hours, it will prompt a primary key conflict. However, as long as the delay is within 3 minutes, meaning real-time synchronization is normal, this issue does not occur.

| username: 舞动梦灵 | Original post link

There are almost no large transaction operations now. So currently, there won’t be any issues with real-time synchronization. I just want to understand why this problem occurs. The conflict prompt shows an insert SQL, and it is not a case where a large number of inserts occur in the same transaction. Additionally, it is not a scenario where an insert is followed by a delete and then another insert within the same transaction. This is an order database where business deletions are almost impossible. It is data for parking lot entries and exits, with only inserts and updates.

| username: 哈喽沃德 | Original post link

Is it multiple sources?

| username: 舞动梦灵 | Original post link

Only one

| username: 连连看db | Original post link

Is MySQL DDL executed directly or using a tool?

| username: 舞动梦灵 | Original post link

This DDL was executed using gh-ost. I added the online DDL parameter and also added the parameter to filter create index. However, gh-ost internally generates and converts the create index, so it still executes and causes an error. I chose to skip this error, and there was no problem. Synchronization started, but it frequently prompted primary key conflicts. After executing recovery, it was normal. It continued to synchronize in real-time without any prompts as long as the delay was within 3 minutes.

| username: TiDBer_5Vo9nD1u | Original post link

Is the application also writing data while the synchronization program is synchronizing data?

| username: 舞动梦灵 | Original post link

No. The target database only has query code connections. If there are applications writing data to the target, it will definitely crash. And it’s already been five or six years.

| username: 有猫万事足 | Original post link

Convert the binlog file into text.

mysqlbinlog --base64-output=DECODE-ROWS -v mysql-bin.000095 >binlog.sql

Check carefully, because the conflict is with the same ID, it should be relatively easy to find in the log.

| username: 舞动梦灵 | Original post link

Query on the MySQL side? The primary key ID can be found on the MySQL side. It cannot be found on the target TiDB side. Execute the recovery command, and the primary key ID will be synchronized over. TiDB will then be able to find it.

| username: 随便改个用户名 | Original post link

Could you show the table structure of TiDB?

| username: 舞动梦灵 | Original post link

The same, it’s not just one table, multiple tables have the prompt, but executing the recovery command is normal.