Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: DM synchronization delay; a single DML SQL in the log (["cost time"=12.231926599s]) takes around 10s
DM synchronization has been delayed since 11 AM last Friday. Initially I thought it was due to a large data volume in the morning and expected it to recover by the evening, but the delay persisted: it grew from half an hour at 11 AM to 3 hours in the afternoon and about 5 hours by 7-8 PM, then was back down to half an hour by 9 AM the next morning, and this cycle keeps repeating.
Here is a part of the log data:
```
[2024/01/22 10:03:31.822 +08:00] [WARN] [db.go:285] ["execute transaction"] [task=tccore] [unit="binlog replication"] [query="[INSERT INTO `tc_xxt`.`t_table_name` xxxxxxxx~~~2024-01-22 10:02:41 2024-01-22 10:02:41] [0ace7bb1493d4b619dca5aaf92d83b60 d678eb18d236458aa61dd282e0bcf347 5b002060e9d1423cb216168b9cd8a9af …"] ["cost time"=12.231926599s]
```
Actions taken: there had previously been many write conflicts in the database. When I queried the database's transaction mode, the result was empty, so on Friday morning I changed it to pessimistic locking, and that is when the delay started. On Saturday afternoon I changed it back to empty, but the delay did not recover.
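For reference, a minimal sketch of what that toggling looks like, assuming the setting in question is TiDB's tidb_txn_mode variable (the post does not name the exact variable):

```sql
-- Assumption: the "database mode" being toggled is TiDB's tidb_txn_mode variable.
-- Check the current setting; an empty value behaves as optimistic on clusters
-- upgraded from pre-4.0 versions.
SHOW VARIABLES LIKE 'tidb_txn_mode';

-- "Changed it to pessimistic locking":
SET GLOBAL tidb_txn_mode = 'pessimistic';

-- "Changed it back to empty":
SET GLOBAL tidb_txn_mode = '';
```

Note that SET GLOBAL only takes effect for new sessions, so existing DM connections keep their previous mode until they reconnect.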
You can take a look at the slow SQL analysis of the downstream TiDB.
Extract the statements to see where the slowdown is.
But it's best to first check the database's performance load to see whether anything has been abnormal recently.
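A minimal sketch of that slow-SQL check, assuming the downstream is TiDB 4.0 where the information_schema.cluster_slow_query table is available (the one-hour window and the INSERT filter are just examples):

```sql
-- Pull the slowest recent INSERTs on the downstream TiDB and see which
-- phase (prewrite / commit / backoff) is taking the time.
SELECT Time, Query_time, Prewrite_time, Commit_time, Backoff_time, Query
FROM information_schema.cluster_slow_query
WHERE Time > NOW() - INTERVAL 1 HOUR
  AND Query LIKE 'INSERT%'
ORDER BY Query_time DESC
LIMIT 20;
```

A Prewrite_time or Backoff_time that dominates Query_time generally points at write conflicts or lock waits rather than raw write throughput.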
It’s been like this for several months.
+1, does this table have a unique index, and is multiple-rows: true enabled?
Where can I see this parameter? I'm synchronizing several tables; it's full-database synchronization, not just this one table.
Uh, ignore this parameter. I just noticed that your version is 4.0.2, which doesn’t support setting multiple-rows…
Just check if the slow table has a unique index and the load situation of the downstream TiDB cluster.
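A quick sketch of that index check, using the redacted table name from the log as a placeholder:

```sql
-- Check whether the slow table has a primary key or unique index.
-- tc_xxt.t_table_name is the redacted name from the log above.
SHOW INDEX FROM tc_xxt.t_table_name;
-- Rows with Non_unique = 0 are unique indexes (including PRIMARY).
```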
Synchronization has slowed down. If there are no issues with the DM task, it is likely that some optimizations are needed downstream.
Check the read and write latency of the downstream cluster, TiKV hotspots, and slow queries. Start troubleshooting from these areas.
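For the hotspot part, one quick look (assuming the downstream is 4.0, where information_schema.tidb_hot_regions is available) could be:

```sql
-- Rough view of current write hotspots on the downstream cluster.
SELECT db_name, table_name, index_name, region_id, flow_bytes
FROM information_schema.tidb_hot_regions
WHERE type = 'write'
ORDER BY flow_bytes DESC
LIMIT 10;
```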
These are all business tables with dozens of fields, plus a primary key index and date columns. The load seems normal, a load average of around 2-3, and CPU usage around 100-200%, which should be fairly low for a 16-core machine.
Well, it can only be a TiDB issue at the moment. Or there is a sudden increase in data on the source end.
OP, please analyze the situation of slow queries in the downstream cluster.
I have looked at a lot of them; most are slow inserts, stuck in the phase that, according to the documentation, usually indicates write conflicts. The database was upgraded from version 2.0 to 4.0.2, so it should have been in optimistic mode, and a few days ago write conflicts were affecting the business. On Friday morning it was switched to pessimistic mode, which resulted in severe delays; it was switched back to optimistic mode on Saturday, and it seems to have recovered by Monday.
If there is no significant improvement in performance, it should either be a configuration issue or a problem with the SQL itself.
It is recommended to first determine if there is an issue with the SQL itself.
The SQL consists entirely of INSERT ... VALUES statements with fixed columns and a fixed number of parameters.
Has there been any change in the traffic situation of the business accessing the cluster? Based on your description of the phenomenon, it might be quite related to the business access.
There hasn't been much change in the business, and the cost time has always been around 1-12 seconds. I suspect the delay started when I changed to pessimistic locking on Friday morning; I changed it back to empty on Saturday, and synchronization returned to normal early Monday morning.
Yes, if conditions permit, you can simulate and verify the conclusion in a test environment to avoid encountering similar issues in the production environment again.
There shouldn’t be so many write conflicts during normal synchronization.
If there are many write conflicts during data synchronization, extra attention should be paid to whether it’s due to shard merging, and whether the primary key of the merged table lacks the shard key, causing write conflicts.
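To illustrate that shard-merge point (the table and column names below are made up): if two upstream shards both use a bare auto-increment id as the primary key and are merged into one downstream table keyed only on id, rows from different shards collide; including a shard-identifying column in the key avoids that.

```sql
-- Hypothetical: upstream shards db_01.t and db_02.t both use AUTO_INCREMENT id
-- as their primary key, so both can contain a row with id = 1.
-- A merged downstream table keyed only on id would conflict on those rows:
--   CREATE TABLE merged_t (id BIGINT PRIMARY KEY, ...);
-- Including the shard key (here a hypothetical source_db column) in the primary
-- key keeps rows from different shards distinct:
CREATE TABLE merged_t (
    id        BIGINT       NOT NULL,
    source_db VARCHAR(32)  NOT NULL,
    payload   VARCHAR(255),
    PRIMARY KEY (id, source_db)
);
```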
In this case, it indicates an error in the synchronization logic.