DM synchronization delay, a DML SQL in the log takes around 10 seconds ["cost time"=12.231926599s]

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: DM同步延迟,日志里面一个dmlsql时间[“cost time”=12.231926599s]要10s左右

| username: 舞动梦灵

DM synchronization has been delayed since 11 AM last Friday. Initially, I thought it was due to a large data volume in the morning and expected it to recover by the evening, but the delay persisted. The delay increased from half an hour at 11 AM to 3 hours in the afternoon. By 7-8 PM, the delay was 5 hours, and the next morning at 9 AM, it was back to half an hour, repeating this cycle.

Here is a part of the log data:
[2024/01/22 10:03:31.822 +08:00] [WARN] [db.go:285] [“execute transaction”] [task=tccore] [unit=“binlog replication”] [query=“[INSERT INTO tc_xxt.t_table_name xxxxxxxx~~~2024-01-22 10:02:41 2024-01-22 10:02:41] [0ace7bb1493d4b619dca5aaf92d83b60 d678eb18d236458aa61dd282e0bcf347 5b002060e9d1423cb216168b9cd8a9af …”] [“cost time”=12.231926599s]

Actions taken: Previously, there were many write conflicts in the database. The query database mode result was empty, so on Friday morning, I changed it to pessimistic locking, and then the delay started. On Saturday afternoon, I changed it back to empty, but the delay did not recover.

| username: 小龙虾爱大龙虾 | Original post link

You can take a look at the slow SQL analysis of the downstream TiDB.

| username: oceanzhang | Original post link

Extract the statements to see where the slowdown is.

| username: oceanzhang | Original post link

However, we need to first check the database’s performance load to see if there have been any anomalies recently.

| username: 舞动梦灵 | Original post link

It’s been like this for several months.

| username: dba-kit | Original post link

+1, does this table have a unique index and also have multiple-rows: true enabled?

| username: 舞动梦灵 | Original post link

Where can I see this parameter? Synchronizing several tables. Full database synchronization. It’s not just this one table.

| username: dba-kit | Original post link

Uh, ignore this parameter. I just noticed that your version is 4.0.2, which doesn’t support setting multiple-rows…
Just check if the slow table has a unique index and the load situation of the downstream TiDB cluster.

| username: Jellybean | Original post link

Synchronization has slowed down. If there are no issues with the DM task, it is likely that some optimizations are needed downstream.

Check the read and write latency of the downstream cluster, TiKV hotspots, and slow queries. Start troubleshooting from these areas.

| username: 舞动梦灵 | Original post link

These are all business tables. There are dozens of fields such as primary key index and date. The load seems normal, around 2-3. CPU usage is around 100-200%, which should be relatively low for a 16-core processor.

| username: 舞动梦灵 | Original post link

Well, it can only be a TiDB issue at the moment. Or there is a sudden increase in data on the source end.

| username: Jellybean | Original post link

OP, please analyze the situation of slow queries in the downstream cluster.

| username: 舞动梦灵 | Original post link

I have read a lot, and most of them are about slow inserts during the parameter phase. According to the documentation, this phase usually involves write conflicts. The database was upgraded from version 2.0 to 4.0.2. It should be in optimistic mode because a few days ago, write conflicts were affecting the business. On Friday morning, it was switched to pessimistic mode, which resulted in severe delays. It was switched back to optimistic mode on Saturday. It seems to have recovered by Monday.

| username: oceanzhang | Original post link

If there is no significant improvement in performance, it should either be a configuration issue or a problem with the SQL itself.

| username: oceanzhang | Original post link

It is recommended to first determine if there is an issue with the SQL itself.

| username: 舞动梦灵 | Original post link

SQL consists entirely of “INSERT VALUES” with specified fixed values and a fixed number of parameters being transmitted.

| username: Jellybean | Original post link

Has there been any change in the traffic situation of the business accessing the cluster? Based on your description of the phenomenon, it might be quite related to the business access.

| username: 舞动梦灵 | Original post link

There hasn’t been much change in the business, and the cost time has always been around 1-12 seconds. I suspect that this delay started when I changed to pessimistic locking on Friday morning, then changed it to null on Saturday, and synchronization returned to normal early Monday morning.

| username: Jellybean | Original post link

Yes, if conditions permit, you can simulate and verify the conclusion in a test environment to avoid encountering similar issues in the production environment again.

| username: 有猫万事足 | Original post link

There shouldn’t be so many write conflicts during normal synchronization.

If there are many write conflicts during data synchronization, extra attention should be paid to whether it’s due to shard merging, and whether the primary key of the merged table lacks the shard key, causing write conflicts.
In this case, it indicates an error in the synchronization logic.