DM Same Configuration, But Some Tasks Are Abnormal

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: dm 同样配置,但部分任务异常

| username: rebelsre

【TiDB Usage Environment】Production Environment
【TiDB Version】
【Reproduction Path】
Previously, haproxy's timeout client 30s caused all tasks to report the same error. After adjusting the configuration back to timeout client 30000s, some tasks are normal, but many tasks still report errors.
【Encountered Problem: Problem Phenomenon and Impact】
"errors": [
  {
    "ErrCode": 10006,
    "ErrClass": "database",
    "ErrScope": "not-set",
    "ErrLevel": "high",
    "Message": "startLocation: [position: (mysql-bin.1880585, 458966831), gtid-set: 763d006c-c9f6-11eb-9bcf-0c42a13f1770:152418822666-152419880217], endLocation: [position: (mysql-bin.1880585, 458966926), gtid-set: 763d006c-c9f6-11eb-9bcf-0c42a13f1770:152418822666-152419880217]: execute statement failed: begin",
    "RawCause": "invalid connection",
    "Workaround": ""
  }
],
【Resource Configuration】
【Attachments: Screenshots/Logs/Monitoring】
dm-worker_stdout.log
dm-worker_stderr.log
dm-worker.log

| username: xfworld | Original post link

Try connecting directly to TiDB…

| username: rebelsre | Original post link

There are environmental constraints that make a direct connection difficult, but I can give it a try. However, looking at the time when the syncer was killed in dm-worker_stdout.log, it was only 20 seconds. Are there any timeout configurations or caches on the DM side? Recreating the task with start-task --remove-meta also has no effect.

| username: rebelsre | Original post link

Here is the DM task configuration:

name: xxx
task-mode: incremental

target-database:
  host: "127.0.0.1"
  port: 3306
  user: "root"
  password: "xxx"

syncers:
  global:
    worker-count: 128
    batch: 500
    compact: true
    multiple-rows: true

block-allow-list:
  ba-rule1:
    do-dbs: ["xxx"]

mysql-instances:
  - source-id: "xxx"
    meta:
      binlog-name: mysql-bin.1880577
      binlog-pos: 475808582
    block-allow-list: "ba-rule1"
    syncer-config-name: "global"

| username: rebelsre | Original post link

Bypassing haproxy and using socat forwarding, the same issue still occurs.

| username: Fly-bird | Original post link

Can you confirm that both upstream and downstream are fine?

| username: xfworld | Original post link

No forwarding, no proxy, connect directly…

You need to first confirm where the problem actually is, right?

| username: rebelsre | Original post link

Both are fine, because some tasks run normally with the same configuration.

| username: rebelsre | Original post link

Environmental constraints make a truly direct connection impossible.

| username: TiDBer_小阿飞 | Original post link

Check the monitoring under cluster_tidb → kv errors for any locks or backoffs?

| username: dba-kit | Original post link

This error is reported by the Go MySQL driver. If it occurs during the full synchronization (dump) phase, check wait_timeout on the upstream MySQL; if it occurs during incremental synchronization, check wait_timeout on the downstream TiDB. Both invalid connection and bad connection mean the connection still looks healthy to DM but was unilaterally closed by the server (MySQL/TiDB), so investigate the parameters that could terminate the connection.
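
A quick way to compare the two timeouts is to query them on each side; this is only a sketch and assumes you can log in to both the upstream MySQL and the downstream TiDB:

-- On the upstream MySQL (relevant to the dump phase)
SHOW GLOBAL VARIABLES LIKE 'wait_timeout';

-- On the downstream TiDB (relevant to the incremental phase)
SHOW GLOBAL VARIABLES LIKE 'wait_timeout';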

| username: rebelsre | Original post link

| username: dba-kit | Original post link

If the error occurs during the incremental sync phase, it is very likely that the table is not updated frequently: no data is written for longer than wait_timeout, and when data suddenly arrives, DM finds that the connection has already been killed by the downstream.
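
If that is the case, one common mitigation is to raise wait_timeout on the side that is dropping the idle connections; the value below (24 hours) is only an illustrative choice, not a recommendation from this thread:

-- On the downstream TiDB (or the upstream MySQL, if that is the side killing connections)
SET GLOBAL wait_timeout = 86400;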

| username: TiDBer_小阿飞 | Original post link

Is your txnLock always this high? Check if there is scheduling in the PD logs.

| username: TiDBer_小阿飞 | Original post link

Just check if there is a “leader changed” during DM operation.

| username: rebelsre | Original post link

Yes, the leader entries keep refreshing in the logs, but the leader has not switched to another node. What impact would this have?

| username: TiDBer_小阿飞 | Original post link

If hot reads or hot writes caused by other workloads make PD frequently reschedule leaders, that leader scheduling will trigger backoff, and your DM task may be blocked or even interrupted.
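
To see whether hot reads or hot writes are actually present, one option (assuming your TiDB version exposes this system table) is to look at the hot regions reported by information_schema:

-- Hot write regions; change type to 'read' to check hot reads
SELECT * FROM information_schema.tidb_hot_regions WHERE type = 'write' LIMIT 10;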

| username: TiDBer_小阿飞 | Original post link

As for the cause of the lock, you need to look at the specific SQL, what other operations are running in the same time window, and whether any deadlocking SQL is being generated; see information_schema.deadlocks.

| username: rebelsre | Original post link

mysql> select * from information_schema.deadlocks limit 1;
Empty set (0.00 sec)

There are no deadlocks after checking.