Version 6.5 gets stuck during DM synchronization of DDL and cannot be canceled

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 6.5版本在DM同步DDL时候被卡死,没办法取消

| username: dba-kit

DM replicated a DDL for a small table to the TiDB server, and the DDL got stuck on the TiDB side. Using `ADMIN CANCEL DDL JOBS` to cancel it also kept failing; only after nearly four hours did the cancellation finally succeed. During that period the log was full of messages like `[2023/05/24 19:09:28.673 +08:00] [INFO] [syncer.go:333] ["[ddl] syncer check all versions, someone is not synced"] [info="instance ip {tidb-server-ip} port 4000, id 831b35bd-9d31-4ec4-a005-08ac7c6ead64"] ["ddl id"=47401] [ver=33590]`.
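For reference, a minimal sketch of how such a stuck job can be located and cancelled from a SQL client (the job ID 47401 is taken from the logs below; on another cluster it would be different):

```sql
-- List the most recent DDL jobs and their states to find the stuck one
-- (here it shows up as an "add column" job that never leaves the running state).
ADMIN SHOW DDL JOBS 10;

-- Cancel it by job ID. In this incident the cancellation was accepted but only
-- took effect hours later, once the schema version finally synced everywhere.
ADMIN CANCEL DDL JOBS 47401;
```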

Filtered logs are as follows:

[2023/05/24 19:09:28.569 +08:00] [INFO] [ddl_worker.go:314] ["[ddl] add DDL jobs"] ["batch count"=1] [jobs="ID:47401, Type:add column, State:queueing, SchemaState:none, SchemaID:5747, TableID:7466, RowCount:0, ArgLen:4, start time: 2023-05-24 19:09:28.503 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0; "] [table=true]
[2023/05/24 19:09:28.569 +08:00] [INFO] [ddl.go:1011] ["[ddl] start DDL job"] [job="ID:47401, Type:add column, State:queueing, SchemaState:none, SchemaID:5747, TableID:7466, RowCount:0, ArgLen:4, start time: 2023-05-24 19:09:28.503 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"] [query="ALTER TABLE `db1`.`tb1` ADD COLUMN `col1` VARCHAR(30) NULL COMMENT '收益止盈类型, FIXED:固定值, RANGE:范围值'"]
[2023/05/24 19:09:28.589 +08:00] [INFO] [ddl_worker.go:1168] ["[ddl] run DDL job"] [worker="worker 3, tp general"] [job="ID:47401, Type:add column, State:queueing, SchemaState:none, SchemaID:5747, TableID:7466, RowCount:0, ArgLen:0, start time: 2023-05-24 19:09:28.503 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2023/05/24 19:09:28.592 +08:00] [INFO] [column.go:133] ["[ddl] run add column job"] [job="ID:47401, Type:add column, State:running, SchemaState:none, SchemaID:5747, TableID:7466, RowCount:0, ArgLen:4, start time: 2023-05-24 19:09:28.503 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"] [columnInfo="{\"id\":45,\"name\":{\"O\":\"col1\",\"L\":\"col1\"},\"offset\":44,\"origin_default\":null,\"origin_default_bit\":null,\"default\":null,\"default_bit\":null,\"default_is_expr\":false,\"generated_expr_string\":\"\",\"generated_stored\":false,\"dependences\":null,\"type\":{},\"state\":0,\"comment\":\"收益止盈类型, FIXED:固定值, RANGE:范围值\",\"hidden\":false,\"change_state_info\":null,\"version\":2}"]
[2023/05/24 19:09:28.634 +08:00] [INFO] [domain.go:726] ["mdl gets lock, update to owner"] [jobID=47401] [version=33590]
[2023/05/24 19:09:28.673 +08:00] [INFO] [syncer.go:333] ["[ddl] syncer check all versions, someone is not synced"] [info="instance ip 192.168.1.221, port 4000, id 65867468-6375-4732-b2e7-1501df354479"] ["ddl id"=47401] [ver=33590]
[2023/05/24 19:09:28.673 +08:00] [INFO] [syncer.go:333] ["[ddl] syncer check all versions, someone is not synced"] [info="instance ip 192.168.1.245, port 4000, id 831b35bd-9d31-4ec4-a005-08ac7c6ead64"] ["ddl id"=47401] [ver=33590]

# Many log lines filtered out here, mostly repeated `syncer check all versions, someone is not synced` messages

[2023/05/25 01:14:22.975 +08:00] [INFO] [ddl_worker.go:1377] ["[ddl] wait latest schema version changed(get the metadata lock if tidb_enable_metadata_lock is true)"] [ver=33590] ["take time"=6h4m54.355850116s] [job="ID:47401, Type:add column, State:running, SchemaState:delete only, SchemaID:5747, TableID:7466, RowCount:0, ArgLen:4, start time: 2023-05-24 19:09:28.503 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2023/05/25 01:14:22.989 +08:00] [INFO] [ddl_worker.go:1168] ["[ddl] run DDL job"] [worker="worker 3, tp general"] [job="ID:47401, Type:add column, State:cancelling, SchemaState:delete only, SchemaID:5747, TableID:7466, RowCount:0, ArgLen:0, start time: 2023-05-24 19:09:28.503 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2023/05/25 01:14:22.992 +08:00] [INFO] [rollingback.go:433] ["[ddl] the DDL job is cancelled normally"] [worker="worker 3, tp general"] [job="ID:47401, Type:add column, State:rollingback, SchemaState:delete only, SchemaID:5747, TableID:7466, RowCount:0, ArgLen:1, start time: 2023-05-24 19:09:28.503 +0800 CST, Err:[ddl:8214]Cancelled DDL job, ErrCount:1, SnapshotVersion:0"] [error="[ddl:8214]Cancelled DDL job"]
[2023/05/25 01:14:23.004 +08:00] [INFO] [ddl_worker.go:1168] ["[ddl] run DDL job"] [worker="worker 3, tp general"] [job="ID:47401, Type:add column, State:rollingback, SchemaState:delete only, SchemaID:5747, TableID:7466, RowCount:0, ArgLen:0, start time: 2023-05-24 19:09:28.503 +0800 CST, Err:[ddl:8214]Cancelled DDL job, ErrCount:1, SnapshotVersion:0"]
[2023/05/25 01:14:23.084 +08:00] [INFO] [domain.go:726] ["mdl gets lock, update to owner"] [jobID=47401] [version=33626]
[2023/05/25 01:15:21.269 +08:00] [INFO] [ddl_worker.go:1377] ["[ddl] wait latest schema version changed(get the metadata lock if tidb_enable_metadata_lock is true)"] [ver=33626] ["take time"=58.239228943s] [job="ID:47401, Type:add column, State:rollingback, SchemaState:delete reorganization, SchemaID:5747, TableID:7466, RowCount:0, ArgLen:1, start time: 2023-05-24 19:09:28.503 +0800 CST, Err:[ddl:8214]Cancelled DDL job, ErrCount:1, SnapshotVersion:0"]
[2023/05/25 01:15:21.280 +08:00] [INFO] [ddl_worker.go:1168] ["[ddl] run DDL job"] [worker="worker 3, tp general"] [job="ID:47401, Type:add column, State:rollingback, SchemaState:delete reorganization, SchemaID:5747, TableID:7466, RowCount:0, ArgLen:0, start time: 2023-05-24 19:09:28.503 +0800 CST, Err:[ddl:8214]Cancelled DDL job, ErrCount:1, SnapshotVersion:0"]
[2023/05/25 01:15:21.334 +08:00] [INFO] [domain.go:726] ["mdl gets lock, update to owner"] [jobID=47401] [version=33627]
[2023/05/25 01:15:21.379 +08:00] [INFO] [ddl_worker.go:1377] ["[ddl] wait latest schema version changed(get the metadata lock if tidb_enable_metadata_lock is true)"] [ver=33627] ["take time"=74.139844ms] [job="ID:47401, Type:add column, State:rollback done, SchemaState:none, SchemaID:5747, TableID:7466, RowCount:0, ArgLen:1, start time: 2023-05-24 19:09:28.503 +0800 CST, Err:[ddl:8214]Cancelled DDL job, ErrCount:1, SnapshotVersion:0"]
[2023/05/25 01:15:21.393 +08:00] [INFO] [ddl_worker.go:643] ["[ddl] finish DDL job"] [worker="worker 3, tp general"] [job="ID:47401, Type:add column, State:rollback done, SchemaState:none, SchemaID:5747, TableID:7466, RowCount:0, ArgLen:0, start time: 2023-05-24 19:09:28.503 +0800 CST, Err:[ddl:8214]Cancelled DDL job, ErrCount:1, SnapshotVersion:0"]
[2023/05/25 01:15:21.440 +08:00] [INFO] [ddl.go:1114] ["[ddl] DDL job is failed"] [jobID=47401]

| username: huhaifeng | Original post link

Hello, I would like to know more about your DM task: how many upstream instances are there, what is the DDL, and how heavy is the business load?

| username: Hacker007 | Original post link

There are many errors related to DDL incompatibility, but I haven’t seen a job get stuck the way you describe. Could something on your side be holding a lock?

| username: Hacker007 | Original post link

`mdl gets lock`
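If this log line points at the metadata lock, whether the feature is enabled on the cluster can be checked directly; a minimal sketch (MDL defaults to ON in v6.5):

```sql
-- Metadata lock (MDL) was introduced in TiDB 6.3 and is enabled by default in 6.5;
-- with it, a DDL waits for sessions that are still using the old schema version.
SHOW VARIABLES LIKE 'tidb_enable_metadata_lock';
```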

| username: zhanggame1 | Original post link

Is it possible that the target database is running other DDL operations? DDL operations cannot be concurrent.
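A quick way to check that on the target TiDB; a sketch using the standard admin statement:

```sql
-- Shows the current DDL owner and the DDL jobs it is running; a long-running
-- job here would explain why a newly submitted DDL has to wait.
ADMIN SHOW DDL;
```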

| username: dba-kit | Original post link

The task replicates a single MySQL instance, but this issue is actually unrelated to DM. It looks like a problem with the metadata lock on this table: any DDL on this table gets stuck for a long time (usually 3 to 4 hours, though it eventually succeeds), while other tables don’t have this issue.
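For anyone hitting the same symptom, a sketch of how to check whether an old session or transaction is holding the metadata lock on the table (`<blocking_session_id>` is a placeholder taken from the view’s output):

```sql
-- Lists DDL jobs that are currently blocked by metadata locks, together with
-- the sessions/transactions holding them.
SELECT * FROM mysql.tidb_mdl_view;

-- If a long-running session is holding the lock, terminating it lets the DDL
-- proceed; the ID comes from the query above.
-- KILL <blocking_session_id>;
```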

| username: huhaifeng | Original post link

Master is awesome.
It is likely a metadata lock issue; see this issue: Schema version update failure may cause DDL job to get stuck · Issue #43755 · pingcap/tidb · GitHub

| username: dba-kit | Original post link

This is most likely the problem. I looked at the v7.1 release notes today, and the fix has already been merged into the official 7.1 branch.

| username: dba-kit | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.