DDL Queue Blocked, No DDL Can Be Executed

translator_bot · June 22, 2024, 6:17pm

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: DDL 队列阻塞，所有DDL均无法执行

| username: Belial

[TiDB Usage Environment] Test Environment
[TiDB Version] v5.4.3
[Reproduction Path]
I used tiup to upgrade the cluster. The steps v3.0.19 → v4.0.16 → v5.4.3 all went normally, but the upgrade from v5.4.3 to v6.5.0 failed while TiDB was being restarted. The logs show that the DDL queue is stuck, with the following entries repeating:
[2023/02/04 19:25:34.130 +08:00] [INFO] [ddl_worker.go:932] ["[ddl] wait latest schema version changed"] [worker="worker 1, tp general"] [ver=37633] ["take time"=54.954686ms] [job="ID:32748, Type:modify column, State:done, SchemaState:public, SchemaID:3, TableID:19, RowCount:0, ArgLen:0, start time: 2023-02-03 19:04:57.588 +0800 CST, Err:, ErrCount:0, SnapshotVersion:0"]
[2023/02/04 19:25:34.130 +08:00] [INFO] [ddl_worker.go:906] ["[ddl] schema version doesn't change"] [worker="worker 1, tp general"]
[2023/02/04 19:25:34.136 +08:00] [ERROR] [delete_range.go:101] ["[ddl] add job into delete-range table failed"] [jobID=32748] [jobType="modify column"] [error="json: cannot unmarshal object into Go value of type int64"]
[2023/02/04 19:25:34.136 +08:00] [WARN] [ddl_worker.go:201] ["[ddl] handle DDL job failed"] [worker="worker 1, tp general"] [error="json: cannot unmarshal object into Go value of type int64"]
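
For triage, the failing job and its error can be pulled out of entries like these mechanically. What follows is only a rough Python sketch and not part of TiDB or tiup: the names `parse_tidb_log_line` and `failing_ddl_jobs` are made up for illustration, and it assumes the bracketed `[key=value]` layout seen above with plain straight double quotes (curly quotes introduced by copy-paste or translation would need to be normalized first).

```python
import re

# One bracketed log field: a bare or double-quoted key, optionally followed by
# "=" and a bare or double-quoted value. Quoted text may itself contain "]".
_QUOTED = r'"((?:[^"\\]|\\.)*)"'
_FIELD = re.compile(
    r'\[(?:' + _QUOTED + r'|([^\]="]+))(?:=(?:' + _QUOTED + r'|([^\]]*)))?\]'
)


def parse_tidb_log_line(line):
    """Split one log line into its header parts and its key=value fields."""
    header, fields = [], {}
    for m in _FIELD.finditer(line):
        key = m.group(1) if m.group(1) is not None else m.group(2)
        if m.group(3) is not None:
            fields[key] = m.group(3)      # quoted value
        elif m.group(4) is not None:
            fields[key] = m.group(4)      # bare value
        else:
            header.append(key)            # no "=": time, level, source or message
    if len(header) < 4:
        raise ValueError("not a recognizable log line: %r" % line)
    time, level, source, message = header[:4]
    return {"time": time, "level": level, "source": source,
            "message": message, "fields": fields}


def failing_ddl_jobs(lines):
    """Return the distinct (jobID, error) pairs found in ERROR-level entries."""
    seen = []
    for line in lines:
        if not line.strip():
            continue                      # skip blank lines
        rec = parse_tidb_log_line(line)
        pair = (rec["fields"].get("jobID"), rec["fields"].get("error"))
        if rec["level"] == "ERROR" and pair[0] and pair not in seen:
            seen.append(pair)
    return seen
```

Run over a log that keeps repeating the entries above, `failing_ddl_jobs` would collapse the noise into a single pair for job 32748 and its JSON unmarshal error, which is the job ID to look up in the DDL job list.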

| username: xfworld

If it doesn’t work, just cancel it.

Refer to this command:

| username: Belial

I tried to cancel it, but the job has stayed in the cancelling state for a long time since then.

| username: xfworld

If the cancellation does not go through and the job stays in "cancelling", that is a bad sign.

| username: tidb菜鸟一只

In general, no DDL operations should be running while the cluster is being upgraded.

| username: Belial

It is indeed possible that a DDL ran during the upgrade. Is there any way to recover from this situation?

| username: xfworld

Since it is a test environment, just back up the data yourself, tear the cluster down, and rebuild it…

| username: Belial

Actually, I would still like to solve the problem on this cluster. This upgrade shows that DDL cannot always be avoided entirely during an upgrade, because the business side may not remember when some scheduled task will run DDL automatically.

| username: xingzhenxiang

Before upgrading, you can run admin show ddl to check whether any DDL statements are still executing. A create table statement should finish quickly; in my experience, the long-running ones are usually add index operations.
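
That pre-upgrade check could be scripted roughly as follows. This is a sketch under assumptions, not an official procedure: it uses `ADMIN SHOW DDL JOBS` (which lists recent and pending jobs with a STATE column) rather than plain `admin show ddl`, the helper name `unfinished_ddl_jobs` is hypothetical, `cursor` is any Python DB-API cursor from a MySQL-compatible driver connected to TiDB, and the set of states treated as finished is an assumption to verify against the documentation for your version.

```python
# Assumption: STATE values that mean a DDL job is completely finished.
# Check this set against the ADMIN SHOW DDL JOBS docs for your TiDB version.
FINISHED_STATES = {"synced", "cancelled", "rollback done"}


def unfinished_ddl_jobs(cursor):
    """Return the rows of ADMIN SHOW DDL JOBS whose STATE is not finished.

    `cursor` is any DB-API 2.0 cursor connected to a TiDB server. Rows come
    back as dicts keyed by upper-cased column name, because the column order
    of this statement differs between TiDB versions.
    """
    cursor.execute("ADMIN SHOW DDL JOBS")
    columns = [col[0].upper() for col in cursor.description]
    rows = [dict(zip(columns, row)) for row in cursor.fetchall()]
    return [r for r in rows
            if str(r["STATE"]).lower() not in FINISHED_STATES]
```

An empty result suggests the DDL queue is idle and the upgrade can start. A job that sits in done without ever reaching synced, like job 32748 in the log at the top of this thread, would still be reported.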