DDL Stuck and Cannot Execute

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: DDL卡住不能执行

| username: wwccmm858

[TiDB Usage Environment] Production Environment
[TiDB Version] v6.3.0
[Reproduction Path] Scheduled tasks add and delete partitions; the problem occurs occasionally. Previously, after changing the own_id, DDL would stay stuck for 8-10 hours. This time, the table is still stuck after changing the own_id.
[Encountered Problem: Problem Phenomenon and Impact]
The production environment will add and delete partitions tomorrow. On May 20th, a table got stuck while a partition was being created, and many subsequent DDL tasks piled up behind it. After the cluster was restarted, DDLs for other tables also got stuck. Changing the own_id and restarting the cluster did not solve the problem. Only more than 8 hours after the restart could DDL operations on other tables be executed.

| username: h5n1 | Original post link

Is your production also version 6.3.0?

| username: wwccmm858 | Original post link

Yes.

| username: h5n1 | Original post link

v6.3.0 is a DMR (Development Milestone Release) and is not recommended for production use. v6.3.0 introduced the tidb_enable_metadata_lock variable, under which DML can block DDL operations. The feature is new and not very mature, so try setting the variable to OFF. If a restart is required, shut down all TiDB servers first and then start them again. Upgrading to a newer version is recommended.
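A minimal sketch of checking and changing the variable, assuming a MySQL client session with permission to set global variables. The mysql.tidb_mdl_view query is an addition not mentioned in this thread; that view was introduced together with metadata lock and is only informative while the feature is ON.

```sql
-- Check the current global value of the metadata lock switch.
SHOW GLOBAL VARIABLES LIKE 'tidb_enable_metadata_lock';

-- Turn the feature off for the whole cluster, as suggested above.
SET GLOBAL tidb_enable_metadata_lock = OFF;

-- While the feature is ON, this view lists DDL jobs that are currently
-- blocked by open transactions (addition, not from the thread).
SELECT * FROM mysql.tidb_mdl_view;
```

If the variable already shows OFF, metadata lock is unlikely to be the cause and the DDL job queue itself is the next thing to inspect.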

| username: 鱼跃龙门 | Original post link

Take a look at this parameter.

| username: wwccmm858 | Original post link

This parameter was already OFF.

| username: wwccmm858 | Original post link

This parameter is OFF.

| username: h5n1 | Original post link

Do a full restart rather than a rolling one: don't restart the nodes one by one; stop all of them and then start them again.

| username: FutureDB | Original post link

Does this situation randomly occur on a certain table when adding and deleting partitions regularly?

| username: zhanggame1 | Original post link

TiDB got stuck on DDL and I had to restart all TiDB nodes.
In my experience, don’t use partitioned tables; they don’t make much sense on TiDB.
If you have to use them, it’s best to add all the partitions in advance.
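As a rough sketch of adding partitions in advance, assuming a hypothetical table orders_log that is RANGE-partitioned by day with TO_DAYS() and has no MAXVALUE partition (the table and partition names are illustrative, not from this thread):

```sql
-- Pre-create several daily partitions in one DDL statement,
-- so the scheduled job runs far less often.
ALTER TABLE orders_log ADD PARTITION (
    PARTITION p20240601 VALUES LESS THAN (TO_DAYS('2024-06-02')),
    PARTITION p20240602 VALUES LESS THAN (TO_DAYS('2024-06-03')),
    PARTITION p20240603 VALUES LESS THAN (TO_DAYS('2024-06-04'))
);

-- Expired partitions can still be dropped separately when needed.
ALTER TABLE orders_log DROP PARTITION p20240501;
```

Batching the additions this way reduces how many DDL jobs enter the queue, which narrows the window in which a stuck job can block the others.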

| username: wwccmm858 | Original post link

I directly restarted the cluster.

| username: wwccmm858 | Original post link

Each partitioned table needs to add the corresponding partition.

| username: wwccmm858 | Original post link

Restarting the TiDB node had no effect.

| username: zhanggame1 | Original post link

Is this how you restart it?

| username: 随便改个用户名 | Original post link

Is it only this table that gets stuck? Are the other tables functioning normally?

| username: TiDBer_ZxWlj6A1 | Original post link

Isn’t this supposed to be a job executed in TiKV? Then, shut down the TiDB server and restart it after a while.

| username: TiDBer_H5NdJb5Q | Original post link

Waiting for an official response.

| username: vincentLi | Original post link

Following this thread. TiDB's handling of DDL has long seemed strange to me. In fact, many batch jobs use DDL and DML together, and blocking DDL is no less harmful than blocking DML.

| username: vincentLi | Original post link

After thinking about it, I still want to make some suggestions based on my limited understanding:

  1. Use ADMIN SHOW DDL to see which DDL jobs are currently executing, and consider whether the stuck job needs to be cancelled.
  2. If nothing else works, check whether upgrading resolves the issue. Version 8.1 should handle DDL more efficiently.
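
The first suggestion can be sketched as follows. The job ID 123 is a placeholder; the real ID would come from the JOB_ID column of the job list.

```sql
-- Show the current DDL owner and the jobs that are running now.
ADMIN SHOW DDL;

-- List recent DDL jobs with their state (queueing, running, synced, ...).
ADMIN SHOW DDL JOBS;

-- Show the original SQL text of a specific job (placeholder ID).
ADMIN SHOW DDL JOB QUERIES 123;

-- Cancel a job that has not finished yet (placeholder ID).
ADMIN CANCEL DDL JOBS 123;
```

Cancelling only applies to jobs that have not completed, and jobs queued behind the cancelled one should then get a chance to run.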

| username: 友利奈绪 | Original post link

Restarting is still the best solution.