Execution of Create Table Hangs

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 执行Create table卡死

| username: CAICAI

【TiDB Version】: 5.7.25-TiDB-v5.4.0
【K8s Version】: v1.20.15
【TiDB Operator Version】: 1.3.2
【Problem Encountered】:
After I execute a CREATE TABLE statement, it hangs and cannot be terminated with KILL TIDB <session ID>:

CREATE TABLE `test`  (
  `orderId` varchar(50) NOT NULL COMMENT 'Production Plan Order Number',
  `state` tinyint(3) NULL DEFAULT 10 COMMENT 'Status 10 Not Cut 20 Partially Cut 50 Fully Cut',
  `data` json NULL COMMENT 'Extended Data',
  PRIMARY KEY (`orderId`)
) COMMENT = 'Cutting List';

I viewed the process list with SELECT ID, USER, INSTANCE, INFO FROM INFORMATION_SCHEMA.CLUSTER_PROCESSLIST ORDER BY info DESC.

I viewed the DDL jobs with ADMIN SHOW DDL JOBS; the result is as follows:

The state is always none. The several jobs in the cancelling state in the image above are ones I cancelled myself.

Checking the tidb logs, many errors like the following appear, reported every ten seconds:

[2022/07/06 07:36:03.170 +00:00] [ERROR] [terror.go:307] ["encountered error"] [error=EOF] [stack="github.com/pingcap/tidb/parser/terror.Log\
\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/parser/terror/terror.go:307\
github.com/pingcap/tidb/server.(*Server).onConn\
\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/server.go:516"]
[2022/07/06 07:36:13.171 +00:00] [ERROR] [terror.go:307] ["encountered error"] [error=EOF] [stack="github.com/pingcap/tidb/parser/terror.Log\
\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/parser/terror/terror.go:307\
github.com/pingcap/tidb/server.(*Server).onConn\
\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/server.go:516"]
[2022/07/06 07:36:23.171 +00:00] [ERROR] [terror.go:307] ["encountered error"] [error=EOF] [stack="github.com/pingcap/tidb/parser/terror.Log\
\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/parser/terror/terror.go:307\
github.com/pingcap/tidb/server.(*Server).onConn\
\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/server.go:516"]
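To check whether these errors really arrive on a fixed cadence, the gaps between them can be measured directly from the log. The following is a minimal sketch, assuming the log format shown above and that all entries fall on the same day; the function name eof_gaps is made up for illustration. A steady gap would be consistent with a periodic client, such as a health probe that opens and closes a connection without speaking the MySQL protocol, rather than with the DDL job itself. That interpretation is an assumption and is not confirmed anywhere in this thread.

```shell
# Print the gap, in whole seconds, between consecutive "[error=EOF]" entries.
# Reads a TiDB log on stdin; assumes all entries fall on the same day.
eof_gaps() {
  grep -F '[error=EOF]' |
    awk '{
      # Field 2 looks like "07:36:03.170": hours, minutes, seconds.
      split($2, t, ":")
      now = t[1] * 3600 + t[2] * 60 + t[3]
      if (NR > 1) printf "%.0f\n", now - prev
      prev = now
    }'
}
```

It could be fed from a pod log, for example kubectl logs <tidb-pod> -c tidb | eof_gaps, where the pod and container names are placeholders.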

Why does the DDL statement get stuck, and how can I find out where the problem is?

Cluster node information:

| username: GreenGuan | Original post link

I have encountered two situations:

  1. There is an unfinished DDL operation. Solution: wait for it to finish, or increase the step size to speed it up.
  2. TiDB cannot elect a DDL owner node. Solution: restart the TiDB role so that an owner is elected, using tiup cluster restart <cluster-name> -R tidb (note: make sure to restart all TiDB nodes).

| username: CAICAI | Original post link

  1. How do I adjust the step size?
  2. Do all TiDB nodes need to be restarted?

| username: CAICAI | Original post link

Moreover, by checking the processlist, I didn’t see any unfinished DDL. However, I can’t kill this DDL now.

| username: GreenGuan | Original post link

Then it might be the second situation. Restart the TiDB nodes (note that you need to use tiup to restart every node with the TiDB role; business traffic may be interrupted, so it’s best to do this during off-peak hours).

| username: xingzhenxiang | Original post link

Run ADMIN SHOW DDL to check.
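For reference, ADMIN SHOW DDL reports the current owner (the OWNER_ID and OWNER_ADDRESS columns) alongside the running jobs, so an empty owner there would match the symptom described in this thread. A minimal sketch of running it together with ADMIN SHOW DDL JOBS from the command line follows; the function name show_ddl_state and its arguments are placeholders, and it assumes the mysql client is installed and can reach TiDB's SQL port.

```shell
# Show the DDL owner and the recent DDL jobs through the MySQL client.
# $1 = TiDB host, $2 = SQL port (TiDB defaults to 4000), $3 = user.
# -p makes the client prompt for the password interactively.
show_ddl_state() {
  mysql -h "$1" -P "$2" -u "$3" -p \
    -e 'ADMIN SHOW DDL; ADMIN SHOW DDL JOBS;'
}
```

Example call: show_ddl_state 10.20.186.225 4000 root, reusing the node address that appears later in this thread.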

| username: CAICAI | Original post link

I restarted the TiDB nodes, but an owner still was not elected :joy:

| username: CAICAI | Original post link

The default value of tidb_gc_life_time is 10m, which means that the data deleted within 10 minutes can be restored. If you want to restore the data deleted within 24 hours, you need to set tidb_gc_life_time to 24h.

| username: CAICAI | Original post link

I connected to each of the three TiDB nodes and ran SELECT tidb_is_ddl_owner(). All three returned 0. Restarting the nodes did not get an owner elected. Running:

curl -X POST http://10.20.186.225:10080/ddl/owner/resign

resulted in the error: This node is not a DDL owner, can’t be resigned.
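The same per-node check can be done without a SQL session: the TiDB status port also serves GET /info, whose JSON response contains an is_owner field. The sketch below loops over a list of node addresses and prints that field for each. The function name ddl_owner_report is hypothetical, port 10080 is taken from the curl command above, and the sed expression is a deliberately simple extraction to avoid depending on jq.

```shell
# Print "<host> <is_owner>" for each TiDB address given as an argument.
# Assumes every node serves GET /info on status port 10080.
ddl_owner_report() {
  for host in "$@"; do
    # Pull out the word after "is_owner": (true or false).
    flag=$(curl -s "http://${host}:10080/info" |
      sed -n 's/.*"is_owner" *: *\([a-z]*\).*/\1/p')
    echo "${host} ${flag:-unknown}"
  done
}
```

If every line ends in false, no node holds the DDL owner role, which matches the three zeros returned by tidb_is_ddl_owner().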

| username: xingzhenxiang | Original post link

Run the following on it:

See whether ADMIN CANCEL DDL JOBS can cancel it.

| username: CAICAI | Original post link

It returned an error. Strange.

| username: xingzhenxiang | Original post link

ADMIN CANCEL DDL JOBS 12374

| username: CAICAI | Original post link

This command has been executed, but I don’t have an owner. How can I elect a DDL owner now?

| username: GreenGuan | Original post link

How did you restart it? Please post the command.

| username: CAICAI | Original post link

I deployed on K8s, so I just used kubectl delete pod.
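On a TiDB Operator deployment, restarting all TiDB nodes (as suggested earlier in the thread) rather than a single pod can be done by selecting the pods by label. The sketch below assumes the standard labels that TiDB Operator puts on its pods, app.kubernetes.io/instance and app.kubernetes.io/component; the function name and both arguments are placeholders. It deletes all matching pods at once, so connections to TiDB will be interrupted.

```shell
# Delete every TiDB-server pod of one cluster so they are recreated.
# $1 = namespace, $2 = TidbCluster name (both supplied by the caller).
restart_tidb_pods() {
  kubectl delete pod -n "$1" \
    -l "app.kubernetes.io/instance=$2,app.kubernetes.io/component=tidb"
}
```

Example call: restart_tidb_pods <namespace> <cluster-name>.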

| username: 西伯利亚狼 | Original post link

Did the issue get resolved after the restart? You can check the previous logs.

| username: dba-kit | Original post link

Have you ever changed the namespace of the TiDB cluster? Could it be that the internal domain name recorded within TiDB is incorrect?

| username: songxuecheng | Original post link

pd loop delete

| username: Meditator | Original post link

It seems that there is an issue with your PD cluster. You can check the logs of the PD nodes.
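Besides reading the PD logs, PD's HTTP API offers a quick health check: GET /pd/api/v1/health returns one entry per PD member with a health flag. The following sketch counts the members reported as unhealthy. The function name pd_unhealthy_count is made up, and the PD address passed in is a placeholder for the cluster's PD client URL.

```shell
# Count PD members that report "health": false.
# $1 = PD client URL, for example http://<pd-host>:2379
pd_unhealthy_count() {
  curl -s "$1/pd/api/v1/health" |
    grep -o '"health" *: *false' |
    wc -l | tr -d ' '
}
```

A result of 0 only says that PD considers its members healthy; it does not rule out other PD-side problems, so the logs are still worth checking.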

| username: CAICAI | Original post link

How can you tell that PD is deleting in a loop?