TiDB DDL Execution Time is Long, Rollback Time is Long

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB DDL执行时间长,回滚时间长

| username: dgtgsou

[TiDB Usage Environment] Production Environment
[TiDB Version] V6.5.5
[Reproduction Path] Adding an index, unable to add successfully.
[Encountered Problem: Phenomenon and Impact] Adding an index, unable to add successfully.
A table with 1500 rows, adding an index has not completed in 1 day.
admin show ddl jobs always shows the status as running.

This table has undergone stress testing, with the data volume increasing from 1500 rows to 10 million rows.
A composite index of 2 columns was added to the 10 million rows, but it got stuck and did not move. Finally, admin cancel ddl jobs was executed, and it showed rollback success after 8 hours.

Initially thought it was due to the large data volume, but later performed stress test data cleanup, reducing the data volume back to 1500 rows, and performed analyze table xxx, with the row count displayed normally.

| username: TiDBer_QYr0vohO | Original post link

Check the TiDB server logs, are there any error logs?

| username: zhaokede | Original post link

Adding an index to a table with 1500 rows should normally succeed within seconds;
Check the logs for any errors or if the DDL operation has not completed.

| username: zhanggame1 | Original post link

How is the data cleaning work done?
Take a look at show table regions.

| username: WalterWj | Original post link

During the rolling restart of tidb-server, it might be stuck somewhere. Upgrade to the latest 65x version.

| username: dgtgsou | Original post link

DDL operations on other tables are executed normally, and concurrent execution is possible in V6.5.5.

| username: dgtgsou | Original post link

Restarted the tidb-server, the result is the same.

| username: dgtgsou | Original post link

Delete 10,000 rows in a loop, and after deletion is complete, analyze the table.

| username: WalterWj | Original post link

:thinking: Could it be a partitioned table with many partitions?
If the hardware meets the requirements, adding an index to a 10 million row table should only take about 3 minutes. It shouldn’t be very slow.

| username: dgtgsou | Original post link

There are no hardware issues. Adding an index to other tables with 30 million rows, although time-consuming, can be completed normally.

| username: dgtgsou | Original post link

I upgraded from version v5.4.3 to v6.5.5, and the /tmp/tidb directory was not automatically created by the cluster, resulting in an error stating that the directory does not exist. It seems that the Fast Online DDL feature, which is enabled by default in this new version, did not automatically create the directory. Even after I manually created and authorized it on all TiDB nodes, it still had no effect. It always feels like it’s a large table, not a small table with 1000 rows.

| username: DBAER | Original post link

Marking this, need to follow up later. There are many DDL execution failures on the forum.

| username: TIDB-Learner | Original post link

The stress test is not limited to this table, right? Observe the resource situation of the operating system, such as disk capacity.

| username: dgtgsou | Original post link

show table xxx regions;

The results are significantly more

Production database: 1500 rows, show table xxx regions shows 2185 rows
Test database: 10614185 rows, show table xxx regions shows 120 rows

| username: 裤衩儿飞上天 | Original post link

Did you check for metadata locks at that time?

select * from mysql.tidb_mdl_view;
| username: zhanggame1 | Original post link

Deleting 10,000 rows in a loop still leaves the data, and automatic compaction might take several days to clean up, so it’s slow.
I suggest creating a new table like the old table;
Then insert into the new table select * from the old table;
Delete the old table, rename the new table, and then try creating the index again.

| username: dgtgsou | Original post link

Thank you very much. This is the final plan. Now I’m hoping it can be resolved smoothly.

| username: dgtgsou | Original post link

No lock

| username: 小龙虾爱大龙虾 | Original post link

When adding an index, execute the admin show ddl jobs statement. If the state is in the write reorganization phase and the ROW_COUNT is continuously increasing, it indicates that data is being backfilled and the index creation is normal. If it is not in this state, then something is wrong. Check the MDL lock, and if that doesn’t work, check the logs. If all else fails, try restarting.

| username: mono | Original post link

I thought it was 15 billion rows. That would probably take more than an hour.