Adding an index to a 9-million-row table has not completed after more than a day; how to troubleshoot?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 900w表添加索引超过一天还没执行完,如何排查?

| username: DBRE

[TiDB Usage Environment] Online
[TiDB Version] v5.2.2
[Encountered Problem] Adding an index to a 9 million row table has not completed after more than a day. Running admin show ddl jobs, the ROW_COUNT stops changing after reaching a certain value. Filtering the tidb.log with the ddl keyword shows the following:
[2022/08/24 09:07:20.332 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=idx] [elementID=11] [totalAddedRowCount=3966492] [doneKey=1168558586.next]
[2022/08/24 09:07:20.336 +08:00] [INFO] [ddl_worker.go:886] ["[ddl] schema version doesn't change"] [worker="worker 2, tp add index"]
[2022/08/24 09:07:20.337 +08:00] [INFO] [ddl_worker.go:727] ["[ddl] run DDL job"] [worker="worker 2, tp add index"] [job="ID:5160, Type:add index, State:running, SchemaState:write reorganization, SchemaID:53, TableID:216, RowCount:3966492, ArgLen:0, start time: 2022-08-23 20:34:43.531 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:435488839088472086"]
[2022/08/24 09:07:25.401 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=idx] [elementID=11] [totalAddedRowCount=3966492] [doneKey=1168558586.next]
[2022/08/24 09:07:25.404 +08:00] [INFO] [ddl_worker.go:886] ["[ddl] schema version doesn't change"] [worker="worker 2, tp add index"]
[2022/08/24 09:07:25.405 +08:00] [INFO] [ddl_worker.go:727] ["[ddl] run DDL job"] [worker="worker 2, tp add index"] [job="ID:5160, Type:add index, State:running, SchemaState:write reorganization, SchemaID:53, TableID:216, RowCount:3966492, ArgLen:0, start time: 2022-08-23 20:34:43.531 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:435488839088472086"]
[2022/08/24 09:07:30.429 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=idx] [elementID=11] [totalAddedRowCount=3966492] [doneKey=1168558586.next]
[2022/08/24 09:07:30.432 +08:00] [INFO] [ddl_worker.go:886] ["[ddl] schema version doesn't change"] [worker="worker 2, tp add index"]
[2022/08/24 09:07:30.436 +08:00] [INFO] [ddl_worker.go:727] ["[ddl] run DDL job"] [worker="worker 2, tp add index"] [job="ID:5160, Type:add index, State:running, SchemaState:write reorganization, SchemaID:53, TableID:216, RowCount:3966492, ArgLen:0, start time: 2022-08-23 20:34:43.531 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:435488839088472086"]
[2022/08/24 09:07:35.467 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=idx] [elementID=11] [totalAddedRowCount=3966492] [doneKey=1168558586.next]
[2022/08/24 09:07:35.470 +08:00] [INFO] [ddl_worker.go:886] ["[ddl] schema version doesn't change"] [worker="worker 2, tp add index"]
[2022/08/24 09:07:35.471 +08:00] [INFO] [ddl_worker.go:727] ["[ddl] run DDL job"] [worker="worker 2, tp add index"] [job="ID:5160, Type:add index, State:running, SchemaState:write reorganization, SchemaID:53, TableID:216, RowCount:3966492, ArgLen:0, start time: 2022-08-23 20:34:43.531 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:435488839088472086"]

[Reproduction Path] Canceling the job with admin cancel ddl jobs <job_id> and re-running the ADD INDEX statement produces the same behavior.
How can I find out which stage the DDL is stuck at, and how can I resolve it?
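For reference, the statements mentioned above take this form in TiDB (5160 is the job ID from the log):

mysql> ADMIN SHOW DDL JOBS;
mysql> ADMIN CANCEL DDL JOBS 5160;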

| username: db_user | Original post link

You can check resource status such as CPU, memory, disk space, and IOPS. Also, please share the desensitized TiDB and TiKV logs from the period when the job got stuck. Normally, backfilling 9 million rows should be quick.

Also, check the configuration of these two parameters: tidb_ddl_reorg_batch_size and tidb_ddl_reorg_worker_cnt.

Also, check how many rows had been backfilled into the index, and how long that took, before the job got stuck.

| username: OnTheRoad | Original post link

Judging by the error, the index backfill timed out. Check the system variables:

  1. tidb_ddl_reorg_worker_cnt (backfill concurrency, default is 4).
  2. tidb_ddl_reorg_batch_size (backfill unit, default is 256).
  3. tidb_ddl_reorg_priority (backfill priority, default is PRIORITY_LOW).
  4. tidb_ddl_error_count_limit (index retry count, default is 512).

Among these, tidb_ddl_reorg_priority is at the SESSION level. Were there a lot of DML operations during the indexing? Try to avoid indexing during peak business hours.

The method to query system variables is as follows:

mysql> show variables like 'tidb_ddl_%';
+----------------------------+--------------+
| Variable_name              | Value        |
+----------------------------+--------------+
| tidb_ddl_error_count_limit | 512          |
| tidb_ddl_reorg_batch_size  | 256          |
| tidb_ddl_reorg_priority    | PRIORITY_LOW |
| tidb_ddl_reorg_worker_cnt  | 4            |
+----------------------------+--------------+
4 rows in set (0.00 sec)

| username: DBRE | Original post link

  1. The result of admin show ddl jobs shows that ROW_COUNT has been stuck at 3966492 for a long time:
+--------+---------+------------+-----------+----------------------+-----------+----------+-----------+---------------------+----------+---------+
| JOB_ID | DB_NAME | TABLE_NAME | JOB_TYPE  | SCHEMA_STATE         | SCHEMA_ID | TABLE_ID | ROW_COUNT | START_TIME          | END_TIME | STATE   |
+--------+---------+------------+-----------+----------------------+-----------+----------+-----------+---------------------+----------+---------+
|   5160 | test    | test       | add index | write reorganization |        53 |      216 |   3966492 | 2022-08-23 20:34:43 | NULL     | running |
+--------+---------+------------+-----------+----------------------+-----------+----------+-----------+---------------------+----------+---------+
  2. Configuration:
| tidb_ddl_reorg_batch_size  | 256   |
| tidb_ddl_reorg_worker_cnt  | 4     |
  3. Resource usage: [screenshots of resource monitoring in the original post]

| username: DBRE | Original post link

There aren't many DML operations during the indexing period, and all parameters are set to their defaults.

| username: OnTheRoad | Original post link

If a TiKV I/O bottleneck has been ruled out, it is recommended to moderately increase the system variables returned by show variables like 'tidb_ddl_%';. Be mindful of the impact on the business: speeding up the index backfill can cause performance fluctuations if the database is busy.
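A minimal sketch of raising them (the values here are illustrative, not recommendations; note that tidb_ddl_reorg_priority is session-scoped):

mysql> SET GLOBAL tidb_ddl_reorg_worker_cnt = 8;
mysql> SET GLOBAL tidb_ddl_reorg_batch_size = 1024;
mysql> SET tidb_ddl_reorg_priority = 'PRIORITY_NORMAL';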

| username: buddyyuan | Original post link

Could you please share the Grafana {cluster_name}-TiDB → DDL monitoring?

| username: DBRE | Original post link

[The reply contained only a screenshot of the Grafana DDL monitoring panel.]

| username: DBRE | Original post link

After increasing them, the ROW_COUNT still does not change.
| tidb_ddl_reorg_batch_size | 512 |
| tidb_ddl_reorg_worker_cnt | 8   |

| username: h5n1 | Original post link

Try re-electing a DDL owner and then adding the index. You can use SELECT tidb_is_ddl_owner(); to check if the current TiDB server is the owner.
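A sketch of checking and resigning the owner (assuming the default status port 10080; the resign endpoint is part of the TiDB HTTP API):

mysql> SELECT tidb_is_ddl_owner();
$ curl -X POST http://{tidb-server-ip}:10080/ddl/owner/resign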

| username: DBRE | Original post link

I’ll give it a try.

| username: wuxiangdong | Original post link

Split the table’s regions, and increase tidb_ddl_reorg_worker_cnt and tidb_ddl_reorg_batch_size.
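A sketch of the split, assuming the target table t has an integer clustered primary key whose value range is known (the bounds and Region count here are illustrative):

mysql> SPLIT TABLE t BETWEEN (1) AND (10000000) REGIONS 16;
mysql> SHOW TABLE t REGIONS;  -- verify the new Region distribution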

| username: DBRE | Original post link

Tried it, still not working. Adding an index is still very slow, and it takes a long time for row_count to change.

| username: h5n1 | Original post link

Could you provide the table structure and index statements?

| username: DBRE | Original post link

Table structure:
CREATE TABLE summary (
id bigint(20) NOT NULL AUTO_INCREMENT,
empName varchar(100) DEFAULT NULL,
empUuid varchar(100) DEFAULT NULL,
deptUuid varchar(100) DEFAULT NULL,
deptName varchar(100) DEFAULT NULL,
positionUuid varchar(100) DEFAULT NULL,
positionName varchar(100) DEFAULT NULL,
empGroupUuid varchar(100) DEFAULT NULL,
empGroupName varchar(100) DEFAULT NULL,
versionUuid varchar(100) DEFAULT NULL,
companyUuid varchar(100) DEFAULT NULL,
companyName varchar(100) DEFAULT NULL,
strokeScore decimal(19,2) DEFAULT NULL,
achieveScore decimal(19,2) DEFAULT NULL,
attendScore decimal(19,2) DEFAULT NULL,
customScore decimal(19,2) DEFAULT NULL,
noReachScore decimal(19,2) DEFAULT NULL,
totalScore decimal(19,2) DEFAULT NULL,
assessDateScore decimal(19,2) DEFAULT NULL,
assessWeekScore decimal(19,2) DEFAULT NULL,
assessMonthScore decimal(19,2) DEFAULT NULL,
cal_dt varchar(50) DEFAULT NULL,
approveScore decimal(19,2) DEFAULT NULL,
PRIMARY KEY (id) /*T![clustered_index] CLUSTERED */,
KEY idx_companyUuid (companyUuid),
KEY idx_empUuid (empUuid),
KEY idx_deptUuid (deptUuid),
KEY idx_positionUuid (positionUuid),
KEY idx_empGroupUuid (empGroupUuid),
KEY idx_versionUuid (versionUuid),
KEY idx_cal_dt (cal_dt)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

Add index statement:
ALTER TABLE summary ADD INDEX idx_cal_dt_companyuuid (cal_dt,companyUuid);

Current performance: The row count runs very fast up to around 3.5 million, then it stalls for a long time at around 4.8 million, and then again at around 5.57 million (it has already been 2 days at this point). It feels like it might take a few more days to complete.

| username: forever | Original post link

Check how long the GC lifetime is set to. Is a large amount of data being updated or deleted in the meantime? If the GC lifetime is set too long, a lot of old row versions can accumulate.
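One way to check the GC settings (a sketch; in v5.0 and later they are exposed both as system variables and in the mysql.tidb table):

mysql> SHOW VARIABLES LIKE 'tidb_gc%';
mysql> SELECT VARIABLE_NAME, VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME LIKE 'tikv_gc%';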

| username: DBRE | Original post link

  1. GC-related settings
    | tidb_gc_life_time | 10m0s |
    | tidb_gc_run_interval | 10m0s |

  2. How to check if there are many versions of the data?

  3. mysql> select count(*) from summary;
    +----------+
    | count(*) |
    +----------+
    |  9113495 |
    +----------+
    1 row in set (1 min 6.88 sec)

In the dashboard, the "Coprocessor Read" section for the slow SQL shows the following. Is it related to this?
Visible versions: 9.0 Mil
Encountered versions: 2.3 Bil (including deleted or overwritten versions that have not been GC'd)
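(The same version counts can also be pulled from the slow-query table; a sketch, using the process_keys and total_keys fields of information_schema.slow_query. A total_keys value far above process_keys means the scan is wading through many un-GC'd old versions:)

mysql> SELECT query_time, process_keys, total_keys FROM information_schema.slow_query WHERE query LIKE '%summary%' ORDER BY query_time DESC LIMIT 5;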

| username: alfred | Original post link

Is there such a problem when adding indexes to other fields in this table?

| username: forever | Original post link

Try creating a table with the same structure and inserting the data to see if the issue is with this specific table or if it affects all tables.
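A minimal sketch of that test (the copy-table name is illustrative; for a 9-million-row table the INSERT may need to be batched to stay under the transaction size limit):

mysql> CREATE TABLE summary_copy LIKE summary;
mysql> INSERT INTO summary_copy SELECT * FROM summary;
mysql> ALTER TABLE summary_copy ADD INDEX idx_cal_dt_companyuuid (cal_dt, companyUuid);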

| username: DBRE | Original post link

This table hasn't had indexes added on other fields recently, and the current job has now slowly backfilled more than 7 million rows. I'll run that test in a few days, once the indexing completes.