Adding an index is stuck; how can I identify the cause?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 加索引卡主,如何确认是什么原因呢?

| username: DBRE

【TiDB Usage Environment】Production Environment
【TiDB Version】5.0.2
【Encountered Problem: Phenomenon and Impact】
Executing the following DDL has been stuck for over 3 hours:

alter table tb_xxx add index idx_yyyy(commissioner_id, month_id, city_id);

During this period, the number of rows in the table decreased from around 20 million to about 10 million, and then slowly increased to around 19 million. This was caused by delete operations followed by insert operations.

The row_count in admin show ddl jobs has always been 0. Additionally, approximately every 20 minutes, the following type of log is output:

[2022/12/28 14:26:47.351 +08:00] [INFO] [ddl_worker.go:861] ["[ddl] wait latest schema version changed"] [worker="worker 2, tp add index"] [ver=3484] ["take time"=53.953994ms] [job="ID:4476, Type:add index, State:running, SchemaState:write reorganization, SchemaID:51, TableID:214, RowCount:0, ArgLen:0, start time: 2022-12-28 11:13:41.69 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2022/12/28 14:26:47.353 +08:00] [INFO] [ddl_worker.go:678] ["[ddl] run DDL job"] [worker="worker 2, tp add index"] [job="ID:4476, Type:add index, State:running, SchemaState:write reorganization, SchemaID:51, TableID:214, RowCount:0, ArgLen:0, start time: 2022-12-28 11:13:41.69 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2022/12/28 14:42:59.101 +08:00] [INFO] [ddl_worker.go:861] ["[ddl] wait latest schema version changed"] [worker="worker 2, tp add index"] [ver=3485] ["take time"=53.108472ms] [job="ID:4476, Type:add index, State:running, SchemaState:write reorganization, SchemaID:51, TableID:214, RowCount:0, ArgLen:0, start time: 2022-12-28 11:13:41.69 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2022/12/28 14:42:59.102 +08:00] [INFO] [ddl_worker.go:678] ["[ddl] run DDL job"] [worker="worker 2, tp add index"] [job="ID:4476, Type:add index, State:running, SchemaState:write reorganization, SchemaID:51, TableID:214, RowCount:0, ArgLen:0, start time: 2022-12-28 11:13:41.69 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2022/12/28 15:02:49.460 +08:00] [INFO] [ddl_worker.go:861] ["[ddl] wait latest schema version changed"] [worker="worker 2, tp add index"] [ver=3486] ["take time"=53.689792ms] [job="ID:4476, Type:add index, State:running, SchemaState:write reorganization, SchemaID:51, TableID:214, RowCount:0, ArgLen:0, start time: 2022-12-28 11:13:41.69 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2022/12/28 15:02:49.461 +08:00] [INFO] [ddl_worker.go:678] ["[ddl] run DDL job"] [worker="worker 2, tp add index"] [job="ID:4476, Type:add index, State:running, SchemaState:write reorganization, SchemaID:51, TableID:214, RowCount:0, ArgLen:0, start time: 2022-12-28 11:13:41.69 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]

What is the reason for the slow execution of this DDL? Is it due to the large data changes in the table? Additionally, how can this problem be resolved?
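For reference, the job can be watched with something like the following (a minimal sketch; job ID 4476 is taken from the logs above):

-- Overall DDL status and per-job progress; row_count should keep growing during write reorganization.
ADMIN SHOW DDL;
ADMIN SHOW DDL JOBS;
-- Show the SQL text of the stuck job (job ID 4476 as seen in the logs).
ADMIN SHOW DDL JOB QUERIES 4476;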

| username: tidb菜鸟一只 | Original post link

Your data has changed so much that the rows are already gone before the index over them gets built…

| username: h5n1 | Original post link

Run admin show ddl jobs to check the job, and run explain analyze on a SQL statement that does a full table scan. Post the results so we can see whether there are many MVCC versions.

| username: tidb菜鸟一只 | Original post link

This post looks quite similar to your issue: the data has too many MVCC versions, and an old TiDB version like 5.0 is likely to have problems with that. See whether the methods in the post solve it: Adding an index to a 9-million-row table has not finished after more than a day; how do I troubleshoot it? (900w表添加索引超过一天还没执行完,如何排查? - TiDB 的问答社区)
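A quick way to see whether old versions are being retained is to check the GC settings, for example (a sketch; these are the standard GC entries in the mysql.tidb table):

-- If tikv_gc_life_time is long or tikv_gc_safe_point has not advanced,
-- old MVCC versions accumulate and the index backfill has to wade through them.
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM mysql.tidb
WHERE VARIABLE_NAME IN ('tikv_gc_life_time', 'tikv_gc_safe_point', 'tikv_gc_last_run_time');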

| username: DBRE | Original post link

The table currently holds a steady 21,247,099 rows. After re-executing the ALTER TABLE, admin show ddl jobs still shows that row_count is not changing.

explain analyze select count(*) from tb_xxx;
| id | estRows | actRows | task | access object | execution info | operator info | memory | disk |
| StreamAgg_32 | 1.00 | 1 | root | | time:105.3ms, loops:2 | funcs:count(Column#33)->Column#19 | 9.49 KB | N/A |
| └─IndexReader_33 | 1.00 | 4299 | root | | time:105.1ms, loops:6, cop_task: {num: 4299, max: 3.52ms, min: 190.9µs, avg: 285.1µs, p95: 377.5µs, rpc_num: 4299, rpc_time: 1.19s, copr_cache_hit_ratio: 1.00} | index:StreamAgg_8 | 3.19 KB | N/A |
|   └─StreamAgg_8 | 1.00 | 4299 | cop[tikv] | | tikv_task:{proc max:576ms, min:74ms, p80:311ms, p95:421ms, iters:25037, tasks:4299}, scan_detail: {total_process_keys: 0, total_keys: 0, rocksdb: {delete_skipped_count: 0, key_skipped_count: 0, block: {cache_hit_count: 0, read_count: 0, read_byte: 0 Bytes}}} | funcs:count(1)->Column#33 | N/A | N/A |
|     └─IndexFullScan_28 | 21247099.00 | 21247099 | cop[tikv] | table:tb_xxx, index:idx_dt(cal_dt) | tikv_task:{proc max:576ms, min:74ms, p80:311ms, p95:421ms, iters:25037, tasks:4299}, scan_detail: {total_process_keys: 0, total_keys: 0, rocksdb: {delete_skipped_count: 0, key_skipped_count: 0, block: {cache_hit_count: 0, read_count: 0, read_byte: 0 Bytes}}} | keep order:false, stats:pseudo | N/A | N/A |

| username: DBRE | Original post link

So is there any solution in this scenario?

| username: h5n1 | Original post link

Run explain analyze select max(<non-indexed column>) and look at the scan_detail output, mainly whether the MVCC-related counters around total_process_keys are high; many historical versions, or data that cannot be GC'd, would explain the long scan times.
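Something like the following, with non_indexed_col as a placeholder for any column that has no index; the number to watch is total_keys versus total_process_keys in scan_detail:

-- Forces a full table scan; in scan_detail, total_keys far above
-- total_process_keys means many MVCC versions per live row.
EXPLAIN ANALYZE SELECT MAX(non_indexed_col) FROM tb_xxx;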

| username: DBRE | Original post link

The EXPLAIN ANALYZE result for the query SELECT MAX(sign_cnt) FROM tb_xxx; is as follows:

| id | estRows | actRows | task | access object | execution info | operator info | memory | disk |
| StreamAgg_9 | 1.00 | 1 | root | | time:1m33.9s, loops:2 | funcs:max(tdb58_ifang_stats.tb_xxx.sign_cnt)->Column#19 | 1.45 KB | N/A |
| └─TopN_10 | 1.00 | 1 | root | | time:1m33.9s, loops:2 | tdb58_ifang_stats.tb_xxx.sign_cnt:desc, offset:0, count:1 | 200 Bytes | N/A |
|   └─TableReader_18 | 1.00 | 46 | root | | time:1m33.9s, loops:3, cop_task: {num: 4931, max: 1.07s, min: 58.8ms, avg: 284.8ms, p95: 494.2ms, max_proc_keys: 511093, p95_proc_keys: 0, tot_proc: 21m23.6s, tot_wait: 1m58.5s, rpc_num: 4932, rpc_time: 23m24.4s, copr_cache_hit_ratio: 0.00} | data:TopN_17 | 463 Bytes | N/A |
|     └─TopN_17 | 1.00 | 46 | cop[tikv] | | tikv_task:{proc max:605ms, min:51ms, p80:358ms, p95:422ms, iters:25649, tasks:4931}, scan_detail: {total_process_keys: 21247099, total_keys: 3936959991, rocksdb: {delete_skipped_count: 489891, key_skipped_count: 3939966387, block: {cache_hit_count: 1375902, read_count: 2751371, read_byte: 18.4 GB}}} | tdb58_ifang_stats.tb_xxx.sign_cnt:desc, offset:0, count:1 | N/A | N/A |
|       └─TableFullScan_16 | 21247099.00 | 21247099 | cop[tikv] | table:tb_xxx | tikv_task:{proc max:605ms, min:51ms, p80:357ms, p95:421ms, iters:25649, tasks:4931}, scan_detail: {total_process_keys: 0, total_keys: 0, rocksdb: {delete_skipped_count: 0, key_skipped_count: 0, block: {cache_hit_count: 0, read_count: 0, read_byte: 0 Bytes}}} | keep order:false, stats:pseudo | N/A | N/A |

| username: h5n1 | Original post link

The difference is quite large: total_keys is about 3.94 billion against 21,247,099 processed keys, roughly 185 MVCC versions per live row. Refer to the post linked above, which seems to be your own post as well.

| username: DBRE | Original post link

Then it’s unsolvable :joy:

| username: h5n1 | Original post link

You can consider upgrading.

| username: DBRE | Original post link

Currently, there are no plans to upgrade. I will first talk to the business side to see whether we can resolve this by rebuilding the table and re-importing the data.
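A rough sketch of that rebuild-and-reimport approach, assuming writes can be paused for the cutover (table and index names follow the original DDL; the _new/_old suffixes are placeholders):

-- Create an empty copy and add the index while the table holds no data,
-- so the reorganization does not have to skip old MVCC versions.
CREATE TABLE tb_xxx_new LIKE tb_xxx;
ALTER TABLE tb_xxx_new ADD INDEX idx_yyyy (commissioner_id, month_id, city_id);
-- Backfill the data (in batches for a table this size), then swap the names.
-- The two renames are not atomic, so writes should be stopped around the swap.
INSERT INTO tb_xxx_new SELECT * FROM tb_xxx;
RENAME TABLE tb_xxx TO tb_xxx_old;
RENAME TABLE tb_xxx_new TO tb_xxx;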

| username: h5n1 | Original post link

Minor version upgrades should be fine, and there are many bug fixes.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.