Abnormal Execution Efficiency When Using _tidb_rowid for Small Range Queries

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 使用_tidb_rowid进行小范围查询时执行效率异常

| username: realcp1018

【TiDB Usage Environment】Production Environment
【TiDB Version】5.0.2
【Description】
Recently encountered an abnormal situation when using a batch update tool to clean up data in a table, so I’m sharing it here.
The tool was introduced before: Using tidb-toolkit for batch deletion/update of data

The original SQL is:

delete from db.t where dt < '2023-06-01';

The execution log of the generated batch delete SQL is:

[2023-08-23 16:57:11,381] [INFO] [tk_chunk_update.py:215]: chunk 131491 Done [split_time=0:00:00.014420] [duration=0:00:41.683770] [rows=1000] [sql=DELETE FROM db.t WHERE dt < '2023-06-01' and (`t`.`_tidb_rowid` >= 252168339 and `t`.`_tidb_rowid` < 252169339)]
[2023-08-23 16:57:11,390] [INFO] [tk_chunk_update.py:215]: chunk 131498 Done [split_time=0:00:00.013967] [duration=0:00:40.087776] [rows=1000] [sql=DELETE FROM db.t WHERE dt < '2023-06-01' and (`t`.`_tidb_rowid` >= 252175339 and `t`.`_tidb_rowid` < 252176339)]
[2023-08-23 16:57:11,391] [INFO] [tk_chunk_update.py:302]: write savepoint 252178339, complete percent: 73.18%

Here is the split SQL and table structure:
DELETE FROM db.t WHERE dt < '2023-06-01' AND (`t`.`_tidb_rowid` >= 252168339 AND `t`.`_tidb_rowid` < 252169339)

CREATE TABLE `t` (
  `book_id` int(11) NOT NULL COMMENT 'Primary Key',
  `dt` date NOT NULL,
  `entity_id` bigint(20) NOT NULL,
  `content_type` int(11) NOT NULL DEFAULT '0',
  `book_name` varchar(255) DEFAULT NULL,
  `inter_name` varchar(255) DEFAULT NULL,
  `author` varchar(255) DEFAULT NULL,
  `translator` varchar(255) DEFAULT NULL,
  `contract_id` bigint(10) DEFAULT NULL,
  `word_count` bigint(20) DEFAULT NULL,
  `book_type` int(4) DEFAULT NULL,
  `file_type` int(4) DEFAULT NULL,
  `desc_brief` varchar(255) DEFAULT NULL,
  `book_url` varchar(255) DEFAULT NULL,
  `volume_count` bigint(10) DEFAULT NULL,
  `chapter_count` bigint(10) DEFAULT NULL,
  `created_time` datetime DEFAULT NULL,
  `updated_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'Last Update Time',
  PRIMARY KEY (`dt`,`book_id`) /*T![clustered_index] NONCLUSTERED */,
  UNIQUE KEY `ix_uniq_entityId` (`dt`,`entity_id`),
  KEY `ix_contractId` (`contract_id`)
)
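
For context, chunk boundaries like the ones in the log above are typically derived by stepping through the hidden `_tidb_rowid` column in fixed increments. This is only a sketch of how such a tool might work — the actual queries tidb-toolkit issues may differ:

```sql
-- Hypothetical sketch of rowid-based chunking (illustrative, not the
-- tool's actual implementation).

-- 1. Find the starting rowid of the table:
SELECT MIN(_tidb_rowid) FROM db.t;

-- 2. Delete within a half-open chunk of fixed width (here 1000, as in the log):
DELETE FROM db.t
WHERE dt < '2023-06-01'
  AND _tidb_rowid >= 252168339 AND _tidb_rowid < 252169339;

-- 3. Advance the boundary by the chunk width and repeat until the
--    maximum rowid is passed, recording a savepoint after each chunk.
```

Chunking by `_tidb_rowid` keeps each DELETE small and restartable, which is why the log can report a savepoint and completion percentage after every chunk.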

【The Issue】:
Initially, the execution efficiency of this SQL was very good, and the execution plan showed that rowid was quickly scanned through TableRangeScan.
(screenshot: initial execution plan using TableRangeScan)

But later, the execution efficiency suddenly slowed down, and the execution plan changed:

(screenshot: execution plan using the unique index)

It inexplicably used the unique index, causing the execution time of a single SQL statement to rise from tens of milliseconds to over 40 seconds.
After adding the hint /*+ use_index(t) */, the execution plan reverted to the original table range scan (the plan I expected).
The plan now is to add a hint in the batch update tool: use_index(t, PRIMARY) when the rowid is the primary key, and use_index(t) when it is the implicit _tidb_rowid. This should improve plan stability.
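
As a concrete illustration, the two hinted forms described above would look roughly like this (the second statement assumes a hypothetical table with an integer clustered primary key, which this table does not have — it is shown only for contrast):

```sql
-- USE_INDEX(t) with an empty index list tells the optimizer not to use
-- any index, forcing the TableRangeScan on the hidden _tidb_rowid:
DELETE /*+ USE_INDEX(t) */ FROM db.t
WHERE dt < '2023-06-01'
  AND _tidb_rowid >= 252168339 AND _tidb_rowid < 252169339;

-- Hypothetical clustered-PK variant: when the rowid is the primary key
-- itself, pin the plan to PRIMARY instead:
DELETE /*+ USE_INDEX(t, PRIMARY) */ FROM db.t
WHERE dt < '2023-06-01'
  AND book_id >= 252168339 AND book_id < 252169339;
```

In TiDB, optimizer hints are written in a `/*+ ... */` comment immediately after the DELETE (or SELECT/UPDATE) keyword.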

| username: xfworld | Original post link

After a large amount of deletion, the deleted versions remain until GC runs, which can lead to two problems:

  • Sampled statistics become inaccurate, which affects the execution plan.
  • Scans have to step over large numbers of deleted-version keys, causing some extra overhead.

Solutions:

  • Partition the table by time and release old data with TRUNCATE PARTITION instead of DELETE; this is much more effective.
  • After a large amount of deletion, run GC as soon as possible to release the old versions.
  • Manually collect statistics to raise the table's health, or use hints to work around the problem.
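
A hedged sketch of the first and third suggestions in TiDB syntax — the partition name below is made up, and it assumes the table had been created range-partitioned by dt (the table in this thread is not partitioned):

```sql
-- Release a whole month of old data without generating delete versions
-- (assumes hypothetical range partitioning on dt; p202305 is illustrative):
ALTER TABLE db.t TRUNCATE PARTITION p202305;

-- Refresh statistics after heavy deletes so the optimizer sees current data:
ANALYZE TABLE db.t;
```

TRUNCATE PARTITION drops and recreates the partition's data wholesale, so there are no MVCC delete versions for later scans to skip over.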

| username: tidb菜鸟一只 | Original post link

If very little data matches the condition dt < '2023-06-01', this SQL will use an index on dt; if a lot of data matches, it will use TableRangeScan. My guess is that initially a lot of data met the condition, so TableRangeScan was chosen; later, as the matching data shrank, the optimizer switched to the index ix_uniq_entityId for the dt condition.
I suggest checking the table's statistics to see whether they are outdated and have fallen back to pseudo.
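
To check whether statistics have gone stale or fallen back to pseudo, TiDB offers commands along these lines:

```sql
-- Health score (0-100); low values suggest stale statistics:
SHOW STATS_HEALTHY WHERE Db_name = 'db' AND Table_name = 't';

-- Row count and modify count recorded in the statistics metadata:
SHOW STATS_META WHERE Db_name = 'db' AND Table_name = 't';

-- If EXPLAIN output contains "stats:pseudo" in the operator info,
-- the plan was built without usable statistics; re-collect them:
EXPLAIN SELECT COUNT(*) FROM db.t WHERE dt < '2023-06-01';
ANALYZE TABLE db.t;
```

A high modify count relative to the row count, or a low health score, is a sign that ANALYZE is overdue.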

| username: realcp1018 | Original post link

Replying to the two kind people above ❤️:

The statistics look normal now, and there is no abnormal analyze history; we have a scheduled task that manually updates statistics every day. Additionally, our GC interval is relatively short, and the tool runs only 25 concurrent tasks, each deleting 1000 rows at a time, so version accumulation is unlikely.
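
For reference, the GC interval mentioned here can be inspected and tuned like this (the `10m0s` value below is purely an example, not the poster's setting):

```sql
-- TiDB >= 5.0 exposes GC settings as system variables:
SHOW VARIABLES LIKE 'tidb_gc_life_time';
SET GLOBAL tidb_gc_life_time = '10m0s';  -- example value only

-- The underlying values are also visible in the mysql.tidb table:
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM mysql.tidb
WHERE VARIABLE_NAME IN ('tikv_gc_life_time', 'tikv_gc_run_interval');
```

A shorter GC life time lets TiKV reclaim deleted versions sooner, reducing how many dead keys later range scans must step over.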

In terms of data volume, it shouldn't be enough to abandon TableRangeScan in favor of the unique key: a lot of data still remains to be deleted, and a range of 1000 rowids is small by comparison.

However, it does seem that the change in execution plan was driven by changes in the statistics; I don't think the optimizer has been specially enhanced for this case.

The current mitigation is to add the hint first and watch for further issues. The effect on this instance has been quite good.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.