Why does min(id) perform a full table scan?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: min(id)为什么走全表扫描

| username: TiDB_C罗

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version]
[Reproduction Path] What operations were performed to encounter the issue
[Encountered Issue: Issue Phenomenon and Impact]
[Resource Configuration]
The table structure is:

Create Table: CREATE TABLE `data` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `req_id` char(32) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `source_from` varchar(32) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL DEFAULT '' COMMENT 'Data Source',
  `source_type` varchar(32) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL DEFAULT '1' COMMENT 'Data Type',
  `create_at` bigint(20) unsigned NOT NULL DEFAULT '0' COMMENT 'Creation Time',
  `update_at` bigint(20) unsigned NOT NULL DEFAULT '0' COMMENT 'Update Time',
  `upload_time` datetime DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'Upload Time',
  PRIMARY KEY (`id`) /*T![clustered_index] CLUSTERED */,
  KEY `idx_created_at` (`create_at`),
  KEY `idx_updated_at` (`update_at`),
  KEY `i_req_id` (`req_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin AUTO_INCREMENT=3256369687
1 row in set (0.00 sec)

The query is:

SELECT
  min(id)
FROM
  data;

Execution Plan:

mysql> explain SELECT   max(id) FROM   data;
+------------------------------+---------+-----------+-----------------+----------------------------------------------+
| id                           | estRows | task      | access object   | operator info                                |
+------------------------------+---------+-----------+-----------------+----------------------------------------------+
| StreamAgg_9                  | 1.00    | root      |                 | funcs:max(order.data.id)->Column#12 |
| └─Limit_13                   | 1.00    | root      |                 | offset:0, count:1                            |
|   └─TableReader_24           | 1.00    | root      |                 | data:Limit_23                                |
|     └─Limit_23               | 1.00    | cop[tikv] |                 | offset:0, count:1                            |
|       └─TableFullScan_22     | 1.00    | cop[tikv] | table:data      | keep order:true, desc                        |
+------------------------------+---------+-----------+-----------------+----------------------------------------------+
5 rows in set (0.00 sec)
| username: 小龙虾爱大龙虾 | Original post link

In reality, it’s not a full table scan; it’s just that the operator is displayed as TableFullScan (because there’s no other operator to represent a table scan :grinning:). In fact, it stops as soon as the maximum value is found. From the execution plan, you can see that there is a Limit operator with count:1 above it, and the TableFullScan has keep order:true.
For the min/max optimization, refer to: Max/Min Function Elimination Rule | PingCAP Documentation Center

| username: Jellybean | Original post link

During logical optimization, the SQL is rewritten: min(id) is converted into an ordered scan with LIMIT 1 (i.e., order by id limit 1).
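The rewrite can be sketched as equivalent SQL (an illustration of the elimination rule, not the exact internal plan form):

```sql
-- The optimizer eliminates min()/max() by turning the aggregate into an
-- ordered scan plus LIMIT 1 (illustrative equivalents):
SELECT id FROM data ORDER BY id ASC  LIMIT 1;  -- min(id)
SELECT id FROM data ORDER BY id DESC LIMIT 1;  -- max(id)
```

On a clustered table this ordered scan can stop after the first key it reads, which is why estRows is 1.00 on every operator in the plan above.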

| username: Jellybean | Original post link

Under normal circumstances, the index column can be used. How many rows of data does this table have?

| username: TiDB_C罗 | Original post link

However, the execution time is very long, over 40 seconds.

| username: TiDB_C罗 | Original post link

More than 10 million

| username: h5n1 | Original post link

Take a look at the actual plan with explain analyze.

| username: 小龙虾爱大龙虾 | Original post link

Take out the actual execution plan.

| username: zhanggame1 | Original post link

The execution plan shows that only 1 row is read. A clustered table is stored in primary-key order, so the smallest value can be fetched directly from the front of the table. If execution is slow, something else is wrong. Have you deleted a large amount of data?

| username: dba远航 | Original post link

Use EXPLAIN ANALYZE SELECT max(id) FROM data; to check the actual execution plan.

| username: zxgaa | Original post link

Isn’t the number of returned rows always just 1?

| username: Kongdom | Original post link

  1. Run explain analyze select max(id) from data and check. explain analyze shows the actual execution plan, while explain shows the estimated one, which may be inaccurate.
  2. Verified that it does perform a table full scan. Even after splitting into two Regions, it still performs a full scan. The only explanation is that my data volume is too small: with 350,000 rows inserted, the optimizer may consider a full table scan faster.
    Max/Min Function Elimination Rule | PingCAP Documentation Center
| username: 随缘天空 | Original post link

It is indeed abnormal for this amount of data to take more than 40 seconds. Could it be a read hotspot issue? You could try changing the primary key definition from id bigint(20) unsigned NOT NULL AUTO_INCREMENT to AUTO_RANDOM.
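For reference, a minimal sketch of an AUTO_RANDOM primary key (hypothetical table name; note that AUTO_RANDOM values are not monotonic, so min(id) would no longer mean "oldest row"):

```sql
-- Hypothetical sketch: AUTO_RANDOM requires a clustered bigint primary key.
-- It scatters inserts across Regions to avoid write hotspots.
CREATE TABLE data_rand (
  id bigint unsigned NOT NULL AUTO_RANDOM,
  req_id char(32) NOT NULL,
  PRIMARY KEY (id) /*T![clustered_index] CLUSTERED */
);
```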

| username: TiDB_C罗 | Original post link

+------------------------------+---------+---------+-----------+-----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+-----------+------+
| id                           | estRows | actRows | task      | access object   | execution info                                                                                                                                                                                                                                                                                                                                                                      | operator info                                | memory    | disk |
+------------------------------+---------+---------+-----------+-----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+-----------+------+
| StreamAgg_9                  | 1.00    | 1       | root      |                 | time:1m0.7s, loops:2                                                                                                                                                                                                                                                                                                                                                                | funcs:min(order.data.id)->Column#12          | 160 Bytes | N/A  |
| └─Limit_13                   | 1.00    | 1       | root      |                 | time:1m0.7s, loops:2                                                                                                                                                                                                                                                                                                                                                                | offset:0, count:1                            | N/A       | N/A  |
|   └─TableReader_24           | 1.00    | 1       | root      |                 | time:1m0.7s, loops:1, cop_task: {num: 221, max: 491.7ms, min: 584.9µs, avg: 274.4ms, p95: 419.9ms, max_proc_keys: 1, p95_proc_keys: 0, tot_proc: 1m0.3s, tot_wait: 23ms, rpc_num: 221, rpc_time: 1m0.6s, copr_cache_hit_ratio: 0.00, distsql_concurrency: 1}                                                                                                                        | data:Limit_23                                | 547 Bytes | N/A  |
|     └─Limit_23               | 1.00    | 1       | cop[tikv] |                 | tikv_task:{proc max:491ms, min:0s, avg: 273.5ms, p80:409ms, p95:419ms, iters:221, tasks:221}, scan_detail: {total_process_keys: 1, total_process_keys_size: 27, total_keys: 153800520, get_snapshot_time: 108ms, rocksdb: {delete_skipped_count: 14402724, key_skipped_count: 168203023, block: {cache_hit_count: 2191, read_count: 240664, read_byte: 2.66 GB, read_time: 11.2s}}} | offset:0, count:1                            | N/A       | N/A  |
|       └─TableFullScan_22     | 1.00    | 1       | cop[tikv] | table:data      | tikv_task:{proc max:491ms, min:0s, avg: 273.5ms, p80:409ms, p95:419ms, iters:221, tasks:221}                                                                                                                                                                                                                                                                                        | keep order:true                              | N/A       | N/A  |
+------------------------------+---------+---------+-----------+-----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+-----------+------+
5 rows in set (1 min 0.77 sec)
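The scan_detail above is the telling part: only total_process_keys: 1 was needed, yet total_keys: 153,800,520 were visited, and RocksDB skipped 14,402,724 deleted entries. This pattern is consistent with MVCC tombstones left behind by large deletes that GC/compaction has not yet cleaned up, so the "scan one row" plan must wade through millions of dead versions first. A minimal sketch for inspecting the GC configuration, assuming the standard mysql.tidb system table:

```sql
-- tikv_gc_life_time / tikv_gc_safe_point are standard rows in mysql.tidb;
-- a long life time keeps deleted versions around and inflates ordered scans.
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM mysql.tidb
WHERE VARIABLE_NAME LIKE 'tikv_gc%';
```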
| username: TiDB_C罗 | Original post link

There is a pt-archiver tool running, deleting expired data. The first run was very fast, and subsequent runs are also very fast.

| username: 有猫万事足 | Original post link

Without a WHERE condition, a direct FROM t will result in either a table full scan or an index full scan.

  • CLUSTERED means that the primary key of the table is a clustered index. In a clustered index table, the key of the row data is composed of the primary key column data provided by the user, so a clustered index table stores at least one key-value pair per row, i.e.,
    • Primary key column data (key) - Row data (value)

You have created a clustered index table, so there is no separate primary key index: the primary key itself serves as the row key (row_id). Scanning in primary-key order is therefore displayed as a table full scan.

| username: TiDB_C罗 | Original post link

Well, but the execution time is long. What is the reason?

| username: 有猫万事足 | Original post link

Let’s see a complete execution plan.

| username: buddyyuan | Original post link

rpc_time: 1m0.6s — nearly the entire minute is spent in cop-task RPCs to TiKV.

| username: buddyyuan | Original post link

How about changing distsql_concurrency to 15?
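If that variable has been lowered, it can be raised per session as a quick experiment (tidb_distsql_scan_concurrency defaults to 15; note that a keep-order Limit scan may still run cop tasks serially, as distsql_concurrency: 1 in the plan above suggests):

```sql
-- Session-level sketch: restore distsql scan concurrency to its default.
SET @@tidb_distsql_scan_concurrency = 15;
```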