TiDB Optimizer Selection Issues

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB优化器选择问题

| username: mono

[TiDB Usage Environment] Production Environment
[TiDB Version] 6.5.7
[Encountered Problem]
The most basic SQL. The execution plan chosen by the optimizer has issues, causing slow queries.

SQL statement:

select * from t1 where create_time >= '2024-03-01 00:00:00' limit 2;

This table has an index on create_time. However, the query does not use the index by default.

mysql> explain select * from t1 where create_time >= '2024-03-01 00:00:00' limit 2;
+------------------------------+---------+-----------+-------------------+------------------------------------------------------------------+
| id                           | estRows | task      | access object     | operator info                                                    |
+------------------------------+---------+-----------+-------------------+------------------------------------------------------------------+
| Limit_8                      | 2.00    | root      |                   | offset:0, count:2                                                |
| └─TableReader_13             | 2.00    | root      |                   | data:Limit_12                                                    |
|   └─Limit_12                 | 2.00    | cop[tikv] |                   | offset:0, count:2                                                |
|     └─Selection_11           | 2.00    | cop[tikv] |                   | ge(t1.weplay_user.create_time, 2024-03-01 00:00:00.000000) |
|       └─TableFullScan_10     | 21.50   | cop[tikv] | table:weplay_user | keep order:false                                                 |
+------------------------------+---------+-----------+-------------------+------------------------------------------------------------------+
5 rows in set (0.00 sec)
mysql> explain select * from t1 use index(idx_create_time) where create_time >= '2024-03-01 00:00:00' limit 2;
+--------------------------------+---------+-----------+-------------------------------------------------------+----------------------------------------------------+
| id                             | estRows | task      | access object                                         | operator info                                      |
+--------------------------------+---------+-----------+-------------------------------------------------------+----------------------------------------------------+
| IndexLookUp_13                 | 2.00    | root      |                                                       | limit embedded(offset:0, count:2)                  |
| ├─Limit_12(Build)              | 2.00    | cop[tikv] |                                                       | offset:0, count:2                                  |
| │ └─IndexRangeScan_10          | 2.00    | cop[tikv] | table:t1, index:idx_create_time(create_time) | range:[2024-03-01 00:00:00,+inf], keep order:false |
| └─TableRowIDScan_11(Probe)     | 2.00    | cop[tikv] | table:t1                                     | keep order:false                                   |
+--------------------------------+---------+-----------+-------------------------------------------------------+----------------------------------------------------+
| username: zhanggame1 | Original post link

Try manually analyzing the table with ANALYZE TABLE. Additionally, how many rows of data are in the table, and how many rows does this statement actually return? From estRows, isn’t the number of data rows too few?

| username: zhang_2023 | Original post link

Collect statistics

| username: 连连看db | Original post link

This situation can also occur when executing in MySQL, either due to inaccurate statistics or a large amount of table data.

| username: buddyyuan | Original post link

How much data is in your table? The previous estimate was that it could find 2 rows with >= ‘2024-03-01 00:00:00’ within 21.5 rows.

| username: tidb菜鸟一只 | Original post link

  1. Collect some statistics and take a look.
  2. Is the table very small, with a total data volume so small that there’s no need to use an index?
| username: TiDBer_jYQINSnf | Original post link

It might be related to the health of the table. I have encountered a situation where the SQL was very fast at first, but later it suddenly became much slower. After investigating for a long time, I found that the execution plan had changed and a slower index was chosen, which was caused by the low health of the table.

| username: buddyyuan | Original post link

Actually, it’s quite normal. This is likely due to a significant data skew, causing the optimizer to determine that a full table scan of 21 rows is sufficient to find data >= ‘2024-03-01 00:00:00’. If the optimizer determines that it needs to scan millions of rows to find data >= ‘2024-03-01 00:00:00’, it will use the index. Therefore, collecting some statistics and ensuring the histogram and topN are accurate should resolve the issue.

| username: Kongdom | Original post link

:thinking: Is this table very small in terms of data volume? If the data volume is particularly small, it will directly perform a full table scan.

| username: 小龙虾爱大龙虾 | Original post link

Brother Xiang is right :+1:

| username: mono | Original post link

This is a table with over 10 million rows. Without using the index, it takes about 5 seconds. Forcing the use of the idx_create_time index, the query time is in milliseconds.

| username: redgame | Original post link

Is it a newly created table?

| username: Kongdom | Original post link

:thinking: That might be the situation mentioned by the previous poster.

| username: TiDBer_aaO4sU46 | Original post link

As the experts suggested, collect some statistics and try again.

| username: residentevil | Original post link

After analyzing the table, if it is determined that the filter condition has an INDEX (must be a prefix), first count the total number of entries, then count based on the filter condition. If the difference in row count is greater than 15%, the optimizer will consider a full table scan to be faster.

| username: dba远航 | Original post link

Try removing “limit 2;” and see what happens.