Report a bug: Incorrect pagination with LIMIT in hash join scenario

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 反馈一个bug,hash join的情况下,limit分页无法正确分

| username: TiDBer_jYQINSnf

The official response says this is correct. Do you, as users, accept it?
I think that, regardless of whether the ORDER BY values are duplicated, limit 0, 10, limit 10, 10, limit 20, 10, and so on should traverse the entire result set exactly once.

I have developed a sharding middleware, and its LIMIT implementation works as follows:
limit 0, 10 collects 10 rows from each shard, sorts them, and returns the top 10.
limit 10, 10 collects 20 rows from each shard, sorts them, and returns the second group of 10.
For deep pages this essentially pulls all the shard data back to the middleware. It is a rather clumsy method, but it never makes mistakes and the result set is stable.
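
As a minimal sketch of what that per-shard merge looks like in SQL (the shard_1/shard_2 schemas and the orders table are hypothetical; in the middleware the final merge and re-sort happen in code rather than in one statement):

-- Page 2 of the global result, i.e. limit 10, 10, across two shards.
-- Step 1: fetch the first offset + count = 20 rows from every shard.
(SELECT id, create_time FROM shard_1.orders ORDER BY create_time DESC LIMIT 20)
UNION ALL
(SELECT id, create_time FROM shard_2.orders ORDER BY create_time DESC LIMIT 20)
-- Step 2: merge and re-sort the collected rows, then keep the second group of 10.
ORDER BY create_time DESC
LIMIT 10, 10;
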
With the current hash join implementation, if the ORDER BY values are duplicated, rows are effectively picked arbitrarily, so limit 0, 10 and limit 10, 10 become meaningless: you cannot traverse the whole result set with them. Under hash join, limit 0, 10 or limit 10, 10 simply returns an arbitrary 10 rows that satisfy the conditions, which I think is incorrect.

If hash join cannot handle this situation, then the optimizer should not choose hash join for it. Instead, the plan defaults to hash join and you have to use hints to force another join type, which seems unreasonable.
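
For reference, this is the kind of hint meant here, assuming a merge join is acceptable for the query (the table and column names below are placeholders):

-- Force a merge join instead of the default hash join for this hypothetical query.
SELECT /*+ MERGE_JOIN(o) */ o.id, o.create_time
FROM orders o
JOIN order_items i ON i.order_id = o.id
ORDER BY o.create_time DESC
LIMIT 10, 10;
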

| username: Billmay表妹 | Original post link

I’ll ask the product and R&D experts~

| username: Billmay表妹 | Original post link

Reply from the product and R&D engineer:

The result is correct. If you want stable results, you should also add id or table1_col3 to the ORDER BY columns.
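
In other words, the suggested fix is to append a unique column as a tie-breaker to the ORDER BY, roughly like this (table and column names are illustrative):

-- Unstable across pages when create_time contains duplicates:
SELECT id, create_time FROM t ORDER BY create_time DESC LIMIT 10, 10;
-- Stable: break ties with a unique column such as id.
SELECT id, create_time FROM t ORDER BY create_time DESC, id ASC LIMIT 10, 10;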

| username: TiDBer_jYQINSnf | Original post link

I saw the reply, but I do not agree with it.
As I understand it, the purpose of LIMIT is to page through a result set piece by piece. With the current behavior, LIMIT cannot output the entire result set: some rows are returned twice and some rows never appear at all.

Saying that “create_time is duplicated” may explain the phenomenon, but it does not prove that the behavior is correct.

There are many sorting algorithms, some stable and some unstable, and I think a stable sort should be used here. LIMIT should act like a small sliding window that walks the result set from top to bottom, rather than the result set changing every time you look at it, which makes it impossible to retrieve all the rows through LIMIT. There are various ways to avoid this bug, but I do not believe LIMIT should behave this way.

I don’t know whether you have an internal technical committee or similar body that could study the SQL standard and determine whether this behavior really conforms to it.

If this is the definitive answer, then I come to a conclusion:
when the ORDER BY columns contain duplicate values, TiDB cannot return every row of the result set through LIMIT pagination.

Is TiDB willing to accept such a label?

| username: Billmay表妹 | Original post link

You can continue discussing this with the product and R&D team via the link in the issue.

| username: TiDBer_jYQINSnf | Original post link

My English is not good, so I can’t express my meaning accurately in English :man_shrugging:
For now, I have asked the business side to work around the issue with order by create_time desc, o.id asc :man_shrugging:

| username: Billmay表妹 | Original post link

Here is an explanation: 结果集不稳定 (Unstable Result Sets) | PingCAP 文档中心

| username: TiDBer_jYQINSnf | Original post link

The instability of the result set may seem trivial when you query all the data, since the order does not appear to matter much. But once you paginate with LIMIT, you find that you cannot retrieve all the rows in the table, and that becomes a bug. Would you really tell users that every SQL statement using LIMIT needs an extra unique column just to guarantee correct results?

There may be no precedent for distributed databases to follow here, and users’ habits are shifting from single-machine databases to distributed ones. Shouldn’t the changes in user habits forced by the complexity of implementing a distributed database be considered carefully, rather than simply saying: it can’t be done, it’s too complicated, it’s too much trouble?

Even if everyone can easily understand why the result is like this, it doesn’t necessarily mean that the result is correct.

TiDB is doing a lot of unprecedented work, and a small feature may become the de facto standard for distributed databases in the future. I hope decisions can be made more cautiously.

| username: 人如其名 | Original post link

From the user’s perspective, I agree with you and would also like LIMIT to be stable. However, MySQL’s LIMIT syntax does not seem to be designed for pagination and cannot guarantee stability (https://dev.mysql.com/doc/refman/8.0/en/limit-optimization.html). It is merely stable in most scenarios, so people have gotten used to using LIMIT for pagination (unstable scenarios include execution plan changes, table reorganization, and so on).

Using row_number() over() for pagination is more rigorous, though relatively less efficient. Also note that TiDB differs slightly from traditional databases in how the sequence numbers are generated:

mysql> select o_orderkey, row_number() over() as nbr from orders order by o_orderkey limit 10;
+------------+----------+
| o_orderkey | nbr      |
+------------+----------+
|          1 |  4436580 |
|          2 | 13517437 |
|          3 |  5491147 |
|          4 |  9463180 |
|          5 |  2440301 |
|          6 |  9888487 |
|          7 |  4486562 |
|         32 |  8920212 |
|         33 |  5572217 |
|         34 |   564184 |
+------------+----------+
10 rows in set (4.24 sec)
mysql> explain analyze select o_orderkey, row_number() over() as nbr from orders order by o_orderkey limit 10;
+----------------------------+-------------+----------+-----------+----------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------+---------+------+
| id                         | estRows     | actRows  | task      | access object                                | execution info                                                                                                                                                                                                                                                                                                       | operator info                                                          | memory  | disk |
+----------------------------+-------------+----------+-----------+----------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------+---------+------+
| TopN_12                    | 10.00       | 10       | root      |                                              | time:6.2s, loops:2                                                                                                                                                                                                                                                                                                   | tpch.orders.o_orderkey, offset:0, count:10                             | 26.7 KB | N/A  |
| └─Window_17                | 15000000.00 | 15000000 | root      |                                              | time:5.6s, loops:14661                                                                                                                                                                                                                                                                                               | row_number()->Column#11 over(rows between current row and current row) | N/A     | N/A  |
|   └─IndexReader_21         | 15000000.00 | 15000000 | root      |                                              | time:2.19s, loops:14661, cop_task: {num: 15, max: 4.56s, min: 1.95s, avg: 3.71s, p95: 4.56s, max_proc_keys: 1435897, p95_proc_keys: 1435897, tot_proc: 38.9s, tot_wait: 336ms, rpc_num: 15, rpc_time: 55.6s, copr_cache: disabled, distsql_concurrency: 15}                                                          | index:IndexFullScan_20                                                 | 95.8 MB | N/A  |
|     └─IndexFullScan_20     | 15000000.00 | 15000000 | cop[tikv] | table:orders, index:orders_idx1(O_ORDERDATE) | tikv_task:{proc max:3.22s, min:918ms, avg: 2.05s, p80:2.51s, p95:3.22s, iters:14720, tasks:15}, scan_detail: {total_process_keys: 15000000, total_process_keys_size: 690000000, total_keys: 15000020, rocksdb: {key_skipped_count: 15000005, block: {cache_hit_count: 6117, read_count: 3103, read_byte: 115.9 MB}}} | keep order:false                                                       | N/A     | N/A  |
+----------------------------+-------------+----------+-----------+----------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------+---------+------+
4 rows in set (6.21 sec)

This would not be the case in traditional databases, where the sequence numbers are arranged sequentially. If TiDB needs to achieve the same effect, it requires the following modification:

select o_orderkey, row_number() over() as nbr from (select o_orderkey from orders order by o_orderkey limit 10) a;

mysql> select o_orderkey, row_number() over() as nbr from (select o_orderkey from orders order by o_orderkey limit 10) a;
+------------+------+
| o_orderkey | nbr  |
+------------+------+
|          1 |    1 |
|          2 |    2 |
|          3 |    3 |
|          4 |    4 |
|          5 |    5 |
|          6 |    6 |
|          7 |    7 |
|         32 |    8 |
|         33 |    9 |
|         34 |   10 |
+------------+------+
10 rows in set (0.02 sec)

mysql> explain analyze select o_orderkey, row_number() over() as nbr from (select o_orderkey from orders order by o_orderkey limit 10) a;
+------------------------------+---------+---------+-----------+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------+-----------+------+
| id                           | estRows | actRows | task      | access object | execution info                                                                                                                                                                   | operator info                                                          | memory    | disk |
+------------------------------+---------+---------+-----------+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------+-----------+------+
| Window_11                    | 10.00   | 10      | root      |               | time:5.73ms, loops:2                                                                                                                                                             | row_number()->Column#11 over(rows between current row and current row) | N/A       | N/A  |
| └─Limit_15                   | 10.00   | 10      | root      |               | time:5.64ms, loops:2                                                                                                                                                             | offset:0, count:10                                                     | N/A       | N/A  |
|   └─TableReader_28           | 10.00   | 10      | root      |               | time:5.6ms, loops:1, cop_task: {num: 1, max: 4.93ms, proc_keys: 10, rpc_num: 1, rpc_time: 4.67ms, copr_cache: disabled, distsql_concurrency: 1}                                  | data:Limit_27                                                          | 297 Bytes | N/A  |
|     └─Limit_27               | 10.00   | 10      | cop[tikv] |               | tikv_task:{time:0s, loops:1}, scan_detail: {total_process_keys: 10, total_process_keys_size: 270, total_keys: 11, rocksdb: {key_skipped_count: 10, block: {cache_hit_count: 6}}} | offset:0, count:10                                                     | N/A       | N/A  |
|       └─TableFullScan_26     | 10.00   | 10      | cop[tikv] | table:orders  | tikv_task:{time:0s, loops:1}                                                                                                                                                     | keep order:true                                                        | N/A       | N/A  |
+------------------------------+---------+---------+-----------+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------+-----------+------+
5 rows in set (0.02 sec)

| username: TiDBer_jYQINSnf | Original post link

Many people here use LIMIT for pagination, and I suspect everyone is used to doing so. Some frameworks generate SQL automatically, and I suspect there may be bugs in those cases, especially in lists where clicking a column header sorts the data and pagination sits below: as you flip through the pages, some rows can get lost. Businesses migrating from MySQL are likely to run into this issue.

| username: TiDBer_jYQINSnf | Original post link

In a list like that, clicking a column title sorts by that column, with pagination below. In that case, will the generated SQL really add an extra unique column to the ORDER BY?

| username: Hacker_xUwtuKxa | Original post link

Just to interject; this doesn’t affect [Hacker_hnSEntrA]’s question.
For [人如其名]’s example, you can consider adding an ORDER BY inside the window function:

select name2, row_number() over(order by name2) from t2 order by name2 limit 10, 10;

The output is as follows:

+----------------------+-----------------------------------+
| name                 | row_number() over(order by name2) |
+----------------------+-----------------------------------+
| b2305843009213694473 |                                11 |
| b2305843009213694474 |                                12 |
| b2305843009213694475 |                                13 |
| b2305843009213694476 |                                14 |
| b2305843009213694477 |                                15 |
| b2305843009213694478 |                                16 |
| b2305843009213694479 |                                17 |
| b2305843009213694480 |                                18 |
| b2305843009213694481 |                                19 |
| b2305843009213694482 |                                20 |
+----------------------+-----------------------------------+

| username: 人如其名 | Original post link

Okay, thank you. However, this behavior is indeed different from other databases (such as MySQL 8.0). Could it be made consistent with them by default?

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.