The issue of execution time changes after forcibly specifying the join order in TiDB index join

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb index join 强行指定连接顺序后执行时间变化的问题

| username: Raymond

Dear experts, please test the execution performance of SQL statements under the tpch database involving the customer table (b table) and orders table (a table).

mysql> select count() from customer;
±---------+
| count(
) |
±---------+
| 3000000 |
±---------+
1 row in set (0.41 sec)

mysql> select count() from orders;
±---------+
| count(
) |
±---------+
| 29955968 |
±---------+
1 row in set (4.38 sec)

  1. First, I executed statement 1. The join order of the index join is determined by the optimizer itself, with the b table as the outer table and the a table as the inner table. The execution time was found to be 52.99s.
mysql> explain analyze select /*+ INL_JOIN(a,b) */ a.O_ORDERKEY, a.O_CUSTKEY,b.C_NAME  from orders a  join customer b on  a.O_CUSTKEY = b.C_CUSTKEY;
+-----------------------------+-------------+----------+-----------+-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+------+
| id                          | estRows     | actRows  | task      | access object                             | execution info                                                                                                                                                                                                                                                                                               | operator info                                                                                                                                                       | memory  | disk |
+-----------------------------+-------------+----------+-----------+-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+------+
| IndexJoin_19                | 29955968.00 | 29955968 | root      |                                           | time:53s, loops:29255, inner:{total:4m16.6s, concurrency:5, task:126, construct:1.76s, fetch:4m11.4s, build:3.42s}, probe:4.25s                                                                                                                                                                              | inner join, inner:IndexReader_18, outer key:tpch.customer.c_custkey, inner key:tpch.orders.o_custkey, equal cond:eq(tpch.customer.c_custkey, tpch.orders.o_custkey) | 55.6 MB | N/A  |
| ├─TableReader_31(Build)     | 3000000.00  | 3000000  | root      |                                           | time:93.9ms, loops:2941, cop_task: {num: 116, max: 429.3ms, min: 3.11ms, avg: 88ms, p95: 266.5ms, max_proc_keys: 50144, p95_proc_keys: 50144, tot_proc: 4.35s, tot_wait: 434ms, rpc_num: 116, rpc_time: 10.2s, copr_cache: disabled, distsql_concurrency: 15}                                                | data:TableFullScan_30                                                                                                                                               | 13.0 MB | N/A  |
| │ └─TableFullScan_30        | 3000000.00  | 3000000  | cop[tikv] | table:b                                   | tikv_task:{proc max:278ms, min:0s, avg: 35.5ms, p80:50ms, p95:155ms, iters:3387, tasks:116}, scan_detail: {total_process_keys: 3000000, total_process_keys_size: 610451426, total_keys: 3000116, get_snapshot_time: 41.6ms, rocksdb: {key_skipped_count: 3000000, block: {cache_hit_count: 10583}}}          | keep order:false                                                                                                                                                    | N/A     | N/A  |
| └─IndexReader_18(Probe)     | 29955968.00 | 29955968 | root      |                                           | time:4m7.8s, loops:29678, cop_task: {num: 2635, max: 906.4ms, min: 2.16ms, avg: 119.1ms, p95: 436.1ms, max_proc_keys: 50144, p95_proc_keys: 50144, tot_proc: 2m55.8s, tot_wait: 22.4s, rpc_num: 2635, rpc_time: 5m13.7s, copr_cache: disabled, distsql_concurrency: 15}                                      | index:IndexRangeScan_17                                                                                                                                             | 12.1 KB | N/A  |
|   └─IndexRangeScan_17       | 29955968.00 | 29955968 | cop[tikv] | table:a, index:index_o_custkey(O_CUSTKEY) | tikv_task:{proc max:723ms, min:0s, avg: 66ms, p80:104ms, p95:352ms, iters:39564, tasks:2635}, scan_detail: {total_process_keys: 29955968, total_process_keys_size: 1377974528, total_keys: 32958329, get_snapshot_time: 403.3ms, rocksdb: {key_skipped_count: 29955968, block: {cache_hit_count: 16464198}}} | range: decided by [eq(tpch.orders.o_custkey, tpch.customer.c_custkey)], keep order:false                                                                            | N/A     | N/A  |
+-----------------------------+-------------+----------+-----------+-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+------+
5 rows in set (52.99 sec)

  1. Executed statement 2, which forcibly specified the join order, making the a table the outer table and the b table the inner table. The execution time was significantly reduced to 27.66s.
mysql>  explain analyze select /*+ INL_JOIN(b) */   a.O_ORDERKEY, a.O_CUSTKEY,b.C_NAME  from orders a  join customer b on  a.O_CUSTKEY = b.C_CUSTKEY;
+-----------------------------+-------------+----------+-----------+-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+------+
| id                          | estRows     | actRows  | task      | access object                             | execution info                                                                                                                                                                                                                                                                                           | operator info                                                                                                                                                      | memory  | disk |
+-----------------------------+-------------+----------+-----------+-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+------+
| IndexJoin_12                | 29955968.00 | 29955968 | root      |                                           | time:27.6s, loops:29255, inner:{total:2m9.3s, concurrency:5, task:1179, construct:11.6s, fetch:1m57.3s, build:398.2ms}, probe:7.04s                                                                                                                                                                      | inner join, inner:TableReader_9, outer key:tpch.orders.o_custkey, inner key:tpch.customer.c_custkey, equal cond:eq(tpch.orders.o_custkey, tpch.customer.c_custkey) | 15.2 MB | N/A  |
| ├─IndexReader_20(Build)     | 29955968.00 | 29955968 | root      |                                           | time:523.5ms, loops:29300, cop_task: {num: 784, max: 250.2ms, min: 2.26ms, avg: 81.5ms, p95: 168.1ms, max_proc_keys: 50144, p95_proc_keys: 50144, tot_proc: 21.1s, tot_wait: 866ms, rpc_num: 784, rpc_time: 1m3.9s, copr_cache: disabled, distsql_concurrency: 15}                                       | index:IndexFullScan_19                                                                                                                                             | 12.3 MB | N/A  |
| │ └─IndexFullScan_19        | 29955968.00 | 29955968 | cop[tikv] | table:a, index:index_o_custkey(O_CUSTKEY) | tikv_task:{proc max:115ms, min:0s, avg: 25.9ms, p80:39ms, p95:66ms, iters:32383, tasks:784}, scan_detail: {total_process_keys: 29955968, total_process_keys_size: 1377974528, total_keys: 29956752, get_snapshot_time: 133.9ms, rocksdb: {key_skipped_count: 29955968, block: {cache_hit_count: 23760}}} | keep order:false                                                                                                                                                   | N/A     | N/A  |
| └─TableReader_9(Probe)      | 29955968.00 | 2001764  | root      |                                           | time:1m56.3s, loops:3529, cop_task: {num: 5146, max: 140.6ms, min: 621.7µs, avg: 25.4ms, p95: 56.4ms, max_proc_keys: 992, p95_proc_keys: 992, tot_proc: 16.5s, tot_wait: 6.31s, rpc_num: 5146, rpc_time: 2m10.6s, copr_cache: disabled, distsql_concurrency: 15}                                         | data:TableRangeScan_8                                                                                                                                              | N/A     | N/A  |
|   └─TableRangeScan_8        | 29955968.00 | 2001764  | cop[tikv] | table:b                                   | tikv_task:{proc max:75ms, min:0s, avg: 3.59ms, p80:5ms, p95:10ms, iters:17550, tasks:5146}, scan_detail: {total_process_keys: 2001764, total_process_keys_size: 407321028, total_keys: 3005092, get_snapshot_time: 129.1ms, rocksdb: {key_skipped_count: 1999798, block: {cache_hit_count: 5439074}}}    | range: decided by [tpch.orders.o_custkey], keep order:false                                                                                                        | N/A     | N/A  |
+-----------------------------+-------------+----------+-----------+-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+------+
5 rows in set (27.66 sec)

This is very confusing. Initially, I guessed that even though the a table has a large amount of data, since the a table uses an index and does not need to go back to the table, the scanning speed is very fast despite having nearly 30 million rows, so the a table as the outer table would be faster.

So I tried the third statement:
mysql> explain analyze select /*+ INL_JOIN(b) */ a.O_ORDERKEY, a.O_CUSTKEY,a.O_ORDERDATE,b.C_NAME from orders a join customer b on a.O_CUSTKEY = b.C_CUSTKEY;

mysql>  explain analyze select /*+ INL_JOIN(b) */   a.O_ORDERKEY, a.O_CUSTKEY,a.O_ORDERDATE,b.C_NAME  from orders a  join customer b on  a.O_CUSTKEY = b.C_CUSTKEY;
+-----------------------------+-------------+----------+-----------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+------+
| id                          | estRows     | actRows  | task      | access object | execution info                                                                                                                                                                                                                                                                                                                                                      | operator info                                                                                                                                                      | memory  | disk |
+-----------------------------+-------------+----------+-----------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+------+
| IndexJoin_12                | 29955968.00 | 29955968 | root      |               | time:2m58.7s, loops:29255, inner:{total:14m50.6s, concurrency:5, task:1180, construct:46.3s, fetch:13m59.3s, build:4.91s}, probe:18.3s                                                                                                                                                                                                                              | inner join, inner:TableReader_9, outer key:tpch.orders.o_custkey, inner key:tpch.customer.c_custkey, equal cond:eq(tpch.orders.o_custkey, tpch.customer.c_custkey) | 21.9 MB | N/A  |
| ├─TableReader_18(Build)     | 29955968.00 | 29955968 | root      |               | time:1.59s, loops:29317, cop_task: {num: 1025, max: 608.4ms, min: 3.23ms, avg: 158.7ms, p95: 333.4ms, max_proc_keys: 50144, p95_proc_keys: 50144, tot_proc: 1m30.4s, tot_wait: 7.71s, rpc_num: 1025, rpc_time: 2m42.6s, copr_cache: disabled, distsql_concurrency: 15}                                                                                              | data:TableFullScan_17                                                                                                                                              | 18.4 MB | N/A  |
| │ └─TableFullScan_17        | 29955968.00 | 29955968 | cop[tikv] | table:a       | tikv_task:{proc max:515ms, min:0s, avg: 84.3ms, p80:154ms, p95:228ms, iters:33322, tasks:1025}, scan_detail: {total_process_keys: 29955968, total_process_keys_size: 4549646548, total_keys: 29956993, get_snapshot_time: 1.68s, rocksdb: {key_skipped_count: 29955968, block: {cache_hit_count: 63232, read_count: 16425, read_byte: 282.6 MB, read_time: 11.8s}}} | keep order:false                                                                                                                                                   | N/A     | N/A  |
| └─TableReader_9(Probe)      | 29955968.00 | 29744299 | root      |               | time:13m43.2s, loops:34209, cop_task: {num: 39435, max: 595.3ms, min: 657.8µs, avg: 77.6ms, p95: 211.6ms, max_proc_keys: 2016, p95_proc_keys: 2016, tot_proc: 21m47s, tot_wait: 8m43s, rpc_num: 39435, rpc_time: 50m57.4s, copr_cache: disabled, distsql_concurrency: 15}                                                                                           | data:TableRangeScan_8                                                                                                                                              | N/A     | N/A  |
|   └─TableRangeScan_8        | 29955968.00 | 29744299 | cop[tikv] | table:b       | tikv_task:{proc max:556ms, min:0s, avg: 33.4ms, p80:60ms, p95:133ms, iters:162048, tasks:39435}, scan_detail: {total_process_keys: 29744299, total_process_keys_size: 6052480951, total_keys: 29943155, get_snapshot_time: 2.1s, rocksdb: {key_skipped_count: 335362, block: {cache_hit_count: 141357867, read_count: 6, read_byte: 616.2 KB, read_time: 72.8ms}}}  | range: decided by [tpch.orders.o_custkey], keep order:false                                                                                                        | N/A     | N/A  |
+-----------------------------+-------------+----------+-----------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+------+
5 rows in set (2 min 58.70 sec)

In the third statement, the a table could not use the index, and the total execution time reached 2 min 58.70 sec, which is very confusing. Additionally, in the probe phase of the b table, there are 29744299 rows, but the b table only has 3 million rows in total, which is hard to explain.

Seeking expert explanation:

  1. Why is statement 2 much faster than statement 1?
    My guess:

  2. The a table uses an index and does not need to go back to the table, so the scanning speed is fast.

  3. In the probe phase of the index join, statement 1 needs to perform secondary index lookups on nearly 30 million rows of the a table, which is slower than scanning the entire 3 million rows of the b table.

  4. In the execution plan of statement 3, the probe phase of the b table has 29744299 rows, but the b table only has 3 million rows in total. How can this be explained?

| username: 人如其名 | Original post link

Background knowledge: One of the optimization points of IndexJoin is to convert seek operations into next sequential scans as much as possible when looking up data. Seek requires binary search, which may involve reading multiple blocks, while next is a linked list search, which is entirely the next memory address, making it very efficient.
Example: Suppose you are looking for keys x and y.

  1. If x and y are completely consecutive, TiKV will use seek to locate x, then try next to find the next key value y. Since they are consecutive, it will find y with just one next operation.
  2. If x and y are approximately consecutive, a few next operations will still find y. Within a limited number of next scans, the efficiency is still much higher than seek.
  3. If x and y are completely non-consecutive and far apart, TiKV will default to using seek after 10 unsuccessful next attempts to find y, resulting in two seek operations and potentially reading many blocks.

For IndexJoin, the left table (I hate the build and probe terms in TiDB execution plans!!!) processes a batch of data, obtains keys based on the join condition, and sorts and deduplicates the keys!!! The sorted keys form a kvRange to read the right table data from TiKV. Why sort the keys? To convert seek operations into next operations as much as possible. If most seek operations can be converted into next operations, efficiency will greatly improve as the number of blocks read will significantly decrease.
Based on the above theoretical knowledge, let’s analyze your problem:

  • Why is statement 2 much faster than statement 1?

    1. The most obvious reason is that statement 2 reads far fewer blocks than statement 1, so it is faster.

    2. In-depth analysis: In statement 1, the left table (customer table b) o_custkey is a primary key that, after sorting and deduplication (this should be a clustered index table), does not filter data and still needs to fetch all matching data from the right table (order table a). Therefore, for any key in the left table, the right table can almost always find several matching keys and return data, resulting in many block reads. In statement 2, the left table (order table a) fetches data through the index index_o_custkey (O_CUSTKEY), but the overall read index is roughly ordered (although keeporder=false is not strictly ordered, and the number of keys in a region is large enough to ignore cross-region parallel disorder). Therefore, after the left table gets a batch of data (which is roughly ordered globally), it sorts and deduplicates the keys. During this process, it can remove many duplicate keys (after all, customer.c_custkey and order.o_custkey have a one-to-many relationship). Overall, it only needs to find 2,001,764 keys. The left table scans the entire table (using next, reading fewer blocks), and the right table needs to find far fewer keys (much fewer than the right table in statement 1), so statement 2 is faster!

  • Why does the execution plan in statement 3 show 29,744,299 rows in the probe phase of table b, but table b only has 3 million rows in total?
    For statement 3, the left table (order a) scans the entire table. For a clustered index (primary key), the full table scan is roughly ordered, but the o_custkey inside is completely unordered! Therefore, after the left table gets a batch of data, it finds the join condition (o_custkey) and sorts and deduplicates it. However, because the overall data is unordered, the o_custkey in this batch is unlikely to be deduplicated. Therefore, sorting and deduplication are likely useless, and almost all keys (29,955,968) need to be traversed in the right table to find related records (the number of keys to be found is 29,744,299, with only 210,000 deduplicated). Additionally, because most keys obtained by the left table in each batch are non-consecutive, many seeks in the right table cannot be converted into next operations, resulting in many block reads. Therefore, the efficiency is much worse than in statements 1 and 2.

Additionally, when posting execution plans, format the text instead of pasting it directly. It’s too hard to read otherwise!
Here’s a formatted version.

| username: ShawnYan | Original post link

Perhaps a link to “The Art of Asking Questions” should be added here.

| username: 人如其名 | Original post link

Using the principle of sorting and deduplication of keys!!! to optimize the query efficiency of the right table, we first sort table a by o_custkey, and then join it with customer b. This results in fewer records being read from customer b. The test is as follows:

Original statement similar to statement 3:
mysql> explain analyze select /*+ INL_JOIN(a,b) */ a.O_ORDERKEY, a.O_CUSTKEY, b.C_NAME from orders a join customer b on a.O_CUSTKEY = b.C_CUSTKEY;
+-------------------------------+-------------+----------+-----------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------+
| id                            | estRows     | actRows  | task      | access object | execution info                                                                                                                                                                                                                                                                                                                            | operator info                                                                                                                                                               | memory   | disk |
+-------------------------------+-------------+----------+-----------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------+
| Projection_9                  | 15000000.00 | 15000000 | root      |               | time:41.6s, loops:14650, RU:139703.470597, Concurrency:5                                                                                                                                                                                                                                                                                  | tpch10.orders.o_orderkey, tpch10.orders.o_custkey, tpch10.customer.c_name                                                                                                   | 515.4 KB | N/A  |
| └─IndexJoin_14                | 15000000.00 | 15000000 | root      |               | time:40.2s, loops:14650, inner:{total:3m22.3s, concurrency:5, task:595, construct:23.6s, fetch:2m54.4s, build:4.34s}, probe:12.7s                                                                                                                                                                                                         | inner join, inner:TableReader_11, outer key:tpch10.orders.o_custkey, inner key:tpch10.customer.c_custkey, equal cond:eq(tpch10.orders.o_custkey, tpch10.customer.c_custkey) | 19.5 MB  | N/A  |
|   ├─TableReader_20(Build)     | 15000000.00 | 15000000 | root      |               | time:634.3ms, loops:14700, cop_task: {num: 696, max: 276.3ms, min: 643.3µs, avg: 54.7ms, p95: 144ms, max_proc_keys: 50144, p95_proc_keys: 50144, tot_proc: 11.2s, tot_wait: 1.24s, rpc_num: 696, rpc_time: 37.9s, copr_cache_hit_ratio: 0.00, build_task_duration: 36.3µs, max_distsql_concurrency: 15}                                   | data:TableFullScan_19                                                                                                                                                       | 12.3 MB  | N/A  |
|   │ └─TableFullScan_19        | 15000000.00 | 15000000 | cop[tikv] | table:a       | tikv_task:{proc max:164ms, min:0s, avg: 19.4ms, p80:38ms, p95:69ms, iters:17387, tasks:696}, scan_detail: {total_process_keys: 14996992, total_process_keys_size: 2277281755, total_keys: 14997686, get_snapshot_time: 9.31ms, rocksdb: {key_skipped_count: 14996992, block: {cache_hit_count: 78435}}}                                   | keep order:false                                                                                                                                                            | N/A      | N/A  |
|   └─TableReader_11(Probe)     | 15000000.00 | 14789709 | root      |               | time:2m42.4s, loops:16536, cop_task: {num: 18511, max: 319.9ms, min: 304.8µs, avg: 48.6ms, p95: 117.3ms, max_proc_keys: 2016, p95_proc_keys: 2016, tot_proc: 2m18.6s, tot_wait: 27.8s, rpc_num: 18511, rpc_time: 14m58.9s, copr_cache_hit_ratio: 0.00, build_task_duration: 1.46s, max_distsql_concurrency: 7, max_extra_concurrency: 3}  | data:TableRangeScan_10                                                                                                                                                      | N/A      | N/A  |
|     └─TableRangeScan_10       | 15000000.00 | 14789709 | cop[tikv] | table:b       | tikv_task:{proc max:0s, min:0s, avg: 12.5ms, p80:22ms, p95:52ms, iters:78329, tasks:18511}, scan_detail: {total_process_keys: 14789529, total_process_keys_size: 3009488955, total_keys: 14970326, get_snapshot_time: 386.7ms, rocksdb: {key_skipped_count: 332902, block: {cache_hit_count: 5801406}}}                                   | range: decided by [tpch10.orders.o_custkey], keep order:false                                                                                                               | N/A      | N/A  |
+-------------------------------+-------------+----------+-----------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------+
6 rows in set (41.59 sec)

-- Optimized statement
-- By sorting the data in table a by o_custkey and then joining it with table b, we can see that the number of records scanned in table b is significantly reduced. This demonstrates the use of "sorting and deduplication of keys!!!". Even with the sort operator (which is time-consuming), the execution time is faster than the above SQL, and fewer block reads are generated.
mysql> explain analyze select /*+ INL_JOIN(a,b) */ a.O_ORDERKEY, a.O_CUSTKEY, b.C_NAME from (select * from orders a order by o_custkey ) a join customer b on a.O_CUSTKEY = b.C_CUSTKEY;
+--------------------------------+-------------+----------+-----------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------+
| id                             | estRows     | actRows  | task      | access object | execution info                                                                                                                                                                                                                                                                                                                | operator info                                                                                                                                                               | memory   | disk    |
+--------------------------------+-------------+----------+-----------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------+
| Projection_11                  | 15000000.00 | 15000000 | root      |               | time:36.2s, loops:14650, RU:43678.919975, Concurrency:5                                                                                                                                                                                                                                                                       | tpch10.orders.o_orderkey, tpch10.orders.o_custkey, tpch10.customer.c_name                                                                                                   | 519.7 KB | N/A     |
| └─IndexJoin_17                 | 15000000.00 | 15000000 | root      |               | time:36.1s, loops:14650, inner:{total:10s, concurrency:5, task:594, construct:4.34s, fetch:5.52s, build:176.4ms}, probe:1.81s                                                                                                                                                                                                 | inner join, inner:TableReader_14, outer key:tpch10.orders.o_custkey, inner key:tpch10.customer.c_custkey, equal cond:eq(tpch10.orders.o_custkey, tpch10.customer.c_custkey) | 11.9 MB  | N/A     |
|   ├─Sort_22(Build)             | 15000000.00 | 15000000 | root      |               | time:35.5s, loops:14654                                                                                                                                                                                                                                                                                                       | tpch10.orders.o_custkey                                                                                                                                                     | 352.0 MB | 0 Bytes |
|   │ └─TableReader_25           | 15000000.00 | 15000000 | root      |               | time:1.42s, loops:14707, cop_task: {num: 696, max: 121.1ms, min: 371.9µs, avg: 29.6ms, p95: 71.3ms, max_proc_keys: 50144, p95_proc_keys: 50144, tot_proc: 10.3s, tot_wait: 258.5ms, rpc_num: 696, rpc_time: 20.6s, copr_cache_hit_ratio: 0.00, build_task_duration: 48.3µs, max_distsql_concurrency: 15}                      | data:TableFullScan_24                                                                                                                                                       | 4.80 MB  | N/A     |
|   │   └─TableFullScan_24       | 15000000.00 | 15000000 | cop[tikv] | table:a       | tikv_task:{proc max:80ms, min:0s, avg: 16.5ms, p80:35ms, p95:61ms, iters:17387, tasks:696}, scan_detail: {total_process_keys: 14999008, total_process_keys_size: 2277588077, total_keys: 14999703, get_snapshot_time: 7.91ms, rocksdb: {key_skipped_count: 14999008, block: {cache_hit_count: 78449}}}                        | keep order:false                                                                                                                                                            | N/A      | N/A     |
|   └─TableReader_14(Probe)      | 15000000.00 | 1000537  | root      |               | time:5.14s, loops:1774, cop_task: {num: 2282, max: 102.9ms, min: 280µs, avg: 2.18ms, p95: 4.83ms, max_proc_keys: 992, p95_proc_keys: 992, tot_proc: 2.89s, tot_wait: 198.8ms, rpc_num: 2282, rpc_time: 4.95s, copr_cache_hit_ratio: 0.00, build_task_duration: 24.2ms, max_distsql_concurrency: 2, max_extra_concurrency: 1}  | data:TableRangeScan_13                                                                                                                                                      | N/A      | N/A     |
|     └─TableRangeScan_13        | 15000000.00 | 1000537  | cop[tikv] | table:b       | tikv_task:{proc max:0s, min:0s, avg: 1.14ms, p80:2ms, p95:3ms, iters:7599, tasks:2282}, scan_detail: {total_process_keys: 1000537, total_process_keys_size: 203587915, total_keys: 1502181, get_snapshot_time: 31.7ms, rocksdb: {key_skipped_count: 999924, block: {cache_hit_count: 1023839}}}                               | range: decided by [tpch10.orders.o_custkey], keep order:false                                                                                                               | N/A      | N/A     |
+--------------------------------+-------------+----------+-----------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------+
7 rows in set (36.18 sec)

The above example just illustrates an optimization idea. For instance, sorting a small amount of data before joining may yield unexpected results.

| username: Raymond | Original post link

Thank you, expert.

I have two thoughts:

  1. If, like in statement 3, the data scanned from the outer table is unordered, matching it with the inner table will take a long time. Can we allocate a memory area to sort the batch of data from the outer table based on the join field before matching it with the inner table? Would this improve performance?

  2. If, like in statement 1, the data from table a is matched using an index, but we still need a.O_ORDERKEY, can we sort the primary key IDs in memory? This way, we can make better use of the “next” operation when accessing the table, improving efficiency.

| username: 人如其名 | Original post link

No, because we optimized it after understanding the business characteristics. Since the custkey of the order can be found in the customer, the efficiency is not low. If most of the data in the order table cannot be found in the customer, then using the large table as the left table is inherently slow, and manual sorting would be even slower. Sorting and then looking up data in the customer table may not be continuous. Therefore, the current design of the optimizer is the most general and has no issues. This kind of data distribution situation, which can only be known during the execution phase, requires understanding the business characteristics for optimization. If further optimization is desired, the optimizer needs to know the characteristics of the association between the two tables, which means supporting multi-table association statistics views. This way, there could be more room for optimization.

| username: Raymond | Original post link

Actually, I later discovered that the semantics of statement 3 are slightly different from those of statements 1 and 2. Statement 3 has an additional a.O_ORDERDATE return field.

explain analyze a.O_ORDERKEY, a.O_CUSTKEY, a.O_ORDERDATE, b.C_NAME from orders a join customer b on a.O_CUSTKEY = b.C_CUSTKEY;

Today, I swapped the index join order of statement 3, which will be referred to as statement 4 below. In statement 4, I made table a the inner table and found that the execution time was around 8 minutes. Compared to statement 3, I believe the main time consumption is due to scanning the inner table, which requires a table lookup. Since table a has nearly 30 million rows, this cost is enormous.

mysql> explain analyze select /*+ INL_JOIN(a) */ a.O_ORDERKEY, a.O_CUSTKEY, a.O_ORDERDATE, b.C_NAME from orders a join customer b on a.O_CUSTKEY = b.C_CUSTKEY;
+----------------------------------+-------------+----------+-----------+-------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+------+
| id                               | estRows     | actRows  | task      | access object                             | execution info                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | operator info                                                                                                                                                       | memory  | disk |
+----------------------------------+-------------+----------+-----------+-------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+------+
| IndexJoin_20                     | 29955968.00 | 29955968 | root      |                                           | time:8m1.2s, loops:29255, inner:{total:39m42.8s, concurrency:5, task:125, construct:5.63s, fetch:39m26.7s, build:10.5s}, probe:11.3s                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | inner join, inner:IndexLookUp_19, outer key:tpch.customer.c_custkey, inner key:tpch.orders.o_custkey, equal cond:eq(tpch.customer.c_custkey, tpch.orders.o_custkey) | 70.9 MB | N/A  |
| ├─TableReader_29(Build)          | 3000000.00  | 3000000  | root      |                                           | time:367.5ms, loops:2942, cop_task: {num: 116, max: 1m44.6s, min: 4.91ms, avg: 3.3s, p95: 28.3s, max_proc_keys: 50144, p95_proc_keys: 50144, tot_proc: 4m52.7s, tot_wait: 13.3s, rpc_num: 117, rpc_time: 6m21.6s, copr_cache: disabled, distsql_concurrency: 15}, backoff{tikvRPC: 84ms, regionMiss: 2ms}                                                                                                                                                                                                                                                                                                                                                                                                               | data:TableFullScan_28                                                                                                                                               | 12.2 MB | N/A  |
| │ └─TableFullScan_28             | 3000000.00  | 3000000  | cop[tikv] | table:b                                   | tikv_task:{proc max:57.8s, min:4ms, avg: 2.52s, p80:624ms, p95:27.8s, iters:3387, tasks:116}, scan_detail: {total_process_keys: 3000000, total_process_keys_size: 610451426, total_keys: 3000116, get_snapshot_time: 6.26s, rocksdb: {key_skipped_count: 3000000, block: {cache_hit_count: 832, read_count: 9903, read_byte: 226.6 MB, read_time: 18.8s}}}                                                                                                                                                                                                                                                                                                                                                              | keep order:false                                                                                                                                                    | N/A     | N/A  |
| └─IndexLookUp_19(Probe)          | 29955968.00 | 29955968 | root      |                                           | time:39m15.7s, loops:29445, index_task: {total_time: 27m38.9s, fetch_handle: 22m33s, build: 4.71ms, wait: 5m5.9s}, table_task: {total_time: 1h50m6.6s, num: 1900, concurrency: 5}, next: {wait_index: 4m0.1s, wait_table_lookup_build: 6.27s, wait_table_lookup_resp: 34m41.6s}                                                                                                                                                                                                                                                                                                                                                                                                                                         |                                                                                                                                                                     | 78.3 KB | N/A  |
|   ├─IndexRangeScan_17(Build)     | 29955968.00 | 29955968 | cop[tikv] | table:a, index:index_o_custkey(O_CUSTKEY) | time:22m21.2s, loops:29913, cop_task: {num: 2610, max: 59.6s, min: 2.64ms, avg: 821ms, p95: 1.97s, max_proc_keys: 50144, p95_proc_keys: 50144, tot_proc: 26m49.2s, tot_wait: 4m18.1s, rpc_num: 2692, rpc_time: 38m34.8s, copr_cache: disabled, distsql_concurrency: 15}, backoff{regionMiss: 82ms, pdRPC: 679ms, tikvRPC: 710ms}, tikv_task:{proc max:59.3s, min:0s, avg: 616.5ms, p80:585ms, p95:1.54s, iters:39469, tasks:2610}, scan_detail: {total_process_keys: 29955968, total_process_keys_size: 1377974528, total_keys: 32958313, get_snapshot_time: 7.22s, rocksdb: {key_skipped_count: 29955968, block: {cache_hit_count: 19160979, read_count: 19599, read_byte: 235.6 MB, read_time: 4m46.7s}}}             | range: decided by [eq(tpch.orders.o_custkey, tpch.customer.c_custkey)], keep order:false                                                                            | N/A     | N/A  |
|   └─TableRowIDScan_18(Probe)     | 29955968.00 | 29955968 | cop[tikv] | table:a                                   | time:1h48m4.5s, loops:31426, cop_task: {num: 96899, max: 1m45s, min: 532.3µs, avg: 700.1ms, p95: 1.21s, max_proc_keys: 655, p95_proc_keys: 429, tot_proc: 13h47m4.2s, tot_wait: 2h26m40.7s, rpc_num: 98507, rpc_time: 20h9m0.5s, copr_cache: disabled, distsql_concurrency: 15}, backoff{tikvRPC: 2.47s, regionMiss: 2.19s, pdRPC: 267ms}, tikv_task:{proc max:59.7s, min:0s, avg: 512.4ms, p80:308ms, p95:938ms, iters:338924, tasks:96899}, scan_detail: {total_process_keys: 29955968, total_process_keys_size: 4549646548, total_keys: 29972690, get_snapshot_time: 1m29.3s, rocksdb: {key_skipped_count: 33458, block: {cache_hit_count: 115909056, read_count: 77952, read_byte: 1.34 GB, read_time: 27m18.6s}}}  | keep order:false                                                                                                                                                    | N/A     | N/A  |
+----------------------------------+-------------+----------+-----------+-------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+------+
6 rows in set (8 min 1.24 sec)

| username: 人如其名 | Original post link

TiDB’s table lookup cost is quite high, and I believe it is significantly higher than traditional databases. Looking at the metrics, it requires scanning billions of blocks to find tens of millions of records.
total_keys: 29972690
cache_hit_count: 115909056

  1. The table lookup cost is inherently high (whether in distributed or traditional databases).

  2. Table lookup cost:

  3. The traditional heap table has the lowest table lookup cost because the RID is physical, and optimized scans can find multiple RIDs in a single block (I remember this can be done, to be confirmed, thus looking up multiple records may require far fewer blocks than the number of RIDs).

  4. In traditional B+TREE (clustered index) structure tables, the RID is logical (because it is ordered), but the cost of key lookup in B+TREE is lower compared to LSMTree, requiring fewer blocks to be scanned.

  5. In LSMTree structure tables, the RID is logical (because it is ordered), and the efficiency of looking up row records by primary key key is not as good as B+Tree. Here, each lookup requires reading 4 blocks, which is quite costly (whether TiDB sorts and deduplicates keys during table lookup needs to be confirmed).

| username: Raymond | Original post link

Thank you for the reply, expert.

  1. I have an idea that I am not sure is mature. In TiDB, if we use a secondary index range scan for table lookup, the larger the scan range, the more scattered the row IDs we get. This means that during table lookup, many seeks might be used instead of next. Is it possible to sort the row IDs before table lookup so that we can use next as much as possible during the lookup?

  2. Regarding statement 4, if we only specify using index join, TiDB’s performance shows that table ‘a’ is used as the inner table. I don’t quite understand this point (of course, if we don’t specify the hint, TiDB uses hash join, which is faster).

mysql> explain analyze select /*+ INL_JOIN(a,b) */ a.O_ORDERKEY, a.O_CUSTKEY, a.O_ORDERDATE, b.C_NAME from orders a join customer b on a.O_CUSTKEY = b.C_CUSTKEY;
+----------------------------------+-------------+----------+-----------+-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------+
| id                               | estRows     | actRows  | task      | access object                             | execution info                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       | operator info                                                                                                                                                       | memory   | disk |
+----------------------------------+-------------+----------+-----------+-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------+
| IndexJoin_20                     | 29955968.00 | 29955968 | root      |                                           | time:5m11.9s, loops:29255, inner:{total:25m27.8s, concurrency:5, task:126, construct:2.48s, fetch:25m20.4s, build:4.88s}, probe:10.4s                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                | inner join, inner:IndexLookUp_19, outer key:tpch.customer.c_custkey, inner key:tpch.orders.o_custkey, equal cond:eq(tpch.customer.c_custkey, tpch.orders.o_custkey) | 69.1 MB  | N/A  |
| ├─TableReader_31(Build)          | 3000000.00  | 3000000  | root      |                                           | time:168ms, loops:2942, cop_task: {num: 116, max: 6.03s, min: 2.46ms, avg: 312.7ms, p95: 1.23s, max_proc_keys: 50144, p95_proc_keys: 50144, tot_proc: 23.5s, tot_wait: 5.25s, rpc_num: 116, rpc_time: 36.3s, copr_cache: disabled, distsql_concurrency: 15}                                                                                                                                                                                                                                                                                                                                                                                                                          | data:TableFullScan_30                                                                                                                                               | 13.0 MB  | N/A  |
| │ └─TableFullScan_30             | 3000000.00  | 3000000  | cop[tikv] | table:b                                   | tikv_task:{proc max:5.75s, min:0s, avg: 199.1ms, p80:235ms, p95:717ms, iters:3387, tasks:116}, scan_detail: {total_process_keys: 3000000, total_process_keys_size: 610451426, total_keys: 3000116, get_snapshot_time: 2.4s, rocksdb: {key_skipped_count: 3000000, block: {cache_hit_count: 7844, read_count: 3203, read_byte: 71.6 MB, read_time: 2.59s}}}                                                                                                                                                                                                                                                                                                                           | keep order:false                                                                                                                                                    | N/A      | N/A  |
| └─IndexLookUp_19(Probe)          | 29955968.00 | 29955968 | root      |                                           | time:25m16.2s, loops:29445, index_task: {total_time: 21m0.6s, fetch_handle: 19m11.8s, build: 4.73ms, wait: 1m48.7s}, table_task: {total_time: 1h0m25.6s, num: 1902, concurrency: 5}, next: {wait_index: 3m21.6s, wait_table_lookup_build: 1.34s, wait_table_lookup_resp: 21m51.1s}                                                                                                                                                                                                                                                                                                                                                                                                   |                                                                                                                                                                     | 100.3 KB | N/A  |
|   ├─IndexRangeScan_17(Build)     | 29955968.00 | 29955968 | cop[tikv] | table:a, index:index_o_custkey(O_CUSTKEY) | time:19m10.5s, loops:29913, cop_task: {num: 2639, max: 5.57s, min: 2.9ms, avg: 603ms, p95: 1.99s, max_proc_keys: 50144, p95_proc_keys: 50144, tot_proc: 18m52.7s, tot_wait: 4m20.6s, rpc_num: 2639, rpc_time: 26m31.3s, copr_cache: disabled, distsql_concurrency: 15}, tikv_task:{proc max:5.06s, min:0s, avg: 428.6ms, p80:767ms, p95:1.67s, iters:39563, tasks:2639}, scan_detail: {total_process_keys: 29955968, total_process_keys_size: 1377974528, total_keys: 32958335, get_snapshot_time: 5.7s, rocksdb: {key_skipped_count: 29955968, block: {cache_hit_count: 28190101, read_count: 4508, read_byte: 40.3 MB, read_time: 33.7s}}}                                         | range: decided by [eq(tpch.orders.o_custkey, tpch.customer.c_custkey)], keep order:false                                                                            | N/A      | N/A  |
|   └─TableRowIDScan_18(Probe)     | 29955968.00 | 29955968 | cop[tikv] | table:a                                   | time:1h0m6.3s, loops:31425, cop_task: {num: 97002, max: 4.45s, min: 529.9µs, avg: 364.6ms, p95: 1.54s, max_proc_keys: 656, p95_proc_keys: 429, tot_proc: 6h49m25.4s, tot_wait: 2h22m11.3s, rpc_num: 97032, rpc_time: 9h49m33.1s, copr_cache: disabled, distsql_concurrency: 15}, backoff{regionMiss: 56ms}, tikv_task:{proc max:4.03s, min:0s, avg: 253.6ms, p80:452ms, p95:1.26s, iters:339294, tasks:97002}, scan_detail: {total_process_keys: 29955968, total_process_keys_size: 4549646548, total_keys: 29972627, get_snapshot_time: 8.46s, rocksdb: {key_skipped_count: 33331, block: {cache_hit_count: 176784267, read_count: 15276, read_byte: 272.8 MB, read_time: 32.4s}}}  | keep order:false                                                                                                                                                    | N/A      | N/A  |
+----------------------------------+-------------+----------+-----------+-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------+
6 rows in set (5 min 11.96 sec)

However, MySQL’s execution plan shows that table ‘a’ is used as the outer table.

mysql> explain analyze select a.O_ORDERKEY, a.O_CUSTKEY, a.O_ORDERDATE, b.C_NAME from orders a join customer b on a.O_CUSTKEY = b.C_CUSTKEY;

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                                                                                                                                                                         |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Nested loop inner join  (cost=18038574.80 rows=29529477) (actual time=0.093..388788.022 rows=30000000 loops=1)
    -> Table scan on a  (cost=3488289.09 rows=29529477) (actual time=0.075..270060.577 rows=30000000 loops=1)
    -> Single-row index lookup on b using PRIMARY (C_CUSTKEY=a.O_CUSTKEY)  (cost=0.39 rows=1) (actual time=0.004..0.004 rows=1 loops=30000000)
 |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (6 min 32.82 sec)

| username: 人如其名 | Original post link

  1. The first point is the third point I mentioned above that needs to be confirmed.
  2. The join reorder of tables is already confirmed in the logical optimization phase and is rule-based. There are cases where manually specifying the leading join order results in a lower final execution plan cost, but the optimizer does not follow it. Why does MySQL not have this issue while TiDB does? I guess MySQL’s join reorder algorithm is more advanced.
| username: Billmay表妹 | Original post link

Is there a way to optimize this?

| username: 人如其名 | Original post link

If this is the issue, then join reorder can be further optimized in the product, but it is a big project. However, we can take a step back and let users specify the desired join order themselves, but this is also not very easy. Refer to the post: Hint中表连接顺序LEADING应支持多表关联优先级和兼容物理优化阶段Join算法的选择 - TiDB 的问答社区

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.