Specified INL_JOIN, but the execution plan does not follow

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 指定INL_JOIN,但执行计划并不走

| username: 人如其名

[TiDB Usage Environment] Poc
[TiDB Version] 6.3.0
[Encountered Problem] Not using the specified INL_JOIN, execution efficiency is too slow
[Problem Phenomenon and Impact]

The table structure is as follows:

mysql> show create table customer \G
*************************** 1. row ***************************
       Table: customer
Create Table: CREATE TABLE `customer` (
  `C_CUSTKEY` bigint(20) NOT NULL,
  `C_NAME` varchar(25) NOT NULL,
  `C_ADDRESS` varchar(40) NOT NULL,
  `C_NATIONKEY` bigint(20) NOT NULL,
  `C_PHONE` char(15) NOT NULL,
  `C_ACCTBAL` decimal(15,2) NOT NULL,
  `C_MKTSEGMENT` char(10) NOT NULL,
  `C_COMMENT` varchar(117) NOT NULL,
  PRIMARY KEY (`C_CUSTKEY`) /*T![clustered_index] CLUSTERED */,
  KEY `customer_idx1` (`C_PHONE`),
  KEY `customer_idx2` (`C_NATIONKEY`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin
1 row in set (0.00 sec)

mysql> show create table orders \G
*************************** 1. row ***************************
       Table: orders
Create Table: CREATE TABLE `orders` (
  `O_ORDERKEY` bigint(20) NOT NULL,
  `O_CUSTKEY` bigint(20) NOT NULL,
  `O_ORDERSTATUS` char(1) NOT NULL,
  `O_TOTALPRICE` decimal(15,2) NOT NULL,
  `O_ORDERDATE` date NOT NULL,
  `O_ORDERPRIORITY` char(15) NOT NULL,
  `O_CLERK` char(15) NOT NULL,
  `O_SHIPPRIORITY` bigint(20) NOT NULL,
  `O_COMMENT` varchar(79) NOT NULL,
  PRIMARY KEY (`O_ORDERKEY`) /*T![clustered_index] CLUSTERED */,
  KEY `orders_idx1` (`O_ORDERDATE`),
  KEY `o_custkey` (`O_CUSTKEY`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin
1 row in set (0.00 sec)

The execution plan of the following SQL statement:

mysql> explain select a.C_NAME,b.price from tpch.customer a left join (select o_custkey,sum(O_TOTALPRICE) as price from tpch.orders group by o_custkey) b on a.C_CUSTKEY=b.O_CUSTKEY where a.C_PHONE='23-768-687-3665';
+------------------------------------+-------------+-----------+---------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+
| id                                 | estRows     | task      | access object                         | operator info                                                                                                                 |
+------------------------------------+-------------+-----------+---------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+
| HashJoin_13                        | 1.02        | root      |                                       | left outer join, equal:[eq(tpch.customer.c_custkey, tpch.orders.o_custkey)]                                                   |
| ├─IndexLookUp_31(Build)            | 1.02        | root      |                                       |                                                                                                                               |
| │ ├─IndexRangeScan_29(Build)       | 1.02        | cop[tikv] | table:a, index:customer_idx1(C_PHONE) | range:["23-768-687-3665","23-768-687-3665"], keep order:false                                                                 |
| │ └─TableRowIDScan_30(Probe)       | 1.02        | cop[tikv] | table:a                               | keep order:false                                                                                                              |
| └─HashAgg_38(Probe)                | 1009664.00  | root      |                                       | group by:tpch.orders.o_custkey, funcs:sum(Column#21)->Column#18, funcs:firstrow(tpch.orders.o_custkey)->tpch.orders.o_custkey |
|   └─TableReader_39                 | 1009664.00  | root      |                                       | data:HashAgg_32                                                                                                               |
|     └─HashAgg_32                   | 1009664.00  | cop[tikv] |                                       | group by:tpch.orders.o_custkey, funcs:sum(tpch.orders.o_totalprice)->Column#21                                                |
|       └─TableFullScan_37           | 15000000.00 | cop[tikv] | table:orders                          | keep order:false                                                                                                              |
+------------------------------------+-------------+-----------+---------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+
8 rows in set (0.01 sec)

I understand that the optimal execution plan path is to first filter table a through the customer_idx1 index, then use INL_JOIN to join table b (using the o_custkey index of table b), and then use the o_custkey index of table b to perform a sum(O_TOTALPRICE) using index back table. This would be the optimal execution path, but in reality, it does not follow this path. So I manually added a hint to force it to follow this path:

mysql> explain select /*+ INL_JOIN(a,b) */ a.C_NAME,b.price from tpch.customer a left join (select o_custkey,sum(O_TOTALPRICE) as price from tpch.orders group by o_custkey) b on a.C_CUSTKEY=b.O_CUSTKEY where a.C_PHONE='23-768-687-3665';
+------------------------------------+-------------+-----------+---------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+
| id                                 | estRows     | task      | access object                         | operator info                                                                                                                 |
+------------------------------------+-------------+-----------+---------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+
| HashJoin_12                        | 1.02        | root      |                                       | left outer join, equal:[eq(tpch.customer.c_custkey, tpch.orders.o_custkey)]                                                   |
| ├─IndexLookUp_30(Build)            | 1.02        | root      |                                       |                                                                                                                               |
| │ ├─IndexRangeScan_28(Build)       | 1.02        | cop[tikv] | table:a, index:customer_idx1(C_PHONE) | range:["23-768-687-3665","23-768-687-3665"], keep order:false                                                                 |
| │ └─TableRowIDScan_29(Probe)       | 1.02        | cop[tikv] | table:a                               | keep order:false                                                                                                              |
| └─HashAgg_37(Probe)                | 1009664.00  | root      |                                       | group by:tpch.orders.o_custkey, funcs:sum(Column#21)->Column#18, funcs:firstrow(tpch.orders.o_custkey)->tpch.orders.o_custkey |
|   └─TableReader_38                 | 1009664.00  | root      |                                       | data:HashAgg_31                                                                                                               |
|     └─HashAgg_31                   | 1009664.00  | cop[tikv] |                                       | group by:tpch.orders.o_custkey, funcs:sum(tpch.orders.o_totalprice)->Column#21                                                |
|       └─TableFullScan_36           | 15000000.00 | cop[tikv] | table:orders                          | keep order:false                                                                                                              |
+------------------------------------+-------------+-----------+---------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+
8 rows in set, 2 warnings (0.00 sec)

mysql> show warnings;
+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| Level   | Code | Message                                                                                                                                             |
+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| Warning | 1815 | There are no matching table names for (b) in optimizer hint /*+ INL_JOIN(a, b) */ or /*+ TIDB_INLJ(a, b) */ Maybe you can use the table alias name |
| Warning | 1815 | Optimizer Hint /*+ INL_JOIN(a, b) */ or /*+ TIDB_INLJ(a, b) */ is inapplicable                                                                      |
+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

As you can see, it does not follow the expected result. The warnings indicate that this execution plan cannot be specified in this way.

Next, we execute to check the time taken (to avoid hash_agg causing OOM, hash_agg is performed in non-parallel mode here):

mysql> set tidb_hashagg_partial_concurrency=1;
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> select a.C_NAME,b.price from tpch.customer a left join (select o_custkey,sum(O_TOTALPRICE) as price from tpch.orders group by o_custkey) b on a.C_CUSTKEY=b.O_CUSTKEY where a.C_PHONE='23-768-687-3665';
+--------------------+-----------+
| C_NAME             | price     |
+--------------------+-----------+
| Customer#000000002 | 481044.54 |
+--------------------+-----------+
1 row in set (10.86 sec)

It can be seen that it took more than 10 seconds to execute.
I imported the same data into DB2 to observe its execution plan and execution time (due to the large output of the execution plan, only partial key information is posted here):

Optimizer Plan:

                        Rows   
                      Operator 
                        (ID)   
                        Cost   
                              
                         1    
                      RETURN  
                       ( 1)   
                      63.7538 
                        |     
                         1    
                       GRPBY  
                       ( 2)   
                      63.7531 
                        |     
                      9.95056 
                      NLJOIN  
                       ( 3)   
                      63.7507 
                  /--/       \---\
          0.995056                  10    
           TBSCAN                  FETCH  
            ( 4)                   (--)   
          20.4589                 43.2918 
             |                   /       \
          0.995056             10     1.5e+07 
            SORT             RIDSCN   Table:  
            ( 5)              ( 9)    TPCH    
          20.4575            13.6895  ORDERS  
             |                 |     
          0.995056             10    
           FETCH              SORT   
            ( 6)              (10)   
           20.455            13.6884 
         /        \            |    
    0.995056    1.5e+06        10   
     IXSCAN     Table:       IXSCAN 
      ( 7)      TPCH          (11)  
    13.6718     CUSTOMER     13.685 
      |                       |       
  1.49996e+06               999982    
 Index:                   Index:      
 TPCH                     TPCH        
 CUSTOMER_IDX1            ORDERS_IDX2 


[db2inst1@host1 ~]$ time db2 "select a.C_NAME,b.price from tpch.customer a left join (select o_custkey,sum(O_TOTALPRICE) as price from tpch.orders group by o_custkey) b on a.C_CUSTKEY=b.O_CUSTKEY where a.C_PHONE='23-768-687-3665'"                    

C_NAME                    PRICE                            
------------------------- ---------------------------------
Customer#000000002                                481044.54

  1 record(s) selected.


real    0m0.034s
user    0m0.008s
sys     0m0.018s

It can be seen that DB2 automatically used the correct execution plan and is very efficient.

So the question here is, in the presence of indexes, why does manually specifying the hint to use INL_JOIN not work and show that this join is not applicable?

| username: buddyyuan | Original post link

You can try to modify it like this.

mysql> explain with b as (select id,count(1) as num from sbtest2 t2 group by t2.id) select a.id ,b.num from sbtest1 a, b where a.id=b.id;
+--------------------------------+----------+-----------+------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+
| id                             | estRows  | task      | access object          | operator info                                                                                                                                   |
+--------------------------------+----------+-----------+------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection_19                  | 44513.60 | root      |                        | sbtest.sbtest1.id, Column#11                                                                                                                    |
| └─IndexHashJoin_26             | 44513.60 | root      |                        | inner join, inner:TableReader_21, outer key:sbtest.sbtest2.id, inner key:sbtest.sbtest1.id, equal cond:eq(sbtest.sbtest2.id, sbtest.sbtest1.id) |
|   ├─Selection_31(Build)        | 44513.60 | root      |                        | not(isnull(sbtest.sbtest2.id))                                                                                                                  |
|   │ └─CTEFullScan_32           | 55642.00 | root      | CTE:b                  | data:CTE_0                                                                                                                                      |
|   └─TableReader_21(Probe)      | 1.00     | root      |                        | data:TableRangeScan_20                                                                                                                          |
|     └─TableRangeScan_20        | 1.00     | cop[tikv] | table:a                | range: decided by [sbtest.sbtest2.id], keep order:false                                                                                         |
| CTE_0                          | 55642.00 | root      |                        | Non-Recursive CTE                                                                                                                               |
| └─Projection_11(Seed Part)     | 55642.00 | root      |                        | sbtest.sbtest2.id, 1->Column#5                                                                                                                  |
|   └─IndexReader_15             | 55642.00 | root      |                        | index:IndexFullScan_14                                                                                                                          |
|     └─IndexFullScan_14         | 55642.00 | cop[tikv] | table:t2, index:k_2(k) | keep order:false                                                                                                                                |
+--------------------------------+----------+-----------+------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+
10 rows in set (0.01 sec)
| username: cs58_dba | Original post link

Does the syntax of 8.0 also support this?

| username: buddyyuan | Original post link

8.0? I am using TiDB version 5.1.4.

| username: cs58_dba | Original post link

The “WITH” clause is a syntax introduced in MySQL 8.0.

| username: 人如其名 | Original post link

Looking at the execution plan, the CTE is executed separately first, which is still very inefficient. Actually, my original business statement used the WITH structure, which I stripped out.

| username: 近墨者zyl | Original post link

Collect the statistics again?

| username: 人如其名 | Original post link

The statistics are accurate, and I also specified a hint.

| username: buddyyuan | Original post link

At present, it seems that it is not possible. The CTE part needs to be executed independently.

mysql> explain with b as (select id, sum(k) as num from sbtest1 group by id), a as (select * from sbtest2) select a.id, b.num from a, b where a.id = b.id;
+--------------------------------+-------------+-----------+-----------------------------+---------------------------------------------------------------------------+
| id                             | estRows     | task      | access object               | operator info                                                             |
+--------------------------------+-------------+-----------+-----------------------------+---------------------------------------------------------------------------+
| HashJoin_28                    | 641.82      | root      |                             | inner join, equal:[eq(sbtest.sbtest2.id, sbtest.sbtest1.id)]              |
| ├─Selection_29(Build)          | 640.00      | root      |                             | not(isnull(sbtest.sbtest2.id))                                            |
| │ └─CTEFullScan_30             | 800.00      | root      | CTE:a                       | data:CTE_1                                                                |
| └─Selection_31(Probe)          | 8000000.00  | root      |                             | not(isnull(sbtest.sbtest1.id))                                            |
|   └─CTEFullScan_32             | 10000000.00 | root      | CTE:b                       | data:CTE_0                                                                |
| CTE_1                          | 800.00      | root      |                             | Non-Recursive CTE                                                         |
| └─Selection_14(Seed Part)      | 800.00      | root      |                             | not(isnull(sbtest.sbtest2.id))                                            |
|   └─TableReader_17             | 1000.00     | root      |                             | data:TableFullScan_16                                                     |
|     └─TableFullScan_16         | 1000.00     | cop[tikv] | table:sbtest2               | keep order:false                                                          |
| CTE_0                          | 10000000.00 | root      |                             | Non-Recursive CTE                                                         |
| └─Projection_20(Seed Part)     | 10000000.00 | root      |                             | sbtest.sbtest1.id, cast(sbtest.sbtest1.k, decimal(32,0) BINARY)->Column#5 |
|   └─IndexReader_24             | 10000000.00 | root      |                             | index:IndexFullScan_23                                                    |
|     └─IndexFullScan_23         | 10000000.00 | cop[tikv] | table:sbtest1, index:k_1(k) | keep order:false                                                          |
+--------------------------------+-------------+-----------+-----------------------------+---------------------------------------------------------------------------+
13 rows in set (0.00 sec)
| username: 人如其名 | Original post link

From this perspective, it seems that SQL statements in the form of select a, sum(b) from table group by a do not use indexes, and there are no warnings either.

mysql> explain select /*+ USE_INDEX(@sel_2 x@sel_2,o_custkey) */ * from (select o_custkey, sum(O_TOTALPRICE) as price from tpch.orders x group by o_custkey) as b;
+------------------------------+-------------+-----------+---------------+-------------------------------------------------------------------------------------------------------------------------------+
| id                           | estRows     | task      | access object | operator info                                                                                                                 |
+------------------------------+-------------+-----------+---------------+-------------------------------------------------------------------------------------------------------------------------------+
| Projection_10                | 1009664.00  | root      |               | tpch.orders.o_custkey, Column#10                                                                                              |
| └─HashAgg_14                 | 1009664.00  | root      |               | group by:tpch.orders.o_custkey, funcs:sum(Column#11)->Column#10, funcs:firstrow(tpch.orders.o_custkey)->tpch.orders.o_custkey |
|   └─TableReader_15           | 1009664.00  | root      |               | data:HashAgg_16                                                                                                               |
|     └─HashAgg_16             | 1009664.00  | cop[tikv] |               | group by:tpch.orders.o_custkey, funcs:sum(tpch.orders.o_totalprice)->Column#11                                                |
|       └─TableFullScan_13     | 15000000.00 | cop[tikv] | table:x       | keep order:false                                                                                                              |
+------------------------------+-------------+-----------+---------------+-------------------------------------------------------------------------------------------------------------------------------+
5 rows in set (0.00 sec)


mysql> explain select  * from (select /*+ USE_INDEX(x,o_custkey) */ o_custkey, sum(O_TOTALPRICE) as price from tpch.orders x group by o_custkey) as b;
+------------------------------+-------------+-----------+---------------+-------------------------------------------------------------------------------------------------------------------------------+
| id                           | estRows     | task      | access object | operator info                                                                                                                 |
+------------------------------+-------------+-----------+---------------+-------------------------------------------------------------------------------------------------------------------------------+
| Projection_10                | 1009664.00  | root      |               | tpch.orders.o_custkey, Column#10                                                                                              |
| └─HashAgg_14                 | 1009664.00  | root      |               | group by:tpch.orders.o_custkey, funcs:sum(Column#11)->Column#10, funcs:firstrow(tpch.orders.o_custkey)->tpch.orders.o_custkey |
|   └─TableReader_15           | 1009664.00  | root      |               | data:HashAgg_16                                                                                                               |
|     └─HashAgg_16             | 1009664.00  | cop[tikv] |               | group by:tpch.orders.o_custkey, funcs:sum(tpch.orders.o_totalprice)->Column#11                                                |
|       └─TableFullScan_13     | 15000000.00 | cop[tikv] | table:x       | keep order:false                                                                                                              |
+------------------------------+-------------+-----------+---------------+-------------------------------------------------------------------------------------------------------------------------------+
5 rows in set (0.00 sec)

mysql> explain select /*+ USE_INDEX(@sel_2 x@sel_2,o_custkey) */ * from (select o_custkey from tpch.orders x group by o_custkey) as b;
+----------------------------+-------------+-----------+-------------------------------------+----------------------------------------------------------------------------------------------+
| id                         | estRows     | task      | access object                       | operator info                                                                                |
+----------------------------+-------------+-----------+-------------------------------------+----------------------------------------------------------------------------------------------+
| HashAgg_18                 | 1009664.00  | root      |                                     | group by:tpch.orders.o_custkey, funcs:firstrow(tpch.orders.o_custkey)->tpch.orders.o_custkey |
| └─IndexReader_19           | 1009664.00  | root      |                                     | index:HashAgg_20                                                                             |
|   └─HashAgg_20             | 1009664.00  | cop[tikv] |                                     | group by:tpch.orders.o_custkey,                                                              |
|     └─IndexFullScan_17     | 15000000.00 | cop[tikv] | table:x, index:o_custkey(O_CUSTKEY) | keep order:false                                                                             |
+----------------------------+-------------+-----------+-------------------------------------+----------------------------------------------------------------------------------------------+
4 rows in set (0.00 sec)

The o_custkey has a low repetition rate, and it does not use indexes here either.

Total number of records in the table: 15 million
mysql> select * from mysql.stats_meta where table_id = (select tidb_table_id from information_schema.tables a where a.table_name='orders' and a.table_schema='tpch');
+--------------------+----------+--------------+----------+--------------------+
| version            | table_id | modify_count | count    | snapshot           |
+--------------------+----------+--------------+----------+--------------------+
| 436371908666064898 |      129 |            0 | 15000000 | 436371895938449409 |
+--------------------+----------+--------------+----------+--------------------+
1 row in set (0.02 sec)

The cardinality of the O_CUSTKEY field is 1 million, and the selectivity is basically 1/15
mysql> select distinct_count, null_count, modify_count from stats_histograms where (table_id, hist_id) in (select tidb_table_id, ORDINAL_POSITION from information_schema.tables a, information_schema.columns b where a.table_name=b.table_name and b.table_name='orders' and b.COLUMN_NAME='O_CUSTKEY' and b.TABLE_SCHEMA='tpch');
+----------------+------------+--------------+
| distinct_count | null_count | modify_count |
+----------------+------------+--------------+
|        1009664 |          0 |            0 |
+----------------+------------+--------------+
1 row in set (0.06 sec)

In fact, if it uses indexes here, the efficiency should be very high, but it still does not. One possible reason why the above join condition does not use INL_JOIN is that the b table cannot utilize indexes. See the following SQL execution plan (does not use indexes):
mysql> explain select o_custkey, sum(O_TOTALPRICE) as price from tpch.orders x where o_custkey=369001 group by o_custkey;
+--------------------------------+-------------+-----------+---------------+-------------------------------------------------------------------------------------------------------------------------------+
| id                             | estRows     | task      | access object | operator info                                                                                                                 |
+--------------------------------+-------------+-----------+---------------+-------------------------------------------------------------------------------------------------------------------------------+
| Projection_10                  | 807731.20   | root      |               | tpch.orders.o_custkey, Column#10                                                                                              |
| └─HashAgg_15                   | 807731.20   | root      |               | group by:tpch.orders.o_custkey, funcs:sum(Column#11)->Column#10, funcs:firstrow(tpch.orders.o_custkey)->tpch.orders.o_custkey |
|   └─TableReader_16             | 807731.20   | root      |               | data:HashAgg_17                                                                                                               |
|     └─HashAgg_17               | 807731.20   | cop[tikv] |               | group by:tpch.orders.o_custkey, funcs:sum(tpch.orders.o_totalprice)->Column#11                                                |
|       └─Selection_13           | 12000000.00 | cop[tikv] |               | eq(tpch.orders.o_custkey, 369001)                                                                                             |
|         └─TableFullScan_14     | 15000000.00 | cop[tikv] | table:x       | keep order:false                                                                                                              |
+--------------------------------+-------------+-----------+---------------+-------------------------------------------------------------------------------------------------------------------------------+
6 rows in set (0.00 sec)

Is this a defect of the optimizer?
| username: 人如其名 | Original post link

So this doesn’t make much sense, CTE can’t solve this problem.

| username: xuyifangreeneyes | Original post link

The reason why this query is not effective even with INL_JOIN is that currently IndexJoin only supports the inner side being a DataSource, and does not support the inner side being Agg->DataSource, while b is in the form of Agg->DataSource.

| username: xuyifangreeneyes | Original post link

One workaround is to rewrite the left join of this query into a subquery form and prevent the optimizer from decorrelating the subquery. In this case, the Apply operator is used to execute (fetch a row from a and then find matching rows in b), which can utilize the o_custkey index of orders.

mysql> explain select
    ->     a.C_NAME,
    ->     (
    ->         select
    ->             sum(O_TOTALPRICE)
    ->         from
    ->             tpch.orders b
    ->         where
    ->             b.O_CUSTKEY = a.C_CUSTKEY
    ->         group by
    ->             o_custkey
    ->     ) as price
    -> from
    ->     tpch.customer a use index(customer_idx1)
    -> where
    ->     a.C_PHONE = '23-768-687-3665';
+------------------------------------------+---------+-----------+---------------------------------------+-------------------------------------------------------------------------------------------------------+
| id                                       | estRows | task      | access object                         | operator info                                                                                         |
+------------------------------------------+---------+-----------+---------------------------------------+-------------------------------------------------------------------------------------------------------+
| Projection_11                            | 0.00    | root      |                                       | tpch.customer.c_name, Column#27                                                                       |
| └─Apply_13                               | 0.00    | root      |                                       | CARTESIAN left outer join                                                                             |
|   ├─IndexLookUp_16(Build)                | 1.00    | root      |                                       |                                                                                                       |
|   │ ├─IndexRangeScan_14(Build)           | 1.00    | cop[tikv] | table:a, index:customer_idx1(C_PHONE) | range:["23-768-687-3665","23-768-687-3665"], keep order:false, stats:pseudo                           |
|   │ └─TableRowIDScan_15(Probe)           | 1.00    | cop[tikv] | table:a                               | keep order:false, stats:pseudo                                                                        |
|   └─MaxOneRow_17(Probe)                  | 1.00    | root      |                                       |                                                                                                       |
|     └─StreamAgg_22                       | 2.00    | root      |                                       | group by:tpch.orders.o_custkey, funcs:sum(tpch.orders.o_totalprice)->Column#27                        |
|       └─Projection_40                    | 2.50    | root      |                                       | tpch.orders.o_custkey, tpch.orders.o_totalprice                                                       |
|         └─IndexLookUp_39                 | 2.50    | root      |                                       |                                                                                                       |
|           ├─IndexRangeScan_37(Build)     | 2.50    | cop[tikv] | table:b, index:o_custkey(O_CUSTKEY)   | range: decided by [eq(tpch.orders.o_custkey, tpch.customer.c_custkey)], keep order:true, stats:pseudo |
|           └─TableRowIDScan_38(Probe)     | 2.50    | cop[tikv] | table:b                               | keep order:false, stats:pseudo                                                                        |
+------------------------------------------+---------+-----------+---------------------------------------+-------------------------------------------------------------------------------------------------------+
11 rows in set (0.00 sec)

Whether or not to add group by o_custkey in the subquery, the semantics are the same. Adding group by o_custkey here is to prevent the optimizer from decorrelating (converting Apply to Join).

Alternatively, after the introduction of the NO_DECORRELATE hint in version 6.4, you can remove group by o_custkey and add this hint to avoid decorrelation.

mysql> explain select
    ->     a.C_NAME,
    ->     (
    ->         select /*+ NO_DECORRELATE() */
    ->             sum(O_TOTALPRICE)
    ->         from
    ->             tpch.orders b
    ->         where
    ->             b.O_CUSTKEY = a.C_CUSTKEY
    ->     ) as price
    -> from
    ->     tpch.customer a use index(customer_idx1)
    -> where
    ->     a.C_PHONE = '23-768-687-3665';
+----------------------------------------+---------+-----------+---------------------------------------+--------------------------------------------------------------------------------------------------------+
| id                                     | estRows | task      | access object                         | operator info                                                                                          |
+----------------------------------------+---------+-----------+---------------------------------------+--------------------------------------------------------------------------------------------------------+
| Projection_11                          | 0.00    | root      |                                       | tpch.customer.c_name, Column#27                                                                        |
| └─Apply_13                             | 0.00    | root      |                                       | CARTESIAN left outer join                                                                              |
|   ├─IndexLookUp_16(Build)              | 1.00    | root      |                                       |                                                                                                        |
|   │ ├─IndexRangeScan_14(Build)         | 1.00    | cop[tikv] | table:a, index:customer_idx1(C_PHONE) | range:["23-768-687-3665","23-768-687-3665"], keep order:false, stats:pseudo                            |
|   │ └─TableRowIDScan_15(Probe)         | 1.00    | cop[tikv] | table:a                               | keep order:false, stats:pseudo                                                                         |
|   └─MaxOneRow_17(Probe)                | 1.00    | root      |                                       |                                                                                                        |
|     └─HashAgg_29                       | 1.00    | root      |                                       | funcs:sum(Column#30)->Column#27                                                                        |
|       └─IndexLookUp_30                 | 1.00    | root      |                                       |                                                                                                        |
|         ├─IndexRangeScan_27(Build)     | 10.00   | cop[tikv] | table:b, index:o_custkey(O_CUSTKEY)   | range: decided by [eq(tpch.orders.o_custkey, tpch.customer.c_custkey)], keep order:false, stats:pseudo |
|         └─HashAgg_19(Probe)            | 1.00    | cop[tikv] |                                       | funcs:sum(tpch.orders.o_totalprice)->Column#30                                                         |
|           └─TableRowIDScan_28          | 10.00   | cop[tikv] | table:b                               | keep order:false, stats:pseudo                                                                         |
+----------------------------------------+---------+-----------+---------------------------------------+--------------------------------------------------------------------------------------------------------+
11 rows in set (0.00 sec)
| username: 人如其名 | Original post link

Thank you very much for the explanation, it is very clear. As a follow-up question, will the inner side of this index join also consider supporting comparisons like Agg->DataSource in the future?

| username: xuyifangreeneyes | Original post link

In fact, the execution methods of index join and apply are similar, both in the form of nested loops, with index join being more efficient. For subqueries in the form of Agg->DataSource that cannot use index join, using the NO_DECORRELATE hint to use the apply operator can solve most problems. However, for queries like this that are originally joins, you need to rewrite them into subquery form to use apply. It might be better to allow index join to support the Agg->DataSource form on the inner side.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.

| username: 人如其名 | Original post link

To add to this, the following PR has already been fixed: planner, executor: support aggregation as inner side of index join by elsa0520 · Pull Request #51354 · pingcap/tidb · GitHub
It should be reflected in versions after 8.0.

| username: 洪七表哥 | Original post link

Learned.