Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.Original topic: 优化器未对多个来自同一张表的子查询条件做合并

[Problem Scenario Involved]
tpch fs=1;
alter table customer add index idx1(c_phone);
alter table orders add index idx1(o_orderdate);
For the statement:
SELECT
o_orderstatus
FROM
orders x
WHERE
o_orderdate = '1996-01-10'
AND o_custkey IN (
SELECT
c_custkey
FROM
customer a
WHERE
a.c_phone IN ('10-100-301-2651')
)
AND o_custkey IN (
SELECT
c_custkey
FROM
customer b
JOIN nation c ON b.c_nationkey = c.n_nationkey
WHERE
c.n_name IN ('CHINA')
);
Observe its execution plan:
mysql> set tidb_opt_insubq_to_join_and_agg=OFF;
Query OK, 0 rows affected (0.00 sec)
+------------------------------------+-----------+-----------+----------------------------------+------------------------------------------------------------------------------+
| id | estRows | task | access object | operator info |
+------------------------------------+-----------+-----------+----------------------------------+------------------------------------------------------------------------------+
| HashJoin_18 | 396.19 | root | | semi join, equal:[eq(tpch1.orders.o_custkey, tpch1.customer.c_custkey)] |
| ├─HashJoin_52(Build) | 150.00 | root | | inner join, equal:[eq(tpch1.nation.n_nationkey, tpch1.customer.c_nationkey)] |
| │ ├─TableReader_57(Build) | 0.03 | root | | data:Selection_56 |
| │ │ └─Selection_56 | 0.03 | cop[tikv] | | eq(tpch1.nation.n_name, "CHINA") |
| │ │ └─TableFullScan_55 | 25.00 | cop[tikv] | table:c | keep order:false, stats:pseudo |
| │ └─TableReader_54(Probe) | 150000.00 | root | | data:TableFullScan_53 |
| │ └─TableFullScan_53 | 150000.00 | cop[tikv] | table:b | keep order:false |
| └─HashJoin_30(Probe) | 495.23 | root | | semi join, equal:[eq(tpch1.orders.o_custkey, tpch1.customer.c_custkey)] |
| ├─IndexReader_38(Build) | 1.04 | root | | index:IndexRangeScan_37 |
| │ └─IndexRangeScan_37 | 1.04 | cop[tikv] | table:a, index:idx1(C_PHONE) | range:["10-100-301-2651","10-100-301-2651"], keep order:false |
| └─IndexLookUp_36(Probe) | 619.04 | root | | |
| ├─IndexRangeScan_34(Build) | 619.04 | cop[tikv] | table:x, index:idx1(O_ORDERDATE) | range:[1996-01-10,1996-01-10], keep order:false |
| └─TableRowIDScan_35(Probe) | 619.04 | cop[tikv] | table:x | keep order:false |
+------------------------------------+-----------+-----------+----------------------------------+------------------------------------------------------------------------------+
13 rows in set (0.01 sec)
It can be seen that the subquery for customer a is not merged with customer b, resulting in multiple associations with the customer table.
[Expected Behavior]
The manually optimized rewrite of this statement is:
SELECT
o_orderstatus
FROM
orders x
WHERE
o_orderdate = '1996-01-10'
and o_custkey IN (
SELECT
c_custkey
FROM
customer b
JOIN nation c ON b.c_nationkey = c.n_nationkey
WHERE
c.n_name IN ('CHINA') and b.c_phone IN ('10-100-301-2651')
);
Observe its execution plan:
+----------------------------------------+---------+-----------+----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id | estRows | task | access object | operator info |
+----------------------------------------+---------+-----------+----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| HashJoin_14 | 495.23 | root | | semi join, equal:[eq(tpch1.orders.o_custkey, tpch1.customer.c_custkey)] |
| ├─IndexJoin_28(Build) | 0.03 | root | | inner join, inner:TableReader_24, outer key:tpch1.customer.c_nationkey, inner key:tpch1.nation.n_nationkey, equal cond:eq(tpch1.customer.c_nationkey, tpch1.nation.n_nationkey) |
| │ ├─IndexLookUp_40(Build) | 1.04 | root | | |
| │ │ ├─IndexRangeScan_38(Build) | 1.04 | cop[tikv] | table:b, index:idx1(C_PHONE) | range:["10-100-301-2651","10-100-301-2651"], keep order:false |
| │ │ └─TableRowIDScan_39(Probe) | 1.04 | cop[tikv] | table:b | keep order:false |
| │ └─TableReader_24(Probe) | 0.00 | root | | data:Selection_23 |
| │ └─Selection_23 | 0.00 | cop[tikv] | | eq(tpch1.nation.n_name, "CHINA") |
| │ └─TableRangeScan_22 | 1.04 | cop[tikv] | table:c | range: decided by [tpch1.customer.c_nationkey], keep order:false, stats:pseudo |
| └─IndexLookUp_20(Probe) | 619.04 | root | | |
| ├─IndexRangeScan_18(Build) | 619.04 | cop[tikv] | table:x, index:idx1(O_ORDERDATE) | range:[1996-01-10,1996-01-10], keep order:false |
| └─TableRowIDScan_19(Probe) | 619.04 | cop[tikv] | table:x | keep order:false |
+----------------------------------------+---------+-----------+----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
It can be seen that the multiple associations with the customer table are reduced.
Therefore, can the optimizer be enhanced to automatically rewrite and optimize SQL to reduce repeated scans of the table?
More complex examples include: 因优化器问题导致TPCH的Q2语句执行过慢 - TiDB 的问答社区, where the and ps_supplycost
subquery is also merged, reducing repeated scans of the base table.