The optimizer does not merge conditions from multiple subqueries originating from the same table

translator_bot · June 21, 2024, 7:06pm

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 优化器未对多个来自同一张表的子查询条件做合并

| username: 人如其名

[Problem Scenario Involved]
tpch fs=1;

alter table customer add index idx1(c_phone);
alter table orders add index idx1(o_orderdate);

For the statement:

SELECT
  o_orderstatus
FROM
  orders x
WHERE
  o_orderdate = '1996-01-10'
  AND o_custkey IN (
    SELECT
      c_custkey
    FROM
      customer a
    WHERE
      a.c_phone IN ('10-100-301-2651')
  )
  AND o_custkey IN (
    SELECT
      c_custkey
    FROM
      customer b
      JOIN nation c ON b.c_nationkey = c.n_nationkey
    WHERE
      c.n_name IN ('CHINA')
  );

Observe its execution plan:

mysql> set tidb_opt_insubq_to_join_and_agg=OFF;
Query OK, 0 rows affected (0.00 sec)

+------------------------------------+-----------+-----------+----------------------------------+------------------------------------------------------------------------------+
| id                                 | estRows   | task      | access object                    | operator info                                                                |
+------------------------------------+-----------+-----------+----------------------------------+------------------------------------------------------------------------------+
| HashJoin_18                        | 396.19    | root      |                                  | semi join, equal:[eq(tpch1.orders.o_custkey, tpch1.customer.c_custkey)]      |
| ├─HashJoin_52(Build)               | 150.00    | root      |                                  | inner join, equal:[eq(tpch1.nation.n_nationkey, tpch1.customer.c_nationkey)] |
| │ ├─TableReader_57(Build)          | 0.03      | root      |                                  | data:Selection_56                                                            |
| │ │ └─Selection_56                 | 0.03      | cop[tikv] |                                  | eq(tpch1.nation.n_name, "CHINA")                                             |
| │ │   └─TableFullScan_55           | 25.00     | cop[tikv] | table:c                          | keep order:false, stats:pseudo                                               |
| │ └─TableReader_54(Probe)          | 150000.00 | root      |                                  | data:TableFullScan_53                                                        |
| │   └─TableFullScan_53             | 150000.00 | cop[tikv] | table:b                          | keep order:false                                                             |
| └─HashJoin_30(Probe)               | 495.23    | root      |                                  | semi join, equal:[eq(tpch1.orders.o_custkey, tpch1.customer.c_custkey)]      |
|   ├─IndexReader_38(Build)          | 1.04      | root      |                                  | index:IndexRangeScan_37                                                      |
|   │ └─IndexRangeScan_37            | 1.04      | cop[tikv] | table:a, index:idx1(C_PHONE)     | range:["10-100-301-2651","10-100-301-2651"], keep order:false                |
|   └─IndexLookUp_36(Probe)          | 619.04    | root      |                                  |                                                                              |
|     ├─IndexRangeScan_34(Build)     | 619.04    | cop[tikv] | table:x, index:idx1(O_ORDERDATE) | range:[1996-01-10,1996-01-10], keep order:false                              |
|     └─TableRowIDScan_35(Probe)     | 619.04    | cop[tikv] | table:x                          | keep order:false                                                             |
+------------------------------------+-----------+-----------+----------------------------------+------------------------------------------------------------------------------+
13 rows in set (0.01 sec)

It can be seen that the subquery for customer a is not merged with customer b, resulting in multiple associations with the customer table.
[Expected Behavior]
The manually optimized rewrite of this statement is:

SELECT
  o_orderstatus
FROM
  orders x
WHERE
  o_orderdate = '1996-01-10'
  and o_custkey IN (
    SELECT
      c_custkey
    FROM
      customer b
      JOIN nation c ON b.c_nationkey = c.n_nationkey
    WHERE
      c.n_name IN ('CHINA') and b.c_phone IN ('10-100-301-2651')
  );

Observe its execution plan:

+----------------------------------------+---------+-----------+----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id                                     | estRows | task      | access object                    | operator info                                                                                                                                                                   |
+----------------------------------------+---------+-----------+----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| HashJoin_14                            | 495.23  | root      |                                  | semi join, equal:[eq(tpch1.orders.o_custkey, tpch1.customer.c_custkey)]                                                                                                         |
| ├─IndexJoin_28(Build)                  | 0.03    | root      |                                  | inner join, inner:TableReader_24, outer key:tpch1.customer.c_nationkey, inner key:tpch1.nation.n_nationkey, equal cond:eq(tpch1.customer.c_nationkey, tpch1.nation.n_nationkey) |
| │ ├─IndexLookUp_40(Build)              | 1.04    | root      |                                  |                                                                                                                                                                                 |
| │ │ ├─IndexRangeScan_38(Build)         | 1.04    | cop[tikv] | table:b, index:idx1(C_PHONE)     | range:["10-100-301-2651","10-100-301-2651"], keep order:false                                                                                                                   |
| │ │ └─TableRowIDScan_39(Probe)         | 1.04    | cop[tikv] | table:b                          | keep order:false                                                                                                                                                                |
| │ └─TableReader_24(Probe)              | 0.00    | root      |                                  | data:Selection_23                                                                                                                                                               |
| │   └─Selection_23                     | 0.00    | cop[tikv] |                                  | eq(tpch1.nation.n_name, "CHINA")                                                                                                                                                |
| │     └─TableRangeScan_22              | 1.04    | cop[tikv] | table:c                          | range: decided by [tpch1.customer.c_nationkey], keep order:false, stats:pseudo                                                                                                  |
| └─IndexLookUp_20(Probe)                | 619.04  | root      |                                  |                                                                                                                                                                                 |
|   ├─IndexRangeScan_18(Build)           | 619.04  | cop[tikv] | table:x, index:idx1(O_ORDERDATE) | range:[1996-01-10,1996-01-10], keep order:false                                                                                                                                 |
|   └─TableRowIDScan_19(Probe)           | 619.04  | cop[tikv] | table:x                          | keep order:false                                                                                                                                                                |
+----------------------------------------+---------+-----------+----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

It can be seen that the multiple associations with the customer table are reduced.

Therefore, can the optimizer be enhanced to automatically rewrite and optimize SQL to reduce repeated scans of the table?
More complex examples include: 因优化器问题导致TPCH的Q2语句执行过慢 - TiDB 的问答社区, where the and ps_supplycost subquery is also merged, reducing repeated scans of the base table.

translator_bot · June 21, 2024, 7:06pm

| username: 像风一样的男子 | Original post link

Good suggestion, I support it.

translator_bot · June 21, 2024, 7:06pm

| username: 人如其名 | Original post link

In real-world scenarios, many SQL statements are generated by programs, and developers generally write SQL to meet business logic requirements without paying much attention to optimization. Therefore, rewrite optimization by the database is also quite important.

translator_bot · June 21, 2024, 7:06pm

| username: Kongdom | Original post link

This is a real failover. It’s clearly the developers’ non-standard development, and in the end, the DBA has to cover for it~

translator_bot · June 21, 2024, 7:06pm

| username: 啦啦啦啦啦 | Original post link

+1, I’ve seen some programs generate very silly SQL, but the developers say they can’t change it, so we can only rely on the database’s own optimization.