Order Issues of Inner Join and Outer Join in Join Reorder

translator_bot · June 20, 2024, 3:46pm

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: join reorder中inner join和outer join顺序问题

| username: 人如其名

[TiDB Usage Environment] Test
[TiDB Version] v7.5.1

For an SQL statement: select * from a inner join b on a.c1=b.id left join c on a.c2=c.id;
If the result set of a join b is larger than the result set of a join c, the optimizer will evaluate the execution order as: a left join c inner join b. This is common when there is a lot of duplicate data in the join condition of table b. However, if it can be determined that the result set of a inner join b is at most no larger than count(a) (for example, the join field of table b is a primary key), then performing a inner join b first is optimal in most cases.
But from the test, TiDB does not seem to optimize this part. The test is as follows:

Test SQL:

drop table if exists a;
drop table if exists b;
drop table if exists c;

create table a(id int,c1 int,c2 int,primary key (id));
create table b(id int,c1 int,c2 int,primary key (id));
create table c(id int,c1 int,c2 int,primary key (id));

insert into a values (1,1,1),(2,2,2),(3,3,3);
insert into b values (1,1,1),(10,10,10),(11,11,11),(12,12,12);
insert into c values (1,1,1),(2,2,2);

analyze table a,b,c with 1 samplerate;

Perform the following query:

select * from a inner join b on a.c1=b.id left join c on a.c2=c.id;

The desired optimal join order is a inner join b left join c, but checking the execution plan, the optimizer gives the order as a left join c inner join b:

mysql> show variables like 'tidb_enable_outer_join_reorder';
+--------------------------------+-------+
| Variable_name                  | Value |
+--------------------------------+-------+
| tidb_enable_outer_join_reorder | ON    |
+--------------------------------+-------+
1 row in set (0.00 sec)

mysql> explain select * from a inner join b on a.c1=b.id left join c on a.c2=c.id;
+-----------------------------------+---------+-----------+---------------+---------------------------------------------------------------------------------------------------+
| id                                | estRows | task      | access object | operator info                                                                                     |
+-----------------------------------+---------+-----------+---------------+---------------------------------------------------------------------------------------------------+
| Projection_12                     | 3.00    | root      |               | test.a.id, test.a.c1, test.a.c2, test.b.id, test.b.c1, test.b.c2, test.c.id, test.c.c1, test.c.c2 |
| └─HashJoin_23                     | 3.00    | root      |               | inner join, equal:[eq(test.a.c1, test.b.id)]                                                      |
|   ├─HashJoin_33(Build)            | 3.00    | root      |               | left outer join, equal:[eq(test.a.c2, test.c.id)]                                                 |
|   │ ├─TableReader_39(Build)       | 2.00    | root      |               | data:TableFullScan_38                                                                             |
|   │ │ └─TableFullScan_38          | 2.00    | cop[tikv] | table:c       | keep order:false                                                                                  |
|   │ └─TableReader_37(Probe)       | 3.00    | root      |               | data:Selection_36                                                                                 |
|   │   └─Selection_36              | 3.00    | cop[tikv] |               | not(isnull(test.a.c1))                                                                            |
|   │     └─TableFullScan_35        | 3.00    | cop[tikv] | table:a       | keep order:false                                                                                  |
|   └─TableReader_41(Probe)         | 4.00    | root      |               | data:TableFullScan_40                                                                             |
|     └─TableFullScan_40            | 4.00    | cop[tikv] | table:b       | keep order:false                                                                                  |
+-----------------------------------+---------+-----------+---------------+---------------------------------------------------------------------------------------------------+
10 rows in set (0.00 sec)

mysql> explain select /*+ leading(a,b,c) */ * from a inner join b on a.c1=b.id left join c on a.c2=c.id;
+--------------------------------+---------+-----------+---------------+---------------------------------------------------+
| id                             | estRows | task      | access object | operator info                                     |
+--------------------------------+---------+-----------+---------------+---------------------------------------------------+
| HashJoin_20                    | 3.00    | root      |               | left outer join, equal:[eq(test.a.c2, test.c.id)] |
| ├─TableReader_39(Build)        | 2.00    | root      |               | data:TableFullScan_38                             |
| │ └─TableFullScan_38           | 2.00    | cop[tikv] | table:c       | keep order:false                                  |
| └─HashJoin_32(Probe)           | 3.00    | root      |               | inner join, equal:[eq(test.a.c1, test.b.id)]      |
|   ├─TableReader_35(Build)      | 3.00    | root      |               | data:Selection_34                                 |
|   │ └─Selection_34             | 3.00    | cop[tikv] |               | not(isnull(test.a.c1))                            |
|   │   └─TableFullScan_33       | 3.00    | cop[tikv] | table:a       | keep order:false                                  |
|   └─TableReader_37(Probe)      | 4.00    | root      |               | data:TableFullScan_36                             |
|     └─TableFullScan_36         | 4.00    | cop[tikv] | table:b       | keep order:false                                  |
+--------------------------------+---------+-----------+---------------+---------------------------------------------------+
9 rows in set (0.00 sec)

From the actual data, it can be seen that if a inner join b is performed first, many invalid data can be filtered out, and then the final result can be obtained by joining with c.
However, if a left join c is performed first, it will still retain no less than the total number of records in table a, and then join with b, which overall scans more data than performing a inner join b first.

Therefore, can such an optimization be made: for a inner join b left join c, in the case of multiple inner joins and outer joins, if it can be determined that the result set of a inner join b is not higher than a, then the inner join should be performed first to filter more data earlier and reduce data scanning?

translator_bot · June 20, 2024, 3:46pm

| username: yytest | Original post link

If it can be determined that the result set of a INNER JOIN b will not exceed the total number of records in table a, then performing this join first can indeed effectively reduce the data volume, especially when the associated field in table b is the primary key. This optimization can reduce the data volume for the subsequent LEFT JOIN c operation, thereby improving the overall query efficiency.

translator_bot · June 20, 2024, 3:46pm

| username: FutureDB | Original post link

If you can reduce the data volume as early as possible before joining other tables, it is generally advisable to reduce the table data volume before joining.

translator_bot · June 20, 2024, 3:46pm

| username: Raymond | Original post link

Why is the test result obtained here that a and b join first, and then join with c in version 6.5.3?

translator_bot · June 20, 2024, 3:46pm

| username: 人如其名 | Original post link

The statistics collected immediately after creating the table are inaccurate and have been doubled by the system. Please re-collect the statistics.