Join Reorder Does Not Convert Non-Equi Join Conditions

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: join reorder没有对等值关联条件做转换

| username: 人如其名

[TiDB Usage Environment] Test
[TiDB Version] v7.5.1

The execution efficiency of tpch’s Q5 is relatively slow, partly because the equivalent join condition is not converted. Post: 因优化器问题导致TPCH的Q5语句执行过慢 - TiDB 的问答社区

However, the content of the above post is extensive and does not focus on this issue, so here we simplify the test for this problem.
Create the relevant table structure:

drop table if exists a;
drop table if exists b;
drop table if exists c;

create table a(id int, c1 int, c2 int, primary key (id));
create table b(id int, c1 int, c2 int, primary key (id));
create table c(id int, c1 int, c2 int, primary key (id));

insert into a values (1,1,1), (2,2,2), (3,3,3);
insert into b values (1,1,1), (10,10,10), (11,11,11), (12,12,12);
insert into c values (1,1,1), (2,2,2);

analyze table a, b, c with 1 samplerate;

Check the execution plan of the SQL statement: select count(*) from a, b, c where a.id=b.id and b.id=c.id and c.c1=a.c1:

mysql> explain select count(*) from a, b, c where a.id=b.id and b.id=c.id and c.c1=a.c1;
+------------------------------------+---------+-----------+---------------+-----------------------------------------------------------------------+
| id                                 | estRows | task      | access object | operator info                                                         |
+------------------------------------+---------+-----------+---------------+-----------------------------------------------------------------------+
| StreamAgg_15                       | 1.00    | root      |               | funcs:count(1)->Column#10                                             |
| └─HashJoin_45                      | 2.00    | root      |               | inner join, equal:[eq(test.a.id, test.b.id) eq(test.c.id, test.b.id)] |
|   ├─HashJoin_27(Build)             | 2.00    | root      |               | inner join, equal:[eq(test.c.c1, test.a.c1)]                          |
|   │ ├─TableReader_30(Build)        | 2.00    | root      |               | data:Selection_29                                                     |
|   │ │ └─Selection_29               | 2.00    | cop[tikv] |               | not(isnull(test.c.c1))                                                |
|   │ │   └─TableFullScan_28         | 2.00    | cop[tikv] | table:c       | keep order:false                                                      |
|   │ └─TableReader_33(Probe)        | 3.00    | root      |               | data:Selection_32                                                     |
|   │   └─Selection_32               | 3.00    | cop[tikv] |               | not(isnull(test.a.c1))                                                |
|   │     └─TableFullScan_31         | 3.00    | cop[tikv] | table:a       | keep order:false                                                      |
|   └─TableReader_35(Probe)          | 4.00    | root      |               | data:TableFullScan_34                                                 |
|     └─TableFullScan_34             | 4.00    | cop[tikv] | table:b       | keep order:false                                                      |
+------------------------------------+---------+-----------+---------------+-----------------------------------------------------------------------+
11 rows in set (0.01 sec)

It can be seen that the optimizer first chooses a join c and then associates with b, but the condition for a join c is only: equal:[eq(test.c.c1, test.a.c1)] and does not perceive that a.id=b.id and b.id=c.id imply a.id=c.id, so it does not filter a.id=c.id in advance. Manually optimized to:

mysql> explain select count(*) from a, b, c where a.id=b.id and b.id=c.id and c.c1=a.c1 and a.id=c.id;
+------------------------------------+---------+-----------+---------------+-----------------------------------------------------------------------+
| id                                 | estRows | task      | access object | operator info                                                         |
+------------------------------------+---------+-----------+---------------+-----------------------------------------------------------------------+
| StreamAgg_15                       | 1.00    | root      |               | funcs:count(1)->Column#10                                             |
| └─HashJoin_65                      | 2.00    | root      |               | inner join, equal:[eq(test.a.id, test.b.id) eq(test.c.id, test.b.id)] |
|   ├─HashJoin_47(Build)             | 2.00    | root      |               | inner join, equal:[eq(test.c.c1, test.a.c1) eq(test.c.id, test.a.id)] |
|   │ ├─TableReader_50(Build)        | 2.00    | root      |               | data:Selection_49                                                     |
|   │ │ └─Selection_49               | 2.00    | cop[tikv] |               | not(isnull(test.c.c1))                                                |
|   │ │   └─TableFullScan_48         | 2.00    | cop[tikv] | table:c       | keep order:false                                                      |
|   │ └─TableReader_53(Probe)        | 3.00    | root      |               | data:Selection_52                                                     |
|   │   └─Selection_52               | 3.00    | cop[tikv] |               | not(isnull(test.a.c1))                                                |
|   │     └─TableFullScan_51         | 3.00    | cop[tikv] | table:a       | keep order:false                                                      |
|   └─TableReader_55(Probe)          | 4.00    | root      |               | data:TableFullScan_54                                                 |
|     └─TableFullScan_54             | 4.00    | cop[tikv] | table:b       | keep order:false                                                      |
+------------------------------------+---------+-----------+---------------+-----------------------------------------------------------------------+
11 rows in set (0.00 sec)

In complex SQL statements, it is difficult to manually perform such transformations. It is hoped that the optimizer can optimize the equivalent conversion of related fields.

| username: yytest | Original post link

  1. If you need to improve query performance in the short term, you can consider using TiDB’s Hint mechanism to influence the optimizer’s decisions. Although this is not a long-term solution, it can help you achieve better performance in specific situations. For example, you can use a Hint like the following to force the optimizer to execute the query in the way you expect:
SELECT /*+ JOIN_ORDER(a, c, b) */ COUNT(*)
FROM a, b, c
WHERE a.id = b.id AND b.id = c.id AND c.c1 = a.c1;

In this example, the JOIN_ORDER(a, c, b) Hint tells the optimizer to first join a and c, and then join with b.

| username: FutureDB | Original post link

Currently, TiDB’s optimizer is not that intelligent yet, and often it cannot infer this. It’s better to explicitly specify it in the SQL whenever possible.

| username: Raymond | Original post link

  1. From the tests of MySQL and PG, both support automatically deriving a.id=c.id.

  2. I have a question:

After associating table a with table c, and then associating with table b, it should only need to satisfy b.id = c.id or b.id=a.id. There’s no need to simultaneously check both conditions eq(test.a.id, test.b.id) and eq(test.c.id, test.b.id), right?

| username: 人如其名 | Original post link

a.c1=c.c1 and b.id=c.id
a.c1=c.c1 and b.id=c.id and a.id=b.id
Obviously different, so it’s necessary.

| username: Raymond | Original post link

What I mean is that when joining the statements, the join condition between a and c is a.c1 = c.c1 and a.id = c.id, and then to associate with table b, the condition is b.id = c.id. That’s what I mean.

| username: 人如其名 | Original post link

Then logically, it’s not necessary.

| username: Raymond | Original post link

Yes, that’s what I meant. So I think TiDB made an extra judgment there.

| username: Chad | Original post link

Currently, there are no join_order hints.