The LEADING hint for table join order should support multi-table association priority and be compatible with the selection of Join algorithms in the physical optimization phase

translator_bot · June 21, 2024, 8:58pm

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: Hint中表连接顺序LEADING应支持多表关联优先级和兼容物理优化阶段Join算法的选择

| username: 人如其名

[Problem Scenario Involved in the Requirement] Inconvenience in using leading during multi-table joins and lack of support for choosing join algorithms during the physical optimization phase.

[Expected Requirement Behavior] Support leading multi-table priority and compatibility with choosing join algorithms (i.e., MERGE_JOIN, INL_JOIN, INL_HASH_JOIN, HASH_JOIN).

Hope leading supports multi-table association priority

Currently, TiDB’s join reorder capability still needs further enhancement. We often encounter situations where we need to manually specify the join order to optimize statements. For example:

TiDB 的问答社区 – 15 Jun 23

因优化器问题导致TPCH的Q2语句执行过慢

🪐 TiDB 技术问题

【 TiDB 使用环境】测试【 TiDB 版本】v6.5.2 在测试TPCH的Q2语句时候发现执行的特别慢，主要原因有2点： 1、子查询虽然解关联但是没有优化成窗口函数形式，导致对相同的表做了重复的HashJoin数据查询。 2、手工修改SQL语句优化成窗口函数形式，执行效率得到大幅提升，但是并非最优，当指定hint时性能进一步提升，但是cost成本评估存在问题。因字数限制，这里只能贴图，具体执行计划可以看附件。 tpch_q2.txt (254.1 KB) ...

阅读时间: 15 mins 🕑 赞: 3 ❤

These are due to the insufficient join reorder algorithm, which did not choose a better join order. Therefore, we need to manually specify the order. However, when using leading, we found that the leading function does not support priority. For example:
select * from a, b, c where a.id = b.id and a.id = c.id
The join orders of the tables are:
(a, b), c
c, (a, b)
(a, c), b
b, (a, c)
But the leading syntax does not support priority selection. Therefore, can the leading be enhanced to support the following forms:
leading(a, (b, c))
leading((a, b), (c, d))
Can leading support multi-table association priority?

Hope leading is compatible with choosing join algorithms (i.e., MERGE_JOIN, INL_JOIN, INL_HASH_JOIN, HASH_JOIN)

Since leading occurs in the logical optimization phase, I understand that this does not conflict with physical optimization (choice of join algorithm). Why has it not been compatible with the choice of join algorithms?

translator_bot · June 21, 2024, 8:58pm

| username: 有猫万事足 | Original post link

Today, after listening to Mr. Mao’s explanation, I feel that your suggestion is indeed very practical. After confirming that leading does not support priority, I wondered if it could be achieved using STRAIGHT_JOIN along with the order of tables in the FROM clause. At least STRAIGHT_JOIN does not have the limitation of not being able to specify the join algorithm.

I tried it as shown in the picture above, and it seems to partially meet your two requirements. Since I don’t have a TPCH environment, I’m not sure if it will perform well in the TPCH scenario.

translator_bot · June 21, 2024, 8:58pm

| username: Billmay表妹 | Original post link

Is it possible to create a GitHub issue to document and report this problem?

translator_bot · June 21, 2024, 8:58pm

| username: 大钢镚13146 | Original post link

OceanBase supports this kind of multi-table join syntax.

translator_bot · June 21, 2024, 8:58pm

| username: 人如其名 | Original post link

Thank you for the reply. Leading belongs to semi-automatic mode, while straight_join is fully manual. In more complex query scenarios, if there are only local sequence issues, leading can allow for local manual adjustments while the optimizer still handles the overall adjustments, thus reducing complexity. For straight_join, specifying the entire join order can be particularly difficult for DBAs, especially in complex situations with many table associations. Moreover, for join sequences like (a, b), (c, d), it is hard to achieve the desired result without modifying the SQL, even with straight_join (many complex SQLs might be generated through assembly and do not allow modification of the SQL order, which limits the use of straight_join).

Additionally, I believe the best expression capability in this area belongs to DB2’s hint, which uses XML to express almost all scenarios that the optimizer can handle. For our simple (a, b), (c, d) association method, it can be mapped as follows:

<HSJOIN>
	<JOIN>
	    <ACCESS TABLE='a'/>
	    <LPREFETCH TABLE='b' INDEX='b_idx'/>
	</JOIN>

	<NLJOIN>
	  <ACCESS TABLE='c' FIRST='TRUE'/>
	  <IXSCAN TABLE='d'/>
	</NLJOIN>
</HSJOIN>

Explanation:

a and b are associated, the keyword allows the optimizer to choose the join method (hash join, nljoin, etc.) during the physical optimization phase, and the optimizer decides which table is the outer table. For table b, it specifies using the b_idx index and efficiently fetching all index RIDs back to the table through LPREFETCH (here LPREFETCH specifies the data request mode).
c and d are associated, the keyword forces the use of the NLJOIN algorithm, with table c as the outer table executed first (first=true), and tells table d to use an index, but lets the optimizer choose an efficient index.
Finally, (a, b) hash join (c, d) tells the optimizer to use hash join overall and choose the inner and outer tables itself.
Through the above hint, a complex statement involving tables a, b, c, and d can follow the execution plan according to our specified order, algorithm, and even data request method.

This is just one of the simplest and most common scenarios. The XML expression capability is very strong, and DB2 provides a complete XML Schema definition for optimizer hints, which is very powerful. MySQL’s hint expression capability is too weak and only compatible with MySQL’s limited scope. TiDB can also appropriately learn from DB2’s style and define its own set of expression mechanisms.