Filter Data Before Hash Join to Improve SQL Query Speed

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: hashjoin前过滤数据,提高sql查询速度

| username: TiDBer_s7AnS8qP

[TiDB Usage Environment] All environments
[TiDB Version]
[Reproduction Path] SQL query is slow. Example: select * from t left join bt on t.cust_no=bt.cust_no where and bt.cust_name='adad'
[Encountered Problem: Phenomenon and Impact] The query is slow. Before performing the hash join, the bt table is filtered to match only one row, but it reads all data from the t table for the hash join. Is there a way to first filter the bt table based on the query condition and then perform the hash join, as shown in the second image? However, using = does not meet business requirements, and only IN can be used. Is there any way to optimize this?
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]



| username: dba远航 | Original post link

You can consider using HINT.

| username: TiDBer_s7AnS8qP | Original post link

How do I use it?

| username: zhanggame1 | Original post link

First, analyze tables t and bt to see if there are any issues with the execution plan.

| username: 扬仔_tidb | Original post link

Can the LIKE statement here be optimized? Could it be that LIKE is causing the index to be invalid?

| username: 小龙虾爱大龙虾 | Original post link

Try using /+ INL_JOIN(t)/, provided that the table t has an index starting with the customer number.

| username: andone | Original post link

Specify the index with hnit

| username: Kongdom | Original post link

It is recommended to replace left join with inner join.
Because when there are fields from the left join table in the where condition, it is equivalent to an inner join.
The speed of inner join is much faster than left join in practice.

| username: 随缘天空 | Original post link

Try replacing the IN function with EXISTS, for example:
WHERE t.is_delete = 0
AND EXISTS (
SELECT 1
FROM potential_cust_base AS pcb
WHERE pcb.cust_no = t.cust_no
AND pcb.cust_name LIKE ‘新 晶%’
)
AND t.business_unit_code = ‘ISBC’
AND t.status IN (‘pass’);

| username: forever | Original post link

Please also share the table structure. It seems that the index design is not very reasonable.

| username: TiDBer_s7AnS8qP | Original post link

It doesn’t matter if it fails. The problem is that it returns the full data of table t and joins with table bt.

| username: 有猫万事足 | Original post link

In Figure 1, under the condition of a left join, the 61,000 rows in table t that meet the criteria are definitely all needed, right? The bt table is not performing a full table scan either. The bt table has only one row of data, and the entire tableReader operator execution takes less than 1 millisecond. The bottleneck in the entire execution plan lies in reading the 61,000 rows from table t, not in the bt table.

| username: TiDBer_s7AnS8qP | Original post link

I tried it, but it still doesn’t work :joy:. Anyway, filtering data before the join would be much faster.

| username: TiDBer_s7AnS8qP | Original post link

Yes, the same applies to joins. The issue is whether the tableReader for table t can filter based on the qualifying data from table bt instead of returning everything for the join. I’m starting to wonder if the underlying database algorithm isn’t optimized enough.

| username: TiDBer_s7AnS8qP | Original post link

The root cause is not the issue with the index.

| username: zhanggame1 | Original post link

in and exists are equivalent.

| username: 有猫万事足 | Original post link

I roughly understand your point. You mean that the bt table should first find the associated id of the t table, and then directly scan the t table with the conditions, instead of performing a hash join.

The reason why the optimizer didn’t do this, I feel, is due to the statistics. In Figure 1, based on the conditions of your bt table, the estimated data is 54,000.
When using the = join, the estimated data becomes 1 row. Then the query meets the expectation.
When using in, the statistics are still 54,000, so the execution time becomes longer again.

The solution, I feel, is to create an index and then try the index-join hint. It would be more reliable.

| username: TiDBer_s7AnS8qP | Original post link

Yes, this way it’s faster. Thank you, master!

| username: 随缘天空 | Original post link

Yes, because filtering before executing the SQL will significantly reduce the amount of data scanned. Therefore, before execution, try to filter out invalid data using conditions to reduce the scanning scope.

| username: TiDBer_s7AnS8qP | Original post link

Yes, the question is how to filter it :joy: