Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: TIDB hashjoin与nestjoin【indexjoin】的性能差异
[TiDB Usage Environment] Production Environment
[TiDB Version] V6.5.8
[Encountered Problem: Problem Description and Impact] From the community documentation, I have a preliminary understanding of hash join and index join [用 EXPLAIN 查看 JOIN 查询的执行计划 | PingCAP 文档中心]. Has anyone compared the performance of these two types of joins? Scenario: The number of rows in the associated fields ranges from 100,000 to several million. Theoretically, hash join should definitely perform better [without considering memory consumption], but I am not sure how significant the performance difference is.
You already have the data, just add a hint explain and compare to find out.
I don’t have any data here yet, just wanted to ask if anyone has business data for this kind of scenario, so I can understand the performance differences.
It depends on the data distribution, indexing, and query conditions, etc.
Not necessarily, different scenarios yield different performance.
Generally speaking, when a join operation involves a large amount of data, Hash Join is usually a good choice because it can fully utilize hardware resources to accelerate the processing. When a join operation involves a large table and a small table, Nested Join is usually a better choice because it can effectively use indexes to reduce the amount of scanned data.
Even if it goes wrong, Hash Join is usually an acceptable result, whereas Index Join might lead to complaints from the business.
The SQL optimizer needs to determine the join order of the data tables and decide which join algorithm is the most efficient for a specific SQL statement. First, create a few related tables, test once without indexes, then test again with indexes. The specific outcome also depends on which algorithm the optimizer chooses, which can vary due to differences in cluster deployment.
Intermediate data writing to disk is slow.
You can just test it directly and get the results.
This conclusion makes sense, haha.
For two tables with tens of millions of rows, it is not surprising that the speed of a hash join is twice as fast as an index join.
The number of tables matters.
- Hash Join:
- Principle: Hash Join constructs hash tables for the two datasets to be joined and then matches them using hash values. It is suitable for joining large datasets.
- Advantages: High efficiency for joining large datasets, especially when there is sufficient memory, as it can reduce disk I/O overhead.
- Disadvantages: If the data cannot be fully loaded into memory, it may lead to frequent disk reads and writes, affecting performance.
- Index Join (Nested Loop Join):
- Principle: Index Join traverses the index of one table and uses the index values to find matching rows in another table. It is suitable for joining small datasets.
- Advantages: Suitable when one of the datasets is small, does not require building additional data structures, and can quickly locate matching rows using the index.
- Disadvantages: When the data volume is large, the performance of Index Join may be limited by disk I/O, as each lookup requires disk access.
In practical applications, the choice of join algorithm depends on factors such as data scale, data distribution, and index design. Generally, if both datasets to be joined are relatively large and there is enough memory, Hash Join may be more suitable; whereas if one of the datasets is small and there is appropriate index support, Index Join may perform better.
Let me generate some data for testing.
Share the test results once you have them.
Absolutely, I will conduct stress tests on related fields in four scenarios: 10,000, 100,000, 1,000,000, and 10,000,000.
The larger the amount of associated data, the better the hash join.