Performance Differences Between TiDB Hash Join and Nested Join (Index Join)

translator_bot · June 21, 2024, 1:40am

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIDB hashjoin与nestjoin【indexjoin】的性能差异

| username: residentevil

[TiDB Usage Environment] Production Environment
[TiDB Version] V6.5.8
[Encountered Problem: Problem Description and Impact] From the community documentation, I have a preliminary understanding of hash join and index join [用 EXPLAIN 查看 JOIN 查询的执行计划 | PingCAP 文档中心]. Has anyone compared the performance of these two types of joins? Scenario: The number of rows in the associated fields ranges from 100,000 to several million. Theoretically, hash join should definitely perform better [without considering memory consumption], but I am not sure how significant the performance difference is.

translator_bot · June 21, 2024, 1:40am

| username: onlyacat | Original post link

You already have the data, just add a hint explain and compare to find out.

translator_bot · June 21, 2024, 1:40am

| username: residentevil | Original post link

I don’t have any data here yet, just wanted to ask if anyone has business data for this kind of scenario, so I can understand the performance differences.

translator_bot · June 21, 2024, 1:40am

| username: redgame | Original post link

It depends on the data distribution, indexing, and query conditions, etc.

translator_bot · June 21, 2024, 1:40am

| username: TiDBer_aaO4sU46 | Original post link

Not necessarily, different scenarios yield different performance.

translator_bot · June 21, 2024, 1:40am

| username: TiDBer_ivan0927 | Original post link

Generally speaking, when a join operation involves a large amount of data, Hash Join is usually a good choice because it can fully utilize hardware resources to accelerate the processing. When a join operation involves a large table and a small table, Nested Join is usually a better choice because it can effectively use indexes to reduce the amount of scanned data.

translator_bot · June 21, 2024, 1:40am

| username: 江湖故人 | Original post link

Even if it goes wrong, Hash Join is usually an acceptable result, whereas Index Join might lead to complaints from the business.

translator_bot · June 21, 2024, 1:40am

| username: TiDBer_小阿飞 | Original post link

The SQL optimizer needs to determine the join order of the data tables and decide which join algorithm is the most efficient for a specific SQL statement. First, create a few related tables, test once without indexes, then test again with indexes. The specific outcome also depends on which algorithm the optimizer chooses, which can vary due to differences in cluster deployment.

translator_bot · June 21, 2024, 1:40am

| username: h5n1 | Original post link

Intermediate data writing to disk is slow.

translator_bot · June 21, 2024, 1:40am

| username: DBAER | Original post link

You can just test it directly and get the results.

translator_bot · June 21, 2024, 1:40am

| username: residentevil | Original post link

This conclusion makes sense, haha.

translator_bot · June 21, 2024, 1:40am

| username: Raymond | Original post link

For two tables with tens of millions of rows, it is not surprising that the speed of a hash join is twice as fast as an index join.

translator_bot · June 21, 2024, 1:40am

| username: YuchongXU | Original post link

The number of tables matters.

translator_bot · June 21, 2024, 1:40am

| username: 哈喽沃德 | Original post link

Hash Join:

Principle: Hash Join constructs hash tables for the two datasets to be joined and then matches them using hash values. It is suitable for joining large datasets.
Advantages: High efficiency for joining large datasets, especially when there is sufficient memory, as it can reduce disk I/O overhead.
Disadvantages: If the data cannot be fully loaded into memory, it may lead to frequent disk reads and writes, affecting performance.

Index Join (Nested Loop Join):

Principle: Index Join traverses the index of one table and uses the index values to find matching rows in another table. It is suitable for joining small datasets.
Advantages: Suitable when one of the datasets is small, does not require building additional data structures, and can quickly locate matching rows using the index.
Disadvantages: When the data volume is large, the performance of Index Join may be limited by disk I/O, as each lookup requires disk access.

In practical applications, the choice of join algorithm depends on factors such as data scale, data distribution, and index design. Generally, if both datasets to be joined are relatively large and there is enough memory, Hash Join may be more suitable; whereas if one of the datasets is small and there is appropriate index support, Index Join may perform better.