Performance Differences Between TiDB Hash Join and Nested Join (Index Join)

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIDB hashjoin与nestjoin【indexjoin】的性能差异

| username: residentevil

[TiDB Usage Environment] Production Environment
[TiDB Version] V6.5.8
[Encountered Problem: Problem Description and Impact] From the community documentation, I have a preliminary understanding of hash join and index join [用 EXPLAIN 查看 JOIN 查询的执行计划 | PingCAP 文档中心]. Has anyone compared the performance of these two types of joins? Scenario: The number of rows in the associated fields ranges from 100,000 to several million. Theoretically, hash join should definitely perform better [without considering memory consumption], but I am not sure how significant the performance difference is.

| username: onlyacat | Original post link

You already have the data, just add a hint explain and compare to find out.

| username: residentevil | Original post link

I don’t have any data here yet, just wanted to ask if anyone has business data for this kind of scenario, so I can understand the performance differences.

| username: redgame | Original post link

It depends on the data distribution, indexing, and query conditions, etc.

| username: TiDBer_aaO4sU46 | Original post link

Not necessarily, different scenarios yield different performance.

| username: TiDBer_ivan0927 | Original post link

Generally speaking, when a join operation involves a large amount of data, Hash Join is usually a good choice because it can fully utilize hardware resources to accelerate the processing. When a join operation involves a large table and a small table, Nested Join is usually a better choice because it can effectively use indexes to reduce the amount of scanned data.

| username: 江湖故人 | Original post link

Even if it goes wrong, Hash Join is usually an acceptable result, whereas Index Join might lead to complaints from the business.

| username: TiDBer_小阿飞 | Original post link

The SQL optimizer needs to determine the join order of the data tables and decide which join algorithm is the most efficient for a specific SQL statement. First, create a few related tables, test once without indexes, then test again with indexes. The specific outcome also depends on which algorithm the optimizer chooses, which can vary due to differences in cluster deployment.

| username: h5n1 | Original post link

Intermediate data writing to disk is slow.

| username: DBAER | Original post link

You can just test it directly and get the results.

| username: residentevil | Original post link

This conclusion makes sense, haha.

| username: Raymond | Original post link

For two tables with tens of millions of rows, it is not surprising that the speed of a hash join is twice as fast as an index join.

| username: YuchongXU | Original post link

The number of tables matters.

| username: 哈喽沃德 | Original post link

  1. Hash Join:
  • Principle: Hash Join constructs hash tables for the two datasets to be joined and then matches them using hash values. It is suitable for joining large datasets.
  • Advantages: High efficiency for joining large datasets, especially when there is sufficient memory, as it can reduce disk I/O overhead.
  • Disadvantages: If the data cannot be fully loaded into memory, it may lead to frequent disk reads and writes, affecting performance.
  1. Index Join (Nested Loop Join):
  • Principle: Index Join traverses the index of one table and uses the index values to find matching rows in another table. It is suitable for joining small datasets.
  • Advantages: Suitable when one of the datasets is small, does not require building additional data structures, and can quickly locate matching rows using the index.
  • Disadvantages: When the data volume is large, the performance of Index Join may be limited by disk I/O, as each lookup requires disk access.

In practical applications, the choice of join algorithm depends on factors such as data scale, data distribution, and index design. Generally, if both datasets to be joined are relatively large and there is enough memory, Hash Join may be more suitable; whereas if one of the datasets is small and there is appropriate index support, Index Join may perform better.

| username: TiDBer_5Vo9nD1u | Original post link

Learned.

| username: residentevil | Original post link

Let me generate some data for testing.

| username: kelvin | Original post link

Here to learn.

| username: 小于同学 | Original post link

Share the test results once you have them.

| username: residentevil | Original post link

Absolutely, I will conduct stress tests on related fields in four scenarios: 10,000, 100,000, 1,000,000, and 10,000,000.

| username: dba远航 | Original post link

The larger the amount of associated data, the better the hash join.