Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: tidb index join 原理疑问
May I ask why during the index join process in TiDB, an inner table row hash table is generated? I always thought that for example, a join b on a.id=b.id (assuming a is the outer table), data is taken from table a and then sequentially joined with table b based on the join condition. Why is a hash table generated in table b?
If you consider it as a single database (since all the data is on one node), you definitely won’t be able to understand this description.
Assuming the data is distributed across multiple nodes, the inner worker needs to fetch data from N nodes and then return it. After the data is returned, how can it be associated with the data of the outer worker? The simplest structure is a hash, associating through key values.
Additionally, the returned information needs to be temporarily stored and can only be released after the computation with the outer worker is completed…
You should understand once you read this sentence.
In traditional databases, an outer table record is matched with related records in the inner table by querying the cache one row at a time. However, TiDB uses a distributed architecture with separated storage and computation. If it were to query row by row, it would need to access TiKV, resulting in significant network interactions. Therefore, TiDB has implemented some optimizations:
- Instead of processing row by row, it processes in batches, changing from Row to Chunk. Without considering parallelism, the outer table first retrieves a chunk of data (with the chunk size increasing gradually), then performs filtering, deduplication, sorting, and other operations to form a set of keyRanges. These are then handed over to the TiDB backend to be organized into cop_task tasks and sent to TiKV for execution. However, after TiKV retrieves all the data corresponding to a chunk from the outer table, it cannot determine which records in the chunk correspond to the outer table. Therefore, it uses a hash table to organize the data, allowing the chunk from the outer table to be matched with the data retrieved from the inner table.
In simple terms: Traditional databases do not need to use a hash table because the outer table processes one row of data at a time, matching directly. TiDB, on the other hand, organizes data in chunks and needs to use a method similar to small table hash join.
However, after TiKV obtains all the data corresponding to a chunk of the outer table, this data cannot determine which records correspond to this chunk of the outer table. ----> Can’t this be determined based on the join conditions in the SQL statement?
Isn’t using the outer table for hashing an index hash join?
Determining based on the join condition is done within the hash table.