HashJoin consumes too much memory and does not spill to disk in time

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: HashJoin占用内存过大,落盘不及时

| username: 人如其名

[TiDB Usage Environment] Poc
[TiDB Version] 6.5.0
[Reproduction Path] Components involved: tidb, tikv. When executing the TPCH Q5 statement with default parameters, it was found that the statement-level OOM always occurred under the wrong execution plan.
The Q5 statement is originally intended to perform grouping and aggregation on the result set, etc., for simplification. The test statement and execution plan are as follows (since the tables are relatively small, no statistics were collected for the nation and region tables, while other tables were collected, leading to the wrong execution plan):

select 
        count(*)
from
        customer,
        orders,
        lineitem,
        supplier,
        nation,
        region
where
        c_custkey = o_custkey
        and l_orderkey = o_orderkey
        and l_suppkey = s_suppkey
        and c_nationkey = s_nationkey
        and s_nationkey = n_nationkey
        and n_regionkey = r_regionkey
        and r_name = 'ASIA'
        and o_orderdate >= date '1994-01-01'
        and o_orderdate < date '1994-01-01' + interval '1' year;

+------------------------------------------------+-------------+-----------+----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id                                             | estRows     | task      | access object  | operator info                                                                                                                                                                                                                 |
+------------------------------------------------+-------------+-----------+----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| HashAgg_24                                     | 1.00        | root      |                | funcs:count(1)->Column#48                                                                                                                                                                                                     |
| └─IndexJoin_30                                 | 8738986.41  | root      |                | inner join, inner:TableReader_27, outer key:tpch.orders.o_orderkey, inner key:tpch.lineitem.l_orderkey, equal cond:eq(tpch.orders.o_orderkey, tpch.lineitem.l_orderkey), eq(tpch.supplier.s_suppkey, tpch.lineitem.l_suppkey) |
|   ├─HashJoin_37(Build)                         | 2196527.02  | root      |                | inner join, equal:[eq(tpch.customer.c_custkey, tpch.orders.o_custkey)]                                                                                                                                                        |
|   │ ├─HashJoin_39(Build)                       | 1431131.00  | root      |                | inner join, equal:[eq(tpch.supplier.s_nationkey, tpch.customer.c_nationkey)]                                                                                                                                                  |
|   │ │ ├─HashJoin_41(Build)                     | 25.00       | root      |                | inner join, equal:[eq(tpch.nation.n_nationkey, tpch.supplier.s_nationkey)]                                                                                                                                                    |
|   │ │ │ ├─HashJoin_54(Build)                   | 0.01        | root      |                | inner join, equal:[eq(tpch.region.r_regionkey, tpch.nation.n_regionkey)]                                                                                                                                                      |
|   │ │ │ │ ├─TableReader_59(Build)              | 0.01        | root      |                | data:Selection_58                                                                                                                                                                                                             |
|   │ │ │ │ │ └─Selection_58                     | 0.01        | cop[tikv] |                | eq(tpch.region.r_name, "ASIA")                                                                                                                                                                                                |
|   │ │ │ │ │   └─TableFullScan_57               | 5.00        | cop[tikv] | table:region   | keep order:false, stats:pseudo                                                                                                                                                                                                |
|   │ │ │ │ └─TableReader_56(Probe)              | 25.00       | root      |                | data:TableFullScan_55                                                                                                                                                                                                         |
|   │ │ │ │   └─TableFullScan_55                 | 25.00       | cop[tikv] | table:nation   | keep order:false, stats:pseudo                                                                                                                                                                                                |
|   │ │ │ └─TableReader_61(Probe)                | 100000.00   | root      |                | data:TableFullScan_60                                                                                                                                                                                                         |
|   │ │ │   └─TableFullScan_60                   | 100000.00   | cop[tikv] | table:supplier | keep order:false                                                                                                                                                                                                              |
|   │ │ └─TableReader_63(Probe)                  | 1500000.00  | root      |                | data:TableFullScan_62                                                                                                                                                                                                         |
|   │ │   └─TableFullScan_62                     | 1500000.00  | cop[tikv] | table:customer | keep order:false                                                                                                                                                                                                              |
|   │ └─TableReader_66(Probe)                    | 2196527.02  | root      |                | data:Selection_65                                                                                                                                                                                                             |
|   │   └─Selection_65                           | 2196527.02  | cop[tikv] |                | ge(tpch.orders.o_orderdate, 1994-01-01), lt(tpch.orders.o_orderdate, 1995-01-01)                                                                                                                                              |
|   │     └─TableFullScan_64                     | 14457280.00 | cop[tikv] | table:orders   | keep order:false                                                                                                                                                                                                              |
|   └─TableReader_27(Probe)                      | 2196527.02  | root      |                | data:TableRangeScan_26                                                                                                                                                                                                        |
|     └─TableRangeScan_26                        | 2196527.02  | cop[tikv] | table:lineitem | range: decided by [eq(tpch.lineitem.l_orderkey, tpch.orders.o_orderkey)], keep order:false                                                                                                                                    |
+------------------------------------------------+-------------+-----------+----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
20 rows in set (0.00 sec)

When the parameters are set to default, the memory-related parameters are:

mysql> show variables like 'tidb_mem_quota_query';
+----------------------+------------+
| Variable_name        | Value      |
+----------------------+------------+
| tidb_mem_quota_query | 1073741824 |
+----------------------+------------+
1 row in set (0.01 sec)
mysql> show variables like '%oom%';
+--------------------------------+--------+
| Variable_name                  | Value  |
+--------------------------------+--------+
| tidb_enable_tmp_storage_on_oom | ON     |
| tidb_mem_oom_action            | CANCEL |
+--------------------------------+--------+
2 rows in set (0.00 sec)

mysql> show config where name like 'tmp-storage%';
+------+---------------------+-------------------+-----------------------------------------------------------------+
| Type | Instance            | Name              | Value                                                           |
+------+---------------------+-------------------+-----------------------------------------------------------------+
| tidb | 192.168.31.201:4002 | tmp-storage-path  | /tmp/1000_tidb/MC4wLjAuMDo0MDAyLzAuMC4wLjA6MTAwODI=/tmp-storage |
| tidb | 192.168.31.201:4002 | tmp-storage-quota | -1                                                              |
+------+---------------------+-------------------+-----------------------------------------------------------------+
2 rows in set (0.01 sec)

Executing the statement:

mysql> explain analyze select          count(*) from         customer,         orders,         lineitem,         supplier,         nation,         region where         c_custkey = o_custkey         and l_orderkey = o_orderkey
       and l_suppkey = s_suppkey         and c_nationkey = s_nationkey         and s_nationkey = n_nationkey         and n_regionkey = r_regionkey         and r_name = 'ASIA'         and o_orderdate >= date '1994-01-01'
 and o_orderdate < date '1994-01-01' + interval '1' year;
ERROR 1105 (HY000): context canceled
mysql> explain analyze select          count(*) from         customer,         orders,         lineitem,         supplier,         nation,         region where         c_custkey = o_custkey         and l_orderkey = o_orderkey
       and l_suppkey = s_suppkey         and c_nationkey = s_nationkey         and s_nationkey = n_nationkey         and n_regionkey = r_regionkey         and r_name = 'ASIA'         and o_orderdate >= date '1994-01-01'
 and o_orderdate < date '1994-01-01' + interval '1' year;
ERROR 2013 (HY000): Lost connection to MySQL server during query
No connection. Trying to reconnect...
Connection id:    415
Current database: tpch

ERROR 1105 (HY000): Out Of Memory Quota![conn_id=293158387267404191]

As you can see, OOM occurred. Here, I have already set up disk spilling, but it was still canceled!

Modify the parameter set global tidb_mem_oom_action=LOG; to only log when memory exceeds, without killing the statement. However, according to the OOM priority, it will still try to spill to disk as much as possible. Here, observe how much memory the statement actually uses during execution. The explain analyze execution result is as follows:

+------------------------------------------------+-------------+------------+-----------+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+-----------+
| id                                             | estRows     | actRows    | task      | access object  | execution info                                                                                                                                                                                                                                                                                                                                                                                                              | operator info                                                                                                                                                                                                                 | memory    | disk      |
+------------------------------------------------+-------------+------------+-----------+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+-----------+
| HashAgg_24                                     | 1.00        | 1          | root      |                | time:40m16.7s, loops:2, partial_worker:{wall_time:40m16.708802251s, concurrency:5, task_num:72, tot_wait:3h21m23.538992743s, tot_exec:2.902767ms, tot_time:3h21m23.543541588s, max:40m16.708722824s, p95:40m16.708722824s}, final_worker:{wall_time:40m16.709129743s, concurrency:5, task_num:5, tot_wait:3h21m23.543673912s, tot_exec:342.368µs, tot_time:3h21m23.544019075s, max:40m16.709041758s, p95:40m16.709041758s}  | funcs:count(1)->Column#48                                                                                                                                                                                                     | 165.3 KB  | N/A       |
| └─IndexJoin_30                                 | 8738986.41  | 72985      | root      |                | time:40m16.7s, loops:73, inner:{total:12m43.5s, concurrency:5, task:71508, construct:10m57.2s, fetch:1m44.8s, build:1.33s}, probe:2m15s                                                                                                                                                                                                                                                                                     | inner join, inner:TableReader_27, outer key:tpch.orders.o_orderkey, inner key:tpch.lineitem.l_orderkey, equal cond:eq(tpch.orders.o_orderkey, tpch.lineitem.l_orderkey), eq(tpch.supplier.s_suppkey, tpch.lineitem.l_suppkey) | 10.2 MB   | N/A       |
|   ├─HashJoin_37(Build)                         | 2196527.02  | 1830400706 | root      |                | time:39m51.8s, loops:1787505, build_hash_table:{total:15m1s, fetch:11m33.8s, build:3m27.3s}, probe:{concurrency:5, total:3h21m17.6s, max:40m16.7s, probe:2h6m12.3s, fetch:1h15m5.3s}                                                                                                                                                                                                                                        | inner join, equal:[eq(tpch.customer.c_custkey, tpch.orders.o_custkey)]                                                                                                                                                        | 17.9 GB   | 44.8 GB   |
|   │ ├─HashJoin_39(Build)                       | 1431131.00  | 1203307921 | root      |                | time:12m31.8s, loops:1175109, build_hash_table:{total:15.5ms, fetch:13.4ms, build:2.11ms}, probe:{concurrency:5, total:1h15m1.6s, max:15m1s, probe:1h15m1.3s, fetch:245ms}                                                                                                                                                                                                                                                  | inner join, equal:[eq(tpch.supplier.s_nationkey, tpch.customer.c_nationkey)]                                                                                                                                                  | 846.4 KB  | 782.7 KB  |
|   │ │ ├─HashJoin_41(Build)                     | 25.00       | 20037      | root      |                | time:14.5ms, loops:23, build_hash_table:{total:2.36ms, fetch:2.35ms, build:10.1µs}, probe:{concurrency:5, total:75.2ms, max:15.2ms, probe:15ms, fetch:60.2ms}                                                                                                                                                                                                                                                               | inner join, equal:[eq(tpch.nation.n_nationkey, tpch.supplier.s_nationkey)]                                                                                                                                                    | 1.59 KB   | 120 Bytes |
|   │ │ │ ├─HashJoin_54(Build)                   | 0.01        | 5          | root      |                | time:2.34ms, loops:2, build_hash_table:{total:2.14ms, fetch:2.13ms, build:10.1µs}, probe:{concurrency:5, total:10.7ms, max:2.22ms, probe:66.6µs, fetch:10.6ms}                                                                                                                                                                                                                                                              | inner join, equal:[eq(tpch.region.r_regionkey, tpch.nation.n_regionkey)]                                                                                                                                                      | 31.1 KB   | 36 Bytes  |
|   │ │ │ │ ├─TableReader_59(Build)              | 0.01        | 1          | root      |                | time:2.11ms, loops:2, cop_task: {num: 1, max: 2.6ms, proc_keys: 5, rpc_num: 1, rpc_time: 2.57ms, copr_cache_hit_ratio: 0.00, distsql_concurrency: 15}                                                                                                                                                                                                                                                                       | data:Selection_58                                                                                                                                                                                                             | 312 Bytes | N/A       |
|   │ │ │ │ │ └─Selection_58                     | 0.01        | 1          | cop[tikv] |                | tikv_task:{time:0s, loops:1}, scan_detail: {total_process_keys: 5, total_process_keys_size: 559, total_keys: 6, get_snapshot_time: 3.1µs, rocksdb: {key_skipped_count: 5, block: {cache_hit_count: 4}}}                                                                                                                                                                                                                     | eq(tpch.region.r_name, "ASIA")                                                                                                                                                                                                | N/A       | N/A       |
|   │ │ │ │ │   └─TableFullScan_57               | 5.00        | 5          | cop[tikv] | table:region   | tikv_task:{time:0s, loops:1}                                                                                                                                                                                                                                                                                                                                                                                                | keep order:false, stats:pseudo                                                                                                                                                                                                | N/A       | N/A       |
|   │ │ │ │ └─TableReader_56(Probe)              | 25.00       | 25         | root      |                | time:469.1µs, loops:2, cop_task: {num: 1, max: 961.6µs, proc_keys: 25, rpc_num: 1, rpc_time: 938.3µs, copr_cache_hit_ratio: 0.00, distsql_concurrency: 15}                                                                                                                                                                                                                                                                  | data:TableFullScan_55                                                                                                                                                                                                         | 679 Bytes | N/A       |
|   │ │ │ │   └─TableFullScan_55                 | 25.00       | 25         | cop[tikv] | table:nation   | tikv_task:{time:0s, loops:1}, scan_detail: {total_process_keys: 25, total_process_keys_size: 3109, total_keys: 26, get_snapshot_time: 7.15µs, rocksdb: {key_skipped_count: 25, block: {cache_hit_count: 4}}}                                                                                                                                                                                                                | keep order:false, stats:pseudo                                                                                                                                                                                                | N/A       | N/A       |
|   │ │ │ └─TableReader_61(Probe)                | 100000.00   | 100000     | root      |                | time:11.8ms, loops:100, cop_task: {num: 10, max: 2.55ms, min: 770.1µs, avg: 1.29ms, p95: 2.55ms, max_proc_keys: 480, p95_proc_keys: 480, rpc_num: 10, rpc_time: 12.8ms, copr_cache_hit_ratio: 0.90, distsql_concurrency: 15}                                                                                                                                                                                                | data:TableFullScan_60                                                                                                                                                                                                         | 964.4 KB  | N/A       |
|   │ │ │   └─TableFullScan_60                   | 100000.00   | 100000     | cop[tikv] | table:supplier | tikv_task:{proc max:40ms, min:1ms, avg: 11ms, p80:20ms, p95:40ms, iters:137, tasks:10}, scan_detail: {total_process_keys: 480, total_process_keys_size: 87745, total_keys: 481, get_snapshot_time: 254.5µs, rocksdb: {key_skipped_count: 480, block: {cache_hit_count: 6}}}                                                                                                                                                 | keep order:false                                                                                                                                                                                                              | N/A       | N/A       |
|   │ │ └─TableReader_63(Probe)                  | 1500000.00  | 1500000    | root      |                | time:153ms, loops:1468, cop_task: {num: 54, max: 31ms, min: 242
| username: cs58_dba | Original post link

It is more appropriate to run such a large table in the CDH big data environment.

| username: Lucien-卢西恩 | Original post link

It might be due to this reason. You can verify it by turning off tidb_enable_tmp_storage_on_oom and trying again. https://docs.pingcap.com/zh/tidb/dev/system-variables#tidb_mem_oom_action-从-v610-版本开始引入

| username: 人如其名 | Original post link

This is unrelated, the red-circled cannot reuse temporary disk should mean that if there is one, it will be used first, and if not, it will cancel. If tidb_enable_tmp_storage_on_oom=OFF, it will definitely be canceled, which is even less in line with the requirements. Here, I think it should first write to disk, and only cancel when there is no space on the disk (it will not use more memory than tidb_mem_quota_query). In my actual test, the disk was not full.

The hash table related to hash join should be here: tidb/executor/join.go at 706c3fa3c526cdba5b3e9f066b1a568fb96c56e3 · pingcap/tidb · GitHub
MemTracker.FallbackOldAndSetNewAction is a method that adds new actions to the action queue on the current object and then reorders them based on the priority attribute of the actions, with higher priority actions being executed first. When disk spilling is enabled, its priority is higher than cancel. The priority order is: tidb/util/memory/action.go at 706c3fa3c526cdba5b3e9f066b1a568fb96c56e3 · pingcap/tidb · GitHub
If throttling is enabled, it will throttle first, then spill to disk, and then (if oom-action=log, it will only log a warning, if cancel, it will directly kill). Therefore, the logic of spilling to disk and canceling does not conflict; it should spill to disk first, and cancel only if it cannot spill. The issue here seems to be that the batch size for spilling the hash table is too large, causing insufficient memory and thus being killed. However, I’m not sure if this is the case, and if it is, whether there should be some optimization at the code level.

| username: Lucien-卢西恩 | Original post link

If the disk is not full, it should not cancel. I will test it later.

| username: 人如其名 | Original post link

Here are the test results, which are the results of repeated tests.
It might be difficult for you to reproduce because it requires the nation and region tables to not collect statistical information (pseudo).

If you need to reproduce, you should clear the statistical information of these two tables (you can clear the statistical information by rebuilding the tables).

| username: 人如其名 | Original post link

Mark, the OOM here is mainly due to an internal cross join. You can see the issue here: Turning on disk placement causes the tidb node to oom · Issue #40500 · pingcap/tidb · GitHub. It should be fixed in the future.