Note:
This topic has been translated from a Chinese forum by GPT and might contain errors. Original topic: HashJoin is prone to OOM when the BuildSide is too large

When the number of records on the BuildSide of HashJoin is very large, SQL-level OOM is likely to occur, even if disk spilling is enabled.
-- Table structure (tpch table)
mysql> show create table customer \G
*************************** 1. row ***************************
Table: customer
Create Table: CREATE TABLE `customer` (
`C_CUSTKEY` bigint(20) NOT NULL,
`C_NAME` varchar(25) NOT NULL,
`C_ADDRESS` varchar(40) NOT NULL,
`C_NATIONKEY` bigint(20) NOT NULL,
`C_PHONE` char(15) NOT NULL,
`C_ACCTBAL` decimal(15,2) NOT NULL,
`C_MKTSEGMENT` char(10) NOT NULL,
`C_COMMENT` varchar(117) NOT NULL,
PRIMARY KEY (`C_CUSTKEY`) /*T![clustered_index] CLUSTERED */,
KEY `idx1` (`C_PHONE`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin
1 row in set (0.00 sec)
mysql> show create table orders \G
*************************** 1. row ***************************
Table: orders
Create Table: CREATE TABLE `orders` (
`O_ORDERKEY` bigint(20) NOT NULL,
`O_CUSTKEY` bigint(20) NOT NULL,
`O_ORDERSTATUS` char(1) NOT NULL,
`O_TOTALPRICE` decimal(15,2) NOT NULL,
`O_ORDERDATE` date NOT NULL,
`O_ORDERPRIORITY` char(15) NOT NULL,
`O_CLERK` char(15) NOT NULL,
`O_SHIPPRIORITY` bigint(20) NOT NULL,
`O_COMMENT` varchar(79) NOT NULL,
PRIMARY KEY (`O_ORDERKEY`) /*T![clustered_index] CLUSTERED */,
KEY `idx1` (`O_CUSTKEY`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin
1 row in set (0.00 sec)
-- Data volume of customer table (15 million)
mysql> select count(*) from customer;
+----------+
| count(*) |
+----------+
| 15000000 |
+----------+
1 row in set (0.57 sec)
-- Data volume of orders table (150 million)
mysql> select count(*) from orders;
+-----------+
| count(*) |
+-----------+
| 150000000 |
+-----------+
1 row in set (4.34 sec)
-- Database version
mysql> select version();
+--------------------+
| version() |
+--------------------+
| 5.7.25-TiDB-v7.0.0 |
+--------------------+
1 row in set (0.00 sec)
-- OOM action behavior
mysql> show variables like 'tidb_mem_oom_action';
+---------------------+--------+
| Variable_name | Value |
+---------------------+--------+
| tidb_mem_oom_action | CANCEL |
+---------------------+--------+
1 row in set (0.00 sec)
mysql> show variables like 'tidb_enable_rate_limit_action';
+-------------------------------+-------+
| Variable_name | Value |
+-------------------------------+-------+
| tidb_enable_rate_limit_action | OFF |
+-------------------------------+-------+
1 row in set (0.00 sec)
-- Disk spilling parameter
mysql> show variables like 'tidb_enable_tmp_storage_on_oom';
+--------------------------------+-------+
| Variable_name | Value |
+--------------------------------+-------+
| tidb_enable_tmp_storage_on_oom | ON |
+--------------------------------+-------+
1 row in set (0.00 sec)
-- SQL-level memory control is set to 1GB
mysql> show variables like 'tidb_mem_quota_query';
+----------------------+------------+
| Variable_name | Value |
+----------------------+------------+
| tidb_mem_quota_query | 1073741824 |
+----------------------+------------+
1 row in set (0.01 sec)
-- Use the orders table as the BuildSide of the hashJoin
mysql> select count(distinct O_CUSTKEY) from orders;
+---------------------------+
| count(distinct O_CUSTKEY) |
+---------------------------+
| 9999832 |
+---------------------------+
1 row in set (23.24 sec)
mysql> explain select /*+ HASH_JOIN_BUILD(b) */ count(*) from customer a,orders b where a.c_custkey=b.O_custkey and a.C_PHONE='11-746-264-1304';
+-------------------------------+--------------+-----------+--------------------------------+------------------------------------------------------------------------------+
| id | estRows | task | access object | operator info |
+-------------------------------+--------------+-----------+--------------------------------+------------------------------------------------------------------------------+
| StreamAgg_9 | 1.00 | root | | funcs:count(1)->Column#18 |
| └─HashJoin_61 | 15.20 | root | | inner join, equal:[eq(tpch100.customer.c_custkey, tpch100.orders.o_custkey)] |
| ├─IndexReader_38(Build) | 150000000.00 | root | | index:IndexFullScan_37 |
| │ └─IndexFullScan_37 | 150000000.00 | cop[tikv] | table:b, index:idx1(O_CUSTKEY) | keep order:false |
| └─IndexReader_34(Probe) | 1.00 | root | | index:IndexRangeScan_33 |
| └─IndexRangeScan_33 | 1.00 | cop[tikv] | table:a, index:idx1(C_PHONE) | range:["11-746-264-1304","11-746-264-1304"], keep order:false |
+-------------------------------+--------------+-----------+--------------------------------+------------------------------------------------------------------------------+
6 rows in set (0.00 sec)
-- Executing this statement always results in Out of Memory Quota! (The distinct value of the join field o_custkey: 9999832)
mysql> explain analyze select /*+ HASH_JOIN_BUILD(b) */ count(*) from customer a,orders b where a.c_custkey=b.O_custkey and a.C_PHONE='11-746-264-1304';
ERROR 1105 (HY000): Out Of Memory Quota![conn=3978033069293973]
-- Try a low-cardinality join field on the build table to see whether OOM still occurs
-- Here, O_ORDERDATE is used as the join field; it has only about 2.4k distinct values.
mysql> select count(distinct O_ORDERDATE) from orders;
+-----------------------------+
| count(distinct O_ORDERDATE) |
+-----------------------------+
| 2406 |
+-----------------------------+
1 row in set (2.16 sec)
mysql> explain analyze select /*+ HASH_JOIN_BUILD(b) */ count(*) from customer a,orders b where a.c_custkey=b.O_ORDERDATE and a.C_PHONE='11-746-264-1304';
ERROR 1105 (HY000): Out Of Memory Quota![conn=3978033069293973]
mysql>
OOM still occurs. So the buildSide join field (key) carries a risk of OOM regardless of whether its cardinality is high or low.
In the code that generates the hash_table on the buildSide of hashJoin, it can be seen that each key and row pointer is placed into a hashtable. Code location:
Moreover, from the method func (ht unsafeHashTable) Put(hashKey uint64, rowPtr chunk.RowPtr) it can be seen that the lower the cardinality of the key (that is, the more duplicates), the slower this hashtable is built, because the rowPtrs of records sharing a key have to be appended to that key's chain.
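As a rough illustration, here is a minimal, self-contained sketch of that chaining behavior (a simplification for intuition only, not the actual TiDB source; all type and field names are simplified):

package main

import "fmt"

// RowPtr mirrors chunk.RowPtr: a chunk index plus a row index, 8 bytes total.
type RowPtr struct {
    ChkIdx uint32
    RowIdx uint32
}

// entry is one node in a per-key chain: an 8-byte RowPtr plus an 8-byte next pointer.
type entry struct {
    ptr  RowPtr
    next *entry
}

// hashTable maps each distinct 8-byte hash key to a chain of entries.
type hashTable struct {
    m map[uint64]*entry
}

// Put prepends rowPtr to the chain for hashKey. A duplicate key reuses the
// existing map slot and costs only the new 16-byte entry; a brand-new key
// additionally pays for its 8-byte key in the map.
func (ht *hashTable) Put(hashKey uint64, rowPtr RowPtr) {
    ht.m[hashKey] = &entry{ptr: rowPtr, next: ht.m[hashKey]}
}

func main() {
    ht := &hashTable{m: make(map[uint64]*entry)}
    ht.Put(42, RowPtr{ChkIdx: 0, RowIdx: 7})
    ht.Put(42, RowPtr{ChkIdx: 0, RowIdx: 9}) // same key: the chain just grows
    fmt.Println(ht.m[42].ptr, ht.m[42].next.ptr)
}

Note that nothing in this structure is spillable: the chains live entirely in memory.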
This hashtable records the key and the corresponding row pointer and never spills to disk; only the row data itself (stored in the rowContainer) can spill to disk. However, every row that is put adds an

type entry struct {
    ptr  chunk.RowPtr
    next *entry
}

that is, a 16-byte row-pointer record, plus an 8-byte deduplicated hashkey record. Therefore, even if we do not count the row data and the data cached from kvrequest (i.e., the entire 1GB of memory is given to this hashtable), it can only cache the pointers of approximately 1024*1024*1024/16 = 67108864 records (about 67.1 million, when keys are heavily duplicated), or 1024*1024*1024/24 = 44739242 records (about 44.7 million, when keys are almost all distinct).
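A quick back-of-the-envelope check of those capacity numbers (using the 1GB tidb_mem_quota_query shown above):

package main

import "fmt"

func main() {
    const quotaBytes = 1 << 30 // tidb_mem_quota_query = 1073741824 (1GB)

    // Low-cardinality keys: the 8-byte hash key is shared by many rows,
    // so each additional row costs only one 16-byte entry (8B RowPtr + 8B next).
    fmt.Println(quotaBytes / 16) // 67108864, about 67.1 million row pointers

    // Nearly unique keys: every row also pays for its own 8-byte hash key.
    fmt.Println(quotaBytes / 24) // 44739242, about 44.7 million row pointers
}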
To verify this, limit the buildSide (key = O_ORDERDATE, low cardinality) to 80 million, 70 million, and 60 million records, and observe whether OOM occurs:
mysql> explain analyze select /*+ HASH_JOIN_BUILD(b) */ count(*) from customer a,(select * from orders limit 80000000) b where a.c_custkey=b.O_ORDERDATE and a.C_PHONE='11-746-264-1304';
ERROR 1105 (HY000): Out Of Memory Quota![conn=3978033069293973]
mysql> explain analyze select /*+ HASH_JOIN_BUILD(b) */ count(*) from customer a,(select * from orders limit 70000000) b where a.c_custkey=b.O_ORDERDATE and a.C_PHONE='11-746-264-1304';
ERROR 1105 (HY000): Out Of Memory Quota![conn=3978033069293973]
mysql>
mysql> explain analyze select /*+ HASH_JOIN_BUILD(b) */ count(*) from customer a,(select * from orders limit 60000000) b where a.c_custkey=b.O_ORDERDATE and a.C_PHONE='11-746-264-1304';
+------------------------------------+-------------+----------+-----------+------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------+-----------+---------+
| id | estRows | actRows | task | access object | execution info | operator info | memory | disk |
+------------------------------------+-------------+----------+-----------+------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------+-----------+---------+
| HashAgg_14 | 1.00 | 1 | root | | time:16.2s, loops:2, RRU:149686.360502, WRU:0.000000, partial_worker:{wall_time:16.161443238s, concurrency:5, task_num:0, tot_wait:1m20.806052767s, tot_exec:0s, tot_time:1m20.806055402s, max:16.161212139s, p95:16.161212139s}, final_worker:{wall_time:16.161945916s, concurrency:5, task_num:0, tot_wait:1m20.806094127s, tot_exec:1.556µs, tot_time:1m20.806097792s, max:16.161223448s, p95:16.161223448s} | funcs:count(1)->Column#18 | 6.15 KB | N/A |
| └─HashJoin_16 | 24937.66 | 0 | root | | time:16.2s, loops:1, build_hash_table:{total:16.2s, fetch:4.74s, build:11.4s}, probe:{concurrency:5, total:1m20.8s, max:16.2s, probe:13.1µs, fetch:1m20.8s} | inner join, equal:[eq(Column#19, Column#20)] | 1019.3 MB | 1.34 GB |
| ├─Projection_20(Build) | 60000000.00 | 60000000 | root | | time:6.85s, loops:58730, Concurrency:5 | cast(tpch100.orders.o_orderdate, double BINARY)->Column#20 | 87.8 KB | N/A |
| │ └─Limit_21 | 60000000.00 | 60000000 | root | | time:340.6ms, loops:58730 | offset:0, count:60000000 | N/A | N/A |
| │ └─TableReader_25 | 60000000.00 | 60000965 | root | | time:283.2ms, loops:58729, cop_task: {num: 2151, max: 108.6ms, min: 363.8µs, avg: 20.1ms, p95: 51.2ms, max_proc_keys: 50144, p95_proc_keys: 50144, tot_proc: 29.7s, tot_wait: 821.9ms, rpc_num: 2151, rpc_time: 43.1s, copr_cache_hit_ratio: 0.00, build_task_duration: 277.5µs, max_distsql_concurrency: 15} | data:Limit_24 | 6.14 MB | N/A |
| │ └─Limit_24 | 60000000.00 | 60004037 | cop[tikv] | | tikv_task:{proc max:98ms, min:0s, avg: 12.6ms, p80:22ms, p95:36ms, iters:67116, tasks:2151}, scan_detail: {total_process_keys: 60004037, total_process_keys_size: 9114363007, total_keys: 60006188, get_snapshot_time: 313.8ms, rocksdb: {key_skipped_count: 60004037, block: {cache_hit_count: 318616}}} | offset:0, count:60000000 | N/A | N/A |
| │ └─TableFullScan_23 | 60000000.00 | 60004037 | cop[tikv] | table:orders | tikv_task:{proc max:98ms, min:0s, avg: 12.6ms, p80:22ms, p95:36ms, iters:67116, tasks:2151} | keep order:false | N/A | N/A |
| └─Projection_17(Probe) | 1.00 | 1 | root | | time:2.07ms, loops:2, Concurrency:OFF | cast(tpch100.customer.c_custkey, double BINARY)->Column#19 | 8.86 KB | N/A |
| └─IndexReader_19 | 1.00 | 1 | root | | time:1.97ms, loops:2, cop_task: {num: 1, max: 2.27ms, proc_keys: 0, tot_proc: 2.21µs, tot_wait: 518.6µs, rpc_num: 1, rpc_time: 2.25ms, copr_cache_hit_ratio: 1.00, build_task_duration: 8.62µs, max_distsql_concurrency: 1} | index:IndexRangeScan_18 | 298 Bytes | N/A |
| └─IndexRangeScan_18 | 1.00 | 1 | cop[tikv] | table:a, index:idx1(C_PHONE) | tikv_task:{time:12ms, loops:1}, scan_detail: {get_snapshot_time: 252.4µs, rocksdb: {block: {}}} | range:["11-746-264-1304","11-746-264-1304"], keep order:false | N/A | N/A |
+------------------------------------+-------------+----------+-----------+------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------+-----------+---------+
10 rows in set (16.32 sec)
It can be seen that this is basically consistent with the expectation: 70,000,000 * 16 = 1,120,000,000 bytes exceeds the 1,073,741,824-byte quota, while 60,000,000 * 16 = 960,000,000 bytes fits (the HashJoin above peaks at 1019.3 MB of memory even while spilling 1.34 GB of row data to disk). Therefore, in the current hash_join operator, when the buildSide has too many records, the statement-level memory quota is easily exceeded and the query is cancelled with Out Of Memory Quota!.
Could this be enhanced so that when the number of records on the buildSide is too large (greater than the statement-level memory quota multiplied by some coefficient), a different algorithm is chosen? It might be somewhat slower, but it would support a hashJoin with a much larger buildSide (for example, partitioning the build and probe sides into batches and completing the computation step by step, as other databases do).
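For reference, the partitioned ("grace") hash join used by other databases hashes both inputs into k partitions, then joins build partition i against probe partition i, so only one partition's hash table must fit in memory at a time. A minimal in-memory sketch of the idea (a hypothetical illustration, not TiDB code; a real engine would write partitions to temporary files and pick k so that one partition's hash table fits within the memory quota):

package main

import "fmt"

type Row struct {
    Key int
    Val string
}

// graceHashJoin splits build and probe rows into k partitions by join key,
// then joins matching partitions one at a time. Only one partition's hash
// table is resident in memory at any moment.
func graceHashJoin(build, probe []Row, k int) [][2]Row {
    buildParts := make([][]Row, k)
    probeParts := make([][]Row, k)
    for _, r := range build {
        buildParts[r.Key%k] = append(buildParts[r.Key%k], r)
    }
    for _, r := range probe {
        probeParts[r.Key%k] = append(probeParts[r.Key%k], r)
    }

    var out [][2]Row
    for i := 0; i < k; i++ {
        ht := make(map[int][]Row, len(buildParts[i])) // per-partition hash table
        for _, b := range buildParts[i] {
            ht[b.Key] = append(ht[b.Key], b)
        }
        for _, p := range probeParts[i] {
            for _, b := range ht[p.Key] {
                out = append(out, [2]Row{b, p})
            }
        }
    }
    return out
}

func main() {
    build := []Row{{1, "b1"}, {2, "b2"}, {1, "b3"}}
    probe := []Row{{1, "p1"}, {3, "p3"}}
    fmt.Println(graceHashJoin(build, probe, 4))
}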