Semi-join execution is too slow when there are many duplicate values in the subquery

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 子查询重复值多的情况下半连接执行太慢

| username: 人如其名

【TiDB Usage Environment】Test
【TiDB Version】5.7.25-TiDB-v7.3.0

mysql> explain analyze select count(*) from customer a where exists (select 1 from customer b where a.C_NATIONKEY=b.C_NATIONKEY);
+-------------------------------+------------+---------+-----------+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------+----------+---------+
| id                            | estRows    | actRows | task      | access object | execution info                                                                                                                                                                                                                                                                                                                                           | operator info                                                                   | memory   | disk    |
+-------------------------------+------------+---------+-----------+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------+----------+---------+
| StreamAgg_11                  | 1.00       | 1       | root      |               | time:1h15m6.9s, loops:2                                                                                                                                                                                                                                                                                                                                  | funcs:count(1)->Column#18                                                       | 8 Bytes  | N/A     |
| └─HashJoin_17                 | 1200000.00 | 1500000 | root      |               | time:1h15m6.9s, loops:1467, build_hash_table:{total:585.7ms, fetch:473.8ms, build:111.9ms}, probe:{concurrency:5, total:6h15m19.3s, max:1h15m6.9s, probe:6h15m14.7s, fetch:4.65s}                                                                                                                                                                        | semi join, equal:[eq(tpch10.customer.c_nationkey, tpch10.customer.c_nationkey)] | 46.3 MB  | 0 Bytes |
|   ├─TableReader_16(Build)     | 1500000.00 | 1500000 | root      |               | time:471.4ms, loops:1468, cop_task: {num: 54, max: 66.6ms, min: 1.34ms, avg: 26.2ms, p95: 62.3ms, max_proc_keys: 50144, p95_proc_keys: 50144, tot_proc: 1.19s, tot_wait: 15ms, rpc_num: 54, rpc_time: 1.41s, copr_cache_hit_ratio: 0.00, distsql_concurrency: 15}                                                                                        | data:TableFullScan_15                                                           | 785.3 KB | N/A     |
|   │ └─TableFullScan_15        | 1500000.00 | 1500000 | cop[tikv] | table:b       | tikv_task:{proc max:61ms, min:0s, avg: 21.9ms, p80:41ms, p95:53ms, iters:1678, tasks:54}, scan_detail: {total_process_keys: 1500000, total_process_keys_size: 305225771, total_keys: 1500054, get_snapshot_time: 1.58ms, rocksdb: {key_skipped_count: 1500000, block: {cache_hit_count: 160, read_count: 4992, read_byte: 112.6 MB, read_time: 57.9ms}}} | keep order:false                                                                | N/A      | N/A     |
|   └─TableReader_14(Probe)     | 1500000.00 | 1500000 | root      |               | time:88.1ms, loops:1468, cop_task: {num: 54, max: 50ms, min: 624.9µs, avg: 13.8ms, p95: 45.9ms, max_proc_keys: 50144, p95_proc_keys: 50144, tot_proc: 71ms, tot_wait: 111ms, rpc_num: 54, rpc_time: 745.2ms, copr_cache_hit_ratio: 0.70, distsql_concurrency: 15}                                                                                        | data:TableFullScan_13                                                           | 1.53 MB  | N/A     |
|     └─TableFullScan_13        | 1500000.00 | 1500000 | cop[tikv] | table:a       | tikv_task:{proc max:61ms, min:0s, avg: 21ms, p80:41ms, p95:53ms, iters:1678, tasks:54}, scan_detail: {total_process_keys: 191231, total_process_keys_size: 38891884, total_keys: 191247, get_snapshot_time: 71.8ms, rocksdb: {key_skipped_count: 191231, block: {cache_hit_count: 670, read_count: 15, read_byte: 1019.5 KB, read_time: 1.6ms}}}         | keep order:false                                                                | N/A      | N/A     |
+-------------------------------+------------+---------+-----------+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------+----------+---------+
6 rows in set (1 hour 15 min 6.93 sec)

Table structure is as follows:

mysql> show create table customer \G
*************************** 1. row ***************************
       Table: customer
Create Table: CREATE TABLE `customer` (
  `C_CUSTKEY` bigint(20) NOT NULL,
  `C_NAME` varchar(25) NOT NULL,
  `C_ADDRESS` varchar(40) NOT NULL,
  `C_NATIONKEY` bigint(20) NOT NULL,
  `C_PHONE` char(15) NOT NULL,
  `C_ACCTBAL` decimal(15,2) NOT NULL,
  `C_MKTSEGMENT` char(10) NOT NULL,
  `C_COMMENT` varchar(117) NOT NULL,
  PRIMARY KEY (`C_CUSTKEY`) /*T![clustered_index] CLUSTERED */,
  KEY `idx2` (`C_PHONE`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin
1 row in set (0.01 sec)

mysql> select count(C_NATIONKEY),count(distinct C_NATIONKEY) from customer;
+--------------------+-----------------------------+
| count(C_NATIONKEY) | count(distinct C_NATIONKEY) |
+--------------------+-----------------------------+
|            1500000 |                          25 |
+--------------------+-----------------------------+
1 row in set (0.12 sec)

Why is the semi-join so slow? It should normally return results in a few seconds, right?

| username: 像风一样的男子 | Original post link

This SQL takes over an hour to run? That’s a bit extreme.

| username: Fly-bird | Original post link

Check the cluster resource utilization.

| username: Kongdom | Original post link

I don’t understand. What is this query trying to check???

explain analyze select count(*) from customer a where exists (select 1 from customer b where a.C_NATIONKEY=b.C_NATIONKEY);

| username: 人如其名 | Original post link

This is a purely technical question about why semi-joins are slow; the focus is only on the semi-join itself.
Let me explain the overall background of this issue:
In real scenarios, we often encounter subqueries like this: select xx from a where a.col1 in (select b.col1 from b). By default, the TiDB optimizer deduplicates b.col1 and then joins the result with a.col1. For example:

mysql> explain select count(*) from customer where c_custkey in (select o_custkey from orders);
+----------------------------------+-------------+-----------+----------------+----------------------------------------------------------------------------------------------------+
| id                               | estRows     | task      | access object  | operator info                                                                                      |
+----------------------------------+-------------+-----------+----------------+----------------------------------------------------------------------------------------------------+
| StreamAgg_12                     | 1.00        | root      |                | funcs:count(1)->Column#18                                                                          |
| └─HashJoin_43                    | 1009664.00  | root      |                | inner join, equal:[eq(tpch10.customer.c_custkey, tpch10.orders.o_custkey)]                         |
|   ├─HashAgg_28(Build)            | 1009664.00  | root      |                | group by:tpch10.orders.o_custkey, funcs:firstrow(tpch10.orders.o_custkey)->tpch10.orders.o_custkey |
|   │ └─TableReader_29             | 1009664.00  | root      |                | data:HashAgg_24                                                                                    |
|   │   └─HashAgg_24               | 1009664.00  | cop[tikv] |                | group by:tpch10.orders.o_custkey,                                                                  |
|   │     └─TableFullScan_27       | 15000000.00 | cop[tikv] | table:orders   | keep order:false                                                                                   |
|   └─TableReader_33(Probe)        | 1500000.00  | root      |                | data:TableFullScan_32                                                                              |
|     └─TableFullScan_32           | 1500000.00  | cop[tikv] | table:customer | keep order:false                                                                                   |
+----------------------------------+-------------+-----------+----------------+----------------------------------------------------------------------------------------------------+
8 rows in set (0.00 sec)

If b.col1 has few duplicate values, this deduplication gains very little while the aggregation cost is high; I believe directly using a semi-join would be more efficient.
Why does TiDB always aggregate and deduplicate table b? Because of this variable:

tidb_opt_insubq_to_join_and_agg

  • Scope: SESSION | GLOBAL
  • Persisted to cluster: Yes
  • Type: Boolean
  • Default value: ON
  • This variable is used to set whether to enable the optimization rule: converting subqueries to join and aggregation.
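
You can confirm the current value first, for example:

mysql> show variables like 'tidb_opt_insubq_to_join_and_agg';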

When we turn off this parameter, we can see that it can use a semi-join:

mysql> set tidb_opt_insubq_to_join_and_agg=OFF;
Query OK, 0 rows affected (0.00 sec)

mysql> explain select count(*) from customer where c_custkey in (select o_custkey from orders);
+-------------------------------+-------------+-----------+----------------+---------------------------------------------------------------------------+
| id                            | estRows     | task      | access object  | operator info                                                             |
+-------------------------------+-------------+-----------+----------------+---------------------------------------------------------------------------+
| StreamAgg_10                  | 1.00        | root      |                | funcs:count(1)->Column#18                                                 |
| └─HashJoin_16                 | 1200000.00  | root      |                | semi join, equal:[eq(tpch10.customer.c_custkey, tpch10.orders.o_custkey)] |
|   ├─TableReader_15(Build)     | 15000000.00 | root      |                | data:TableFullScan_14                                                     |
|   │ └─TableFullScan_14        | 15000000.00 | cop[tikv] | table:orders   | keep order:false                                                          |
|   └─TableReader_13(Probe)     | 1500000.00  | root      |                | data:TableFullScan_12                                                     |
|     └─TableFullScan_12        | 1500000.00  | cop[tikv] | table:customer | keep order:false                                                          |
+-------------------------------+-------------+-----------+----------------+---------------------------------------------------------------------------+
6 rows in set (0.00 sec)

The question here is: why does the optimizer deduplicate by default?

  1. If the subquery column has very few duplicate values, deduplication achieves little and the aggregation cost is high.
  2. If the subquery column has many duplicate values, deduplication does have some value, but a semi-join stops probing a row as soon as it finds a match (early out), so the semi-join should not be too slow either. TiDB controls this rewrite with the tidb_opt_insubq_to_join_and_agg variable (I hope it becomes adaptive in the future), which by default deduplicates the IN-subquery column every time. I suspected that TiDB’s semi-join still degrades when the subquery column has many duplicates, so verifying that became the main point of this post.
  3. To avoid introducing multi-table plans, only the customer table from tpch10 is used for the subquery test:
-- TiDB does not rewrite such self-join subqueries into single-table queries, so a single table is enough for the test
select count(*) from customer a where a.C_NATIONKEY in (select C_NATIONKEY from customer b);
  4. To avoid the extra effect of the tidb_opt_insubq_to_join_and_agg variable deduplicating the IN subquery by default, and to focus on the semi-join itself, I rewrote the statement above into an EXISTS form so that it uses a semi-join directly (this way the variable would not have needed to come up in the post, but it is explained here anyway):
mysql> explain select count(*) from customer a where exists (select 1 from customer b where a.C_NATIONKEY=b.C_NATIONKEY);
+-------------------------------+------------+-----------+---------------+---------------------------------------------------------------------------------+
| id                            | estRows    | task      | access object | operator info                                                                   |
+-------------------------------+------------+-----------+---------------+---------------------------------------------------------------------------------+
| StreamAgg_11                  | 1.00       | root      |               | funcs:count(1)->Column#18                                                       |
| └─HashJoin_17                 | 1200000.00 | root      |               | semi join, equal:[eq(tpch10.customer.c_nationkey, tpch10.customer.c_nationkey)] |
|   ├─TableReader_16(Build)     | 1500000.00 | root      |               | data:TableFullScan_15                                                           |
|   │ └─TableFullScan_15        | 1500000.00 | cop[tikv] | table:b       | keep order:false                                                                |
|   └─TableReader_14(Probe)     | 1500000.00 | root      |               | data:TableFullScan_13                                                           |
|     └─TableFullScan_13        | 1500000.00 | cop[tikv] | table:a       | keep order:false                                                                |
+-------------------------------+------------+-----------+---------------+---------------------------------------------------------------------------------+
6 rows in set (0.00 sec)

That is why I made this post: it may look like it has no business relevance, but the intent is to focus on the behavior of the operator itself.
At the application level there are many possible optimizations, such as deduplicating a column that has many duplicates (see the sketch below) or adding indexes, but I hope the product itself gains stronger safety-net capabilities, because not every developer understands performance tuning.
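
As an illustration of the deduplication idea, here is an untested rewrite sketch against the same table (a sketch of the workaround, not the product-level fix I am hoping for): deduplicating the build side by hand should shrink the semi-join hash table from 1,500,000 rows to the ~25 distinct C_NATIONKEY values.

-- Untested workaround sketch: deduplicate the build side manually so the
-- semi-join hash table holds ~25 keys instead of 1,500,000 rows.
select count(*)
from customer a
where exists (
    select 1
    from (select distinct C_NATIONKEY from customer) b
    where a.C_NATIONKEY = b.C_NATIONKEY
);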

| username: 托马斯滑板鞋 | Original post link

May I ask how big the dataset is? I want to try it on other databases as well :upside_down_face:

| username: 人如其名 | Original post link

TPCH SF=10
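
(For anyone wanting to reproduce it: the data can presumably be generated with tiup bench; the flag names below are from my memory of current tiup, so treat them as an assumption. The database name here was tpch10.)

tiup bench tpch prepare --sf 10 --db tpch10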

| username: Roger_Song | Original post link

We are tracking the semi-join execution efficiency issue at: execution: semi join takes too long to run · Issue #47424 · pingcap/tidb · GitHub.

| username: Roger_Song | Original post link

This transformation should be cost-based. Cost-based logical optimization is not supported yet; automatic selection will be supported in the future cascades framework.

| username: 人如其名 | Original post link

So currently it is rule-based? Was “cost-based” a slip of the tongue?

| username: zanmato | Original post link

Detailed work, bro :+1:

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.