KILL does not take effect while non-parallel hashAgg is reading spilled data back from disk

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 非并行hashAgg数据落盘读取阶段kill当前语句不生效

| username: 人如其名

Bug Report
[TiDB Version] v6.5.0; the issue actually exists in all versions below 7.0
[Impact of the Bug] While hashAgg is reading spilled data back from disk, the session cannot be killed promptly with the kill command.

[Possible Steps to Reproduce]
set tidb_mem_quota_query='1G';
set tidb_enable_tmp_storage_on_oom=ON;
set tidb_hashagg_final_concurrency=1;
set tidb_hashagg_partial_concurrency=1;
set tidb_isolation_read_engines='tikv';
Run the following against the orders table of a TPC-H 100 dataset:
explain analyze select o_custkey, sum(o_totalprice) from orders group by o_custkey order by sum(o_totalprice) desc limit 10;
Once a large amount of data has spilled to disk, kill this statement from another session.
[Observed Unexpected Behavior]
The connection running the statement is not killed.
[Expected Behavior]
It should be killed quickly.

Issue Analysis:
When there is enough memory, hashAgg checks for a kill request each time it fetches data from its child operators through the Next call. Once hashAgg has spilled data to disk and is reading it back, however, the kill does not take effect, because the listInDisk.GetChunk method has no check for a killed session.

Could a kill check be added before (or after) each listInDisk.GetChunk call?

| username: 人如其名

Adding the gray-highlighted part of the code should solve it, right?

| username: 人如其名

The description above is inaccurate; it should read "will not be killed for a period of time." That period covers reading a batch of spilled data from disk into memory, completing the result-set computation (e.prepared), and returning the result in chk to the parent operator. When the parent operator then calls the generic Next() function, the kill check embedded there terminates the current session. If a lot of data has spilled to disk, the first batch is the largest, the second is smaller, and the last may be very small. So a kill issued while the first spilled batch is being processed may take a long time to take effect, one issued during the second batch takes less time, and one issued during the last batch may take effect almost immediately.
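
The effect of shrinking batches can be put in rough numbers. The sketch below is a toy model, not a measurement: it assumes that a kill is only noticed at the boundary between spilled batches, and the batch sizes are made up for illustration. `rowsBeforeKillObserved` is an illustrative helper, not a TiDB function.

```go
package main

import "fmt"

// rowsBeforeKillObserved returns how many more rows are processed after a
// kill request arrives, assuming the kill is only noticed once the batch
// currently being processed is finished. killAfterRows is the total number
// of rows already processed when the kill arrives.
func rowsBeforeKillObserved(batches []int, killAfterRows int) int {
	end := 0 // cumulative row count at the end of the current batch
	for _, n := range batches {
		end += n
		if killAfterRows <= end {
			return end - killAfterRows
		}
	}
	// The kill arrived after all batches were already processed.
	return 0
}

func main() {
	// Hypothetical batch sizes: largest first, then progressively smaller.
	batches := []int{1000, 100, 10}
	for _, k := range []int{1, 1050, 1105} {
		fmt.Printf("kill after %d rows: %d more rows processed before it is seen\n",
			k, rowsBeforeKillObserved(batches, k))
	}
}
```

In this model, a kill that lands early in the first batch waits for almost the whole batch, while one that lands in the last batch waits for only a handful of rows.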

| username: aytrack

Your analysis is correct. The issue is tracked in pingcap/tidb GitHub issue #43741: "Kill TiDB does not take effect in time during non-parallel hashAgg read data from spill data".

| username: 人如其名

Similarly, do hashJoin and sort have the same problem when they spill to disk, and should they be fixed together?
