SQL query results are returned, but the session remains in a running state

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: sql 结果查询出来,但是一直处于运行状态

| username: CuteRay

[TiDB Usage Environment] Production Environment
[TiDB Version] v5.3.2
[Encountered Problem] A very strange phenomenon: when a particular SQL statement is run against the database, the result is returned, but the session stays in the query state, and killing it by its TiDB connection ID has no effect. Only this one SQL statement behaves this way.

In cluster_processlist, the query time of this session keeps increasing even though the result has already been returned:
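(For reference, a minimal sketch of how such a session can be located across all TiDB instances; the LIKE filter below is only a placeholder and should be matched to the actual statement text:)

-- Placeholder filter: match it to the stuck statement's text or digest.
SELECT INSTANCE, ID, USER, DB, TIME, STATE, MEM, INFO
FROM INFORMATION_SCHEMA.CLUSTER_PROCESSLIST
WHERE INFO LIKE '%receive_address%'
ORDER BY TIME DESC;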

When the SQL is executed through the MySQL client, the same thing happens: execution completes and the result is returned, but the session never goes away. Even closing the MySQL client does not terminate it, and logging into the corresponding TiDB node to kill the session also fails. It was found that there is a bug in the KILL command before v5.4, so the only workaround for now is to restart the affected tidb-server.
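(For reference, a minimal sketch of the kill attempt; 1234 is a placeholder for the ID shown in CLUSTER_PROCESSLIST, and before global kill was introduced the statement generally has to be issued on the tidb-server instance that owns the connection:)

-- 1234 is a placeholder connection ID taken from CLUSTER_PROCESSLIST.ID;
-- in this case the session was still not terminated.
KILL TIDB 1234;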

SQL:

select *
from (SELECT
      wde.weight_time,
      wde.review_end_time,
      wt.sign_ask,
      wdst.goodcount,
      wd.pack_no,
      wde.handover_time,
      CONCAT(
        IFNULL(service_province, ""),
        IFNULL(service_city, ""),
        IFNULL(service_district, ""),
        IFNULL(service_street, "")
      ) receive_address,
      wt.trans_ask
      FROM wd
               left JOIN wt
                         ON wd.do_id = wt.order_id
               left JOIN wde
                         ON wd.do_id = wde.do_id
               left JOIN wdst ON wd.do_id = wdst.orderid
      WHERE 1 = 1
      AND house_id = '3401001'
      AND partner_id = 'test'
      AND create_time >= '2022-09-01 00:00:00.0'
      AND create_time <= '2022-09-12 00:00:00.0'
      AND cust_id_list IN (
        '12345678',
        '22345678',
        '32345678',
        '42345678'
      )) t
order by create_time desc
limit 80000, 200;

Currently, rewriting the SQL resolves the issue:

SELECT *
FROM (
    SELECT
      wde.weight_time,
      wde.review_end_time,
      wt.sign_ask,
      wdst.goodcount,
      wd.pack_no,
      wde.handover_time,
      CONCAT(
        IFNULL(service_province, ""),
        IFNULL(service_city, ""),
        IFNULL(service_district, ""),
        IFNULL(service_street, "")
      ) receive_address,
      wt.trans_ask
    FROM (
        SELECT *
        FROM wd
        WHERE 1 = 1
          AND house_id = '3401001'
          AND partner_id = 'test'
          AND create_time >= '2022-09-01 00:00:00.0'
          AND create_time <= '2022-09-12 00:00:00.0'
          AND cust_id_list IN (
            '12345678',
            '22345678',
            '32345678',
            '42345678'
          )
        ORDER BY create_time DESC
        LIMIT 80000, 200
      ) wd
      LEFT JOIN wt ON wd.do_id = wt.order_id
      LEFT JOIN wde ON wd.do_id = wde.do_id
      LEFT JOIN wdst ON wd.do_id = wdst.orderid
  ) t;
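(A hedged aside: EXPLAIN shows the optimizer's plan without executing the statement, so it can be used to compare the two forms and confirm that in the rewritten version the filter, sort, and LIMIT are applied before the LEFT JOINs. A sketch for the rewritten form, with the column list abbreviated to *:)

EXPLAIN
SELECT *
FROM (
    SELECT * FROM wd
    WHERE house_id = '3401001'
      AND partner_id = 'test'
      AND create_time >= '2022-09-01 00:00:00.0'
      AND create_time <= '2022-09-12 00:00:00.0'
      AND cust_id_list IN ('12345678', '22345678', '32345678', '42345678')
    ORDER BY create_time DESC
    LIMIT 80000, 200
  ) wd
  LEFT JOIN wt ON wd.do_id = wt.order_id
  LEFT JOIN wde ON wd.do_id = wde.do_id
  LEFT JOIN wdst ON wd.do_id = wdst.orderid;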

So, I would like to know the cause of this phenomenon.

Attached Clinic diagnostic data:

| username: CuteRay | Original post link

Due to business constraints, rewriting this SQL would require too many changes to the application code, so the rewrite is not feasible for now. Does anyone have a good solution, or know the cause of this phenomenon?

| username: CuteRay | Original post link

Setting the global max_execution_time=3600000 for the cluster also has no effect.
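(For reference, a minimal sketch of how that timeout would be applied and verified; 3600000 ms is one hour. It is shown only to document the attempt, since in this case it did not stop the stuck session:)

SET GLOBAL max_execution_time = 3600000;            -- value is in milliseconds
SHOW GLOBAL VARIABLES LIKE 'max_execution_time';    -- confirm the new global value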

| username: CuteRay | Original post link

cpu.profile (91.1 KB)
heap.profile (3.0 MB)
goroutine (12.9 KB)

| username: 人如其名 | Original post link

The goroutine file won't open.

| username: tidb狂热爱好者 | Original post link

The main issue is here: CONCAT(IFNULL(service_province, ""), IFNULL(service_city, ""), IFNULL(service_district, ""), IFNULL(service_street, "")) receive_address. Check your row count; if it exceeds 1000 rows, this expression will consume TiKV CPU for the computation, and one TiKV node will sit at 100%.

| username: CuteRay | Original post link

I don't think so. Look at the SQL I rewrote; it still contains this expression. I also tried removing it, and the behavior is exactly the same: the result comes back, but the query keeps running.

| username: tidb狂热爱好者 | Original post link

Take a look at how many rows are in the result.
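(A minimal sketch of that check, assuming the filter columns all live on wd as in the statements above; it counts the rows that survive the WHERE clause before the joins and the LIMIT 80000, 200 window are applied:)

SELECT COUNT(*)
FROM wd
WHERE house_id = '3401001'
  AND partner_id = 'test'
  AND create_time >= '2022-09-01 00:00:00.0'
  AND create_time <= '2022-09-12 00:00:00.0'
  AND cust_id_list IN ('12345678', '22345678', '32345678', '42345678');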

| username: tidb狂热爱好者 | Original post link

This means that sorting is very time-consuming.

| username: CuteRay | Original post link

It's the same: as you can see, the rewritten SQL also has LIMIT 80000, 200.

| username: CuteRay | Original post link

[The original reply was an image; its content was not preserved in the translation.]

| username: tidb狂热爱好者 | Original post link

200 rows is not much. Is it possible that the code never commits the transaction?

| username: CuteRay | Original post link

autocommit
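(For what it's worth, a minimal way to confirm that from the application's connection:)

SELECT @@autocommit;               -- 1 means autocommit is enabled for the session
SHOW VARIABLES LIKE 'autocommit';  -- should report ON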

| username: 特雷西-迈克-格雷迪 | Original post link

I have encountered this as well. It looks like a bug: the session can't be killed (even when located via CLUSTER_PROCESSLIST), and it holds on to memory without releasing it. In my case it consumed a large amount of memory.

| username: CuteRay | Original post link

Mine is the same. I'm planning to debug it and see if I can reproduce the issue.

| username: 特雷西-迈克-格雷迪 | Original post link

You can share the results here, haha.

| username: hey-hoho | Original post link

Similar issue: indexHashJoin hang in handleTask · Issue #35638 · pingcap/tidb · GitHub