[Overview] Scenario + Problem Overview
I have a SQL query that joins multiple tables and runs relatively slowly. I ran KILL TIDB QUERY 645; to kill it, but after the command completed the session still shows as "in transaction". This happens every time. Has it become a zombie session? Is there any way to remove it?
Was the kill command executed on the corresponding host? KILL TIDB has to be run on the TiDB instance where the query is actually running.
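To find out which tidb-server owns a session before killing it, something like the following may help. This is only a sketch: it assumes a TiDB version that exposes INFORMATION_SCHEMA.CLUSTER_PROCESSLIST (4.0 and later), the 60-second threshold is arbitrary, and the ID 645 is simply the one from the question above.

```sql
-- List long-running sessions across every tidb-server in the cluster.
-- The INSTANCE column shows which tidb-server owns each session.
SELECT INSTANCE, ID, USER, DB, COMMAND, TIME, STATE, INFO
FROM INFORMATION_SCHEMA.CLUSTER_PROCESSLIST
WHERE TIME > 60            -- arbitrary threshold, in seconds
ORDER BY TIME DESC;

-- Then connect directly to the tidb-server shown in INSTANCE
-- (not through a load balancer) and run one of:
KILL TIDB QUERY 645;       -- stop the running statement, keep the connection
KILL TIDB 645;             -- terminate the whole connection
```

A plain SHOW PROCESSLIST only lists sessions on the instance you are connected to, which is why the cluster-wide table is used for the lookup.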
I recently tested this issue. In TiDB 5.3.2, each run of this type of SQL (most likely multi-table joins with a large LIMIT m, n) leaves behind a zombie session whose elapsed time keeps increasing and which cannot be killed. In TiDB 5.4.2, the zombie session still appears but can be killed. In TiDB 6.1.1, the zombie session did not appear in my tests.
I also ran into this on version 5.4. Neither KILL TIDB with the ID (transaction ID) nor KILL TIDB with the session_id worked, even when run on the TiDB host that owns the session. Do I have to upgrade to 6.1?
Our production version is 5.3.2. We are currently considering how to address this issue and whether we need to upgrade. As a temporary workaround, adding a hint to the affected SQL so that it uses an index join instead of an index hash join bypasses the bug. We have tested this and it works.
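The hint workaround might look roughly like this. The tables orders and customers and their columns are hypothetical stand-ins for the real query; INL_JOIN is TiDB's optimizer hint for a plain index nested loop join, used here on the assumption that the original plan chose IndexHashJoin.

```sql
-- Hypothetical tables: orders(id, customer_id, ...) and customers(id, name, ...).
-- INL_JOIN(c) asks the optimizer to use an index nested loop join with
-- c as the inner table, steering it away from the IndexHashJoin operator.
EXPLAIN
SELECT /*+ INL_JOIN(c) */ o.id, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
ORDER BY o.id
LIMIT 100000, 20;
```

If the EXPLAIN output shows IndexJoin rather than IndexHashJoin, the hint has taken effect and the EXPLAIN keyword can be dropped to run the query itself.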
This issue is similar to "indexHashJoin hang in handleTask" (Issue #35638 in pingcap/tidb on GitHub); you can refer to it. However, this has already been fixed in version 5.4 and cannot be reproduced anymore.
Actually, it has already been killed; it just still shows up in the process list. To clear it, restart the tidb-server, which doesn't have much impact. To confirm whether it was really killed, check the TiDB logs: a "kill" entry there means the kill was initiated.
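If grepping the log files on each host is inconvenient, the logs can also be searched from SQL. This is a sketch that assumes the INFORMATION_SCHEMA.CLUSTER_LOG table is available (TiDB 4.0 and later); the time window is a placeholder, and the '%kill%' pattern is a guess at how the entry is worded, so it may need adjusting.

```sql
-- Search tidb-server logs on all instances for kill-related entries.
-- Always restrict the time range: CLUSTER_LOG scans the log files on each node.
SELECT TIME, INSTANCE, LEVEL, MESSAGE
FROM INFORMATION_SCHEMA.CLUSTER_LOG
WHERE TYPE = 'tidb'
  AND TIME > '2022-09-01 10:00:00'   -- placeholder: just before the KILL was issued
  AND TIME < '2022-09-01 11:00:00'   -- placeholder: shortly after
  AND MESSAGE LIKE '%kill%';
```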
If they can accept this bug, then so be it.
Sometimes when we need to upgrade versions, we also encounter various obstacles.
Write up a report on the cause of the problem and the solution, send it to them, and let them make the decision.