Transaction still exists after using KILL TIDB QUERY to terminate the query

translator_bot · June 23, 2024, 2:22am

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 查询事务用kill tidb query之后还存在

| username: Johnpan

[TiDB Usage Environment]
tidb v5.4.0

[Overview] Scenario + Problem Overview
I have a relatively slow SQL query that joins multiple tables. I ran kill tidb query 645; to terminate it, but afterwards the session still shows the “in transaction” state. This happens every time. Has it become a zombie process? Is there any way to remove it?

| username: db_user

Did you run the kill command on the TiDB instance that is handling the query? The kill tidb command has to be executed on the same TiDB server where the query is running.

| username: BraveChen

Anyone who knows to come and ask here wouldn’t make that kind of mistake, haha.

| username: BraveChen

I recently tested this issue. In TiDB 5.3.2, for this type of SQL (most likely multi-table joins with a large limit m, n), every query leaves behind a zombie session whose time keeps increasing and which cannot be killed. In TiDB 5.4.2, the zombie session still appears but can be killed. In my tests on TiDB 6.1.1, the zombie session does not appear.

| username: CAICAI

I also ran into this on version 5.4. Neither kill tidb with the ID (transaction ID) nor kill tidb with the session_id worked, even when run on the corresponding TiDB host. Do I have to upgrade to 6.1?

| username: forever

Judging by previous posts, the answer has always been to restart the TiDB server.

| username: BraveChen

Our production version is 5.3.2. We are currently considering how to address this issue and whether we need to upgrade. A temporary workaround is to add a hint to the affected SQL so that it uses an index join instead of an index hash join, which bypasses this bug. We have tested this and it works.

| username: BraveChen

Restarting does clear the zombie sessions, but you can’t restart every time this happens.

| username: buddyyuan

You encountered this issue:

github.com/pingcap/tidb

server: a better way to handle killed connection (#32809)

pingcap:release-5.4 ← ti-srebot:release-5.4-403dcfd32d84

opened 02:35AM - 15 Sep 22 UTC

ti-srebot

+15 -9

cherry-pick #32809 to release-5.4

You can switch your code base to this Pull Request by using [git-extras](https://github.com/tj/git-extras):

```bash
# In tidb repo:
git pr https://github.com/pingcap/tidb/pull/37834
```

After apply modifications, you can push your change to this PR via:

```bash
git push git@github.com:ti-srebot/tidb.git pr/37834:release-5.4-403dcfd32d84
```

---

### What problem does this PR solve?

Issue Number: close #24031, this PR also reverts #29212.

Problem Summary:

### What is changed and how it works?

The root cause of #24031 is that when a connection is idle, the goroutine is blocked at:

https://github.com/pingcap/tidb/blob/4a0d387e1ff1b508bbb60d484d97e4ac2a5ef2c7/server/conn.go#L1068

And the stack:

```
# 0x13fc361 bufio.(*Reader).Read+0x221 /home/bb7133/Softwares/go/src/bufio/bufio.go:227
# 0x34c8fba github.com/pingcap/tidb/server.bufferedReadConn.Read+0x5a /home/bb7133/Projects/gopath/src/github.com/pingcap/tidb/server/buffered_read_conn.go:31
# 0x1342886 io.ReadAtLeast+0x86 /home/bb7133/Softwares/go/src/io/io.go:328
# 0x34a96e4 io.ReadFull+0x84 /home/bb7133/Softwares/go/src/io/io.go:347
# 0x34a96ab github.com/pingcap/tidb/server.(*packetIO).readOnePacket+0x4b /home/bb7133/Projects/gopath/src/github.com/pingcap/tidb/server/packetio.go:86
# 0x34a9aee github.com/pingcap/tidb/server.(*packetIO).readPacket+0x4e /home/bb7133/Projects/gopath/src/github.com/pingcap/tidb/server/packetio.go:117
# 0x3479624 github.com/pingcap/tidb/server.(*clientConn).readPacket+0x1e4 /home/bb7133/Projects/gopath/src/github.com/pingcap/tidb/server/conn.go:397
# 0x34795ea github.com/pingcap/tidb/server.(*clientConn).Run+0x1aa /home/bb7133/Projects/gopath/src/github.com/pingcap/tidb/server/conn.go:1068
# 0x34b2d1d github.com/pingcap/tidb/server.(*Server).onConn+0x12bd /home/bb7133/Projects/gopath/src/github.com/pingcap/tidb/server/server.go:554
```

Because of that, the goroutine is not able to deal with the `KILLED` flag, release the resource it is holding and stop itself immediately.

In order to solve that, we need to make `conn.Read()` *interruptable* but there is no straightforward way in Go to do that. Some references:

1) https://github.com/golang/go/issues/20280: a lot of discussions/arguments without a clear conclusion.
2) [Canceling I/O in Go Cap’n Proto](https://medium.com/@zombiezen/canceling-i-o-in-go-capn-proto-5ae8c09c5b29): mentioned in `go/issues/20280`
3) https://github.com/google/mtail/pull/497: a context-based implementation for canceling the `Read()`

For the approach introduced in 2 and 3, they are generally the same as this PR: setting `SetReadDeadline` in another goroutine. I cannot find any material describing if doing so is thread-safe, so it should be implementation-dependent and might not be safe, but it might not be a real problem considering we're about the kill the connection and the read buffer/status will be abandoned.

### Alternatives

* `SHOW PROCESSLIST` (and infoschema) is modified to show the State as `Killed`, as mentioned by @morgo in https://github.com/pingcap/tidb/issues/24031#issuecomment-893095960, the result of `SHOW PROCESSLIST` can be clear to the user but it doesn't solve the delayed 'release lock' issue(see `Case 2` in 'Manual test' part).
* Instead of setting the read timeout to `waitTimeout`, the code is instead modified to have a hard coded `2s` timeout, but loops for up to `waitTimeout` retrying a read..., also mentioned by @morgo in https://github.com/pingcap/tidb/issues/24031#issuecomment-893095960, the potential thread-safe concern can be avoided but we still have at most `2s` delay for killing an idle connection and the code would be complicated.
* Instead of setting `SetReadDeadline()`, `bufReadConn.Close()` can be another solution. It is basically the same with `SetReadDeadline()` IMHO.

### Check List

Tests

- [ ] Unit test
- [ ] Integration test
- [x] Manual test (add detailed scripts or steps below)
- [ ] No code

Case1:

```
Session1> (in idle state, with PROCESS_ID=3)
Session2> KILL TIDB 3;
Session2> SHOW PROCESSLIST; (Can be confirmed that Session1 is killed)
```

Case2:

```
Session1> CREATE TABLE t1(a INT);
Session1> INSERT INTO t1 values (1);
Session1> BEGIN PESSIMISTIC;
Session1> SELECT * FROM t1 WHERE a=1 FOR UPDATE;
Session1> (in idle state, with PROCESS_ID=3)
Session2> BEGIN PESSIMISTIC;
Session2> SELECT * FROM t1 WHERE a=1 FOR UPDATE; (Session 2 is blocked and waiting for the lock)
Session3> KILL TIDB 3; (Can be confirmed that Session 1 is killed and Session2 is able to acquire the lock immediately).
```

Side effects

- None

### Release note

```release-note
fix the issue that `KILL TIDB` doesn't take effect immediately on idle connections
```

It should be fixed in version 5.4.1.

| username: BraveChen

This issue is similar to [indexHashJoin hang in handleTask · Issue #35638 · pingcap/tidb (github.com)], which you can refer to. However, that one has already been fixed in version 5.4 and can no longer be reproduced.

| username: buddyyuan

Actually, it has already been killed; it just still shows up here. To clear it, restart the tidb-server; it doesn’t have much impact. To check whether it has really been killed, look at the tidb logs: a “kill” entry there means the kill has been initiated.

| username: Johnpan

Okay, let me take a look.

| username: Johnpan

Yes, I just re-executed kill tidb 645; and it worked.

| username: 张雨齐0720

On older versions, this issue requires a restart every time. You may want to consider upgrading.

| username: Johnpan

Well, I also want to upgrade, but the architect doesn’t agree.

| username: 张雨齐0720

If they can live with this bug, then so be it.
We also run into all kinds of obstacles when we need to upgrade versions.
Write up a report on the cause of the problem and the solution, send it to them, and let them make the decision.
