TiKV Transaction Rollback Takes Too Long

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv事务回滚耗时过长

| username: vincent

[TiDB Usage Environment] Production Environment
[TiDB Version] v5.4.0
[Encountered Issues: Problem Phenomenon and Impact]
Usage Scenario: In our production environment we use TiKV directly as a distributed storage engine, and all access goes through its transaction mechanism.
Cluster Topology: The cluster has 13 TiKV hosts in total, each host runs 24 tikv-server instances, and the replica isolation level is host.
Machine Configuration: Each machine has 64 cores and 256 GB of memory; each host has 24 SSD disks, and each tikv-server instance uses a separate SSD.
TiKV Data: 88TB

Problem Phenomenon: One night a TiKV host crashed, affecting 25% of requests. The client has retry behavior, but because of TiKV transaction rollbacks it took 18 minutes for the request success rate to fully recover. During the failure, read QPS was 45k and write QPS was 800.
Proportion of affected put requests:


Transaction Rollback:

Questions:

  1. Why does the crash of a single TiKV host affect 25% of requests? Could the extent of the impact be due to a significant imbalance in TiKV itself?
  2. When a TiKV node crashes, it affects 24 TiKV-server instances. Why does the transaction rollback mechanism take so long?
  3. Is there any way to speed up the transaction rollback?
| username: vincent | Original post link

I hope the experts in the community can take a look, thank you :handshake:

| username: h5n1 | Original post link

  1. The number of failed TiKV nodes cannot be mapped directly to the proportion of business traffic affected.
  2. In theory, a failed TiKV node does not affect TiDB’s transaction rollback; there is no concept of rollback on the TiKV side. When a TiKV host goes down, its Region leaders are transferred, and during that period TiDB’s access to TiKV goes through retry and backoff. You can check the leader panels under Overview → TiKV in the monitoring.
| username: vincent | Original post link

The leaders did transfer, but the service did not recover immediately; instead, it recovered gradually.
Leader transfer:
You can see that before the machine restarted, the number of leaders in the faulty TiKV store had already become 0, indicating that the leaders were transferred away.


Additionally, this is the scheduler-rollback monitoring I saw during the fault period, which coincides with the fault impact time:

| username: h5n1 | Original post link

This is normal behavior. The way to adjust the leader transfer speed is to tune the store limit and the various schedule limits; you can check them with pd-ctl store limit and pd-ctl config show respectively.
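
For reference, a rough sketch of checking and adjusting these with pd-ctl (assuming pd-ctl is available and PD is reachable at 127.0.0.1:2379; the values are only illustrative and should be validated for your cluster before use):

```shell
# Check the current per-store scheduling limits
pd-ctl -u http://127.0.0.1:2379 store limit

# Check the scheduling-related configuration (leader-schedule-limit, region-schedule-limit, ...)
pd-ctl -u http://127.0.0.1:2379 config show

# Illustrative adjustments: raise the per-store limit and the leader schedule limit
pd-ctl -u http://127.0.0.1:2379 store limit all 30
pd-ctl -u http://127.0.0.1:2379 config set leader-schedule-limit 8
```

Raising these speeds up scheduling at the cost of extra load on the surviving stores, so it is worth rehearsing first.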

| username: vincent | Original post link

So the impact window was extended because requests kept coming in while the leaders were being rebalanced.

| username: h5n1 | Original post link

Yes. TiDB has a region cache that still records the downed TiKV as the leader, so when TiDB accesses that store an error is returned, and TiDB uses the information returned in the error to retry the access.

| username: h5n1 | Original post link

So the situation now is: after one TiKV host went down, it caused errors in your frontend business, affecting about 25% of requests? The scheduler-rollback panel shows the time consumed by the transaction rollback command in TiKV.

| username: vincent | Original post link

Yes, about 25% of the business reported errors, and it took 18 minutes to fully recover. It was caused by a machine reboot. We think this impact duration is too long and the scope of the impact is quite large.

| username: zhanggame1 | Original post link

I don’t quite understand such a complex architecture either. I see that there are 24 TiKV instances on one machine; isn’t that too many, and could it be slowing down recovery? Also, what is the load on the TiKV nodes like? Is the CPU or IO particularly high?

I don’t quite understand why there are so many TiKV instances with only one SSD. Wouldn’t there be IO issues?

| username: redgame | Original post link

Try tuning the TiKV configuration to speed up recovery, for example by adjusting relevant parameters (such as raft-store-max-peer-count, raft-apply-pool-size, etc.). Ensuring sufficient network connectivity and bandwidth between nodes is also crucial.
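
For reference, the raftstore thread-pool settings live in the TiKV configuration file; the key names below are from the TiKV configuration reference (raft-apply-pool-size presumably corresponds to raftstore.apply-pool-size), and the values are purely illustrative and should be benchmarked before use:

```toml
# tikv.toml -- illustrative values only
[raftstore]
store-pool-size = 4   # threads that process Raft messages (the raftstore pool)
apply-pool-size = 4   # threads that apply committed Raft logs to RocksDB (the apply pool)
```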

| username: vincent | Original post link

There are 24 SSD disks on one machine, and each disk serves as the data partition for one TiKV store. The CPU load is manageable: with 64 cores the average load is around 30. The overall disk IO utilization across the cluster is around 70%.

| username: vincent | Original post link

Well, we will rehearse this adjustment, because we can’t reproduce the current situation directly: the test cluster needs a certain amount of data and a comparable scale.

In our use case we use TiKV transactions, and a single transaction contains multiple operations, for example first getting a key, then putting, and then getting again. When a node crashes, many in-flight transactions are expected to fail.
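
Roughly, the shape of such a transaction with client-go v2 looks like the sketch below (a minimal illustration only; the PD address, keys, and error handling are placeholders):

```go
package main

import (
	"context"
	"fmt"

	"github.com/tikv/client-go/v2/txnkv"
)

func main() {
	// Connect through PD (placeholder address).
	client, err := txnkv.NewClient([]string{"127.0.0.1:2379"})
	if err != nil {
		panic(err)
	}
	defer client.Close()

	ctx := context.Background()

	// Begin a transaction; this returns a *transaction.KVTxn.
	txn, err := client.Begin()
	if err != nil {
		panic(err)
	}

	// get -> put -> get, all inside one transaction.
	v, err := txn.Get(ctx, []byte("key-a")) // returns an error if the key does not exist
	if err != nil {
		_ = txn.Rollback()
		panic(err)
	}
	if err := txn.Set([]byte("key-b"), v); err != nil {
		_ = txn.Rollback()
		panic(err)
	}
	if _, err := txn.Get(ctx, []byte("key-b")); err != nil {
		_ = txn.Rollback()
		panic(err)
	}

	// If a TiKV node dies mid-transaction, the error usually surfaces on one of
	// the reads above or on Commit, after the client's internal retries and backoff.
	if err := txn.Commit(ctx); err != nil {
		fmt.Println("commit failed:", err)
	}
}
```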

We will verify values like raft-store-max-peer-count and raft-apply-pool-size during our rehearsal.

| username: h5n1 | Original post link

Are the black lines in the leader panel a bunch of TiKV instances stacked on top of each other? There was no data in the monitoring before that point; it looks like there was already a problem earlier, then the monitoring data reappeared, climbed up, and then dropped again when the node restarted. Can you check whether there is data before the black lines?
Also check the CPU utilization and IO of the TiKV nodes during the problematic time period.

| username: vincent | Original post link

Yes, the black lines here represent the 24 TiKV stores on this machine.
The missing data in the earlier monitoring is due to the machine failure, which lasted about 30 seconds before the machine crashed outright and restarted.
The CPU load is as follows:


Disk IOPS:

Disk utilization:

CPU usage:

| username: h5n1 | Original post link

Does the CPU and IO load above include other nodes?

| username: vincent | Original post link

No, it only includes this node.

| username: h5n1 | Original post link

You need to look at the other nodes, since this one has already crashed. Insufficient resources on the healthy nodes may affect leader election, although according to the official documentation the maximum election timeout is only about 20 seconds. In addition, the silent region (hibernate region) feature, which is enabled by default, only heartbeats or checks about every 5 minutes, which may also affect the election.
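
For reference, the silent-region behavior corresponds to TiKV’s hibernate-regions setting; assuming a v5.4 tikv.toml, the related keys look roughly like this (the defaults shown are my understanding and should be verified against the configuration reference):

```toml
# tikv.toml -- silent/hibernate regions
[raftstore]
hibernate-regions = true                 # enabled by default: inactive Regions stop sending periodic Raft ticks
peer-stale-state-check-interval = "5m"   # hibernated peers re-check their state at roughly this interval
```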

| username: dba-kit | Original post link

Did you deploy in RawKV mode without deploying the tidb-server component?

| username: vincent | Original post link

We haven’t deployed TiDB.
We are using the Go client github.com/tikv/client-go/v2/tikv, specifically its KVTxn type, which corresponds to transaction.KVTxn.