After killing a TiKV node, IO is interrupted for more than 10 seconds

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: kill 一个 tikv 节点后 IO 中断 10s 以上

| username: TiDBer_Lm1H3bCW

To improve efficiency, please provide the following information. A clear problem description can help resolve the issue more quickly:

[Background]
We have built a distributed file system with JuiceFS + TiKV + Ceph, where the TiKV cluster stores the metadata. The cluster consists of 3 physical nodes, each running a TiKV instance and a PD instance.

[Phenomenon]
We ran read/write tests on the file system with vdbench. While all services are healthy, IO is normal. However, when we simulate a failure by killing the TiKV server on one node, IO is interrupted for about 15 seconds, which is too long for our business needs.

[TiDB Version]
v6.5.1

[Attachments] Relevant logs and monitoring
The go-client detects the TiKV failure immediately.

3-5 seconds later, the PD logs show Region leader changes. After a normal leader change the service should resume, so why does IO stay at 0?

| username: TiDBer_UUTlqVvZ | Original post link

Based on the information you provided, we can preliminarily determine that the issue might be related to the Region Leader switch after a TiKV node failure. When a TiKV node fails, PD detects the change in the node’s status and switches the Region Leader from that node to another node. However, since your business experienced an IO interruption, it indicates that there was a problem during the switch process.

One possible reason is that when the Region Leader switches to another node, the new Leader node needs to synchronize data from other nodes, which may take some time. If the data synchronization takes too long, it can lead to an IO interruption. Therefore, we recommend that you check the network status and load conditions of the TiKV cluster to ensure that the data synchronization speed is fast enough.

| username: TiDBer_Lm1H3bCW | Original post link

Thank you very much for your response.

Before conducting the fault simulation, all services were running stably for a period of time, so the data differences between them should not be significant.

Additionally, our understanding is that replication between the replicas of a Region is based on Raft. When one of the Raft peers fails, the most up-to-date remaining peer should become the leader, so it should not need to fetch data from other followers. Could you elaborate on the data synchronization you mentioned?

| username: pingyu | Original post link

Could you please conduct a similar test using go-ycsb under the same conditions?

The goal is to shorten the call chain and rule out any impact from juicefs.

Specifically:
tikv.type should be set to txn
tikv.batchsize should be set to 0 (to work around Rawkv can not recovery after one of the tikv-servers restart · Issue #522 · tikv/client-go · GitHub)

| username: pingyu | Original post link

A follow-up: after the test completes, please collect and upload the monitoring data from the test run.

Refer to 【SOP Series 22】TiDB Cluster Diagnostic Information Collection: Clinic Usage Guide & Resources on the TiDB community forum.

| username: TiDBer_Lm1H3bCW | Original post link

Thank you very much for your suggestion, and I apologize for not responding in a timely manner. Here is the feedback from our test results:

Testing with go-ycsb under normal cluster conditions:

nohup ./go-ycsb run tikv --interval 1 -P workloads/workloada -p tikv.pd="100.73.8.11:2379,100.73.8.12:2379,100.73.8.13:2379" -p tikv.type="txn" -p tikv.batchsize=0 > run_with_down_tikv 2>&1 &

[screenshot: go-ycsb output while the cluster is healthy]

After it has been running normally for a while, we kill the TiKV service on one node. At that point IO is interrupted: the counts of the various operations stop increasing.

It slowly recovers after about 15 seconds.

The diagnostic information was collected as follows:

tiup diag collect tikvtest -f="2023-05-11T16:13:42+08:00" -t="2023-05-11T16:23:23+08:00"

| username: pingyu | Original post link

The logs show that the TiKV process was killed at 16:14:55:

May 11, 2023 @ 16:14:55.810 [store.rs:2808] ["broadcasting unreachable"] [unreachable_store_id=1] [store_id=5]
May 11, 2023 @ 16:14:55.810 [store.rs:2808] ["broadcasting unreachable"] [unreachable_store_id=1] [store_id=4]

And the Raft re-elections only started after 16:15:05:

May 11, 2023 @ 16:15:05.557 [raft.rs:1550] ["starting a new election"] [term=11] [raft_id=123] [region_id=120]
May 11, 2023 @ 16:15:06.578 [raft.rs:1550] ["starting a new election"] [term=13] [raft_id=11] [region_id=8]
May 11, 2023 @ 16:15:06.582 [raft.rs:1550] ["starting a new election"] [term=11] [raft_id=79] [region_id=76]
May 11, 2023 @ 16:15:07.579 [raft.rs:1550] ["starting a new election"] [term=9] [raft_id=35] [region_id=32]
May 11, 2023 @ 16:15:07.580 [raft.rs:1550] ["starting a new election"] [term=10] [raft_id=43] [region_id=40]

The gap exceeds 10 seconds because TiKV's default Raft election timeout (how long followers wait without a leader heartbeat before starting a new election) is about 10 seconds.

To shorten the interruption, you can reduce this timeout accordingly. You can add the following to the TiKV configuration:

[raftstore]
raft-base-tick-interval = "1s"
raft-election-timeout-ticks = 3
raft-heartbeat-ticks = 1
pd-heartbeat-tick-interval = "5s"
raft-store-max-leader-lease = "2s"
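For a rough sense of the numbers, assuming the documented semantics (the election timeout is approximately raft-base-tick-interval × raft-election-timeout-ticks) and the documented defaults (raft-base-tick-interval = 1s, raft-election-timeout-ticks = 10):

    defaults:       1s × 10 ticks ≈ 10s, which matches the ~10s gap between the kill and the re-elections in the logs above
    settings above: 1s × 3 ticks  ≈ 3s

raft-store-max-leader-lease should be kept below the election timeout, which is why it is lowered to 2s here.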

The test results are as follows (note the TOTAL rows; v6.5.2, PD x 1 + TiKV x 3):

Using request distribution 'uniform' a keyrange of [0 99999]
[2023/05/14 10:41:59.791 +08:00] [INFO] [client.go:392] ["[pd] create pd client with endpoints"] [pd-address="[127.0.0.1:33379]"]
[2023/05/14 10:41:59.796 +08:00] [INFO] [base_client.go:350] ["[pd] switch leader"] [new-leader=http://127.0.0.1:33379] [old-leader=]
[2023/05/14 10:41:59.796 +08:00] [INFO] [base_client.go:105] ["[pd] init cluster id"] [cluster-id=7232856776285670424]
[2023/05/14 10:41:59.797 +08:00] [INFO] [client.go:687] ["[pd] tso dispatcher created"] [dc-location=global]
***************** properties *****************
"requestdistribution"="uniform"
"operationcount"="100000000"
"updateproportion"="0.5"
"readallfields"="true"
"measurement.interval"="1"
"tikv.batchsize"="0"
"dotransactions"="true"
"command"="run"
"readproportion"="0.5"
"tikv.pd"="127.0.0.1:33379"
"tikv.type"="txn"
"workload"="core"
"insertproportion"="0"
"threadcount"="20"
"recordcount"="100000"
"scanproportion"="0"
**********************************************
...
...
READ   - Takes(s): 10.9, Count: 76030, OPS: 6946.1, Avg(us): 746, Min(us): 360, Max(us): 92671, 50th(us): 677, 90th(us): 902, 95th(us): 1020, 99th(us): 2001, 99.9th(us): 5643, 99.99th(us): 55007
TOTAL  - Takes(s): 10.9, Count: 151528, OPS: 13845.0, Avg(us): 1441, Min(us): 360, Max(us): 96319, 50th(us): 1537, 90th(us): 2305, 95th(us): 2589, 99th(us): 4227, 99.9th(us): 7951, 99.99th(us): 56575
UPDATE - Takes(s): 10.9, Count: 75498, OPS: 6909.5, Avg(us): 2141, Min(us): 1255, Max(us): 96319, 50th(us): 1993, 90th(us): 2555, 95th(us): 2915, 99th(us): 5743, 99.9th(us): 8631, 99.99th(us): 74815
UPDATE_ERROR - Takes(s): 9.9, Count: 8, OPS: 0.8, Avg(us): 2803, Min(us): 1452, Max(us): 7531, 50th(us): 2261, 90th(us): 3615, 95th(us): 7531, 99th(us): 7531, 99.9th(us): 7531, 99.99th(us): 7531
READ   - Takes(s): 11.9, Count: 83421, OPS: 6983.2, Avg(us): 741, Min(us): 360, Max(us): 92671, 50th(us): 674, 90th(us): 897, 95th(us): 1014, 99th(us): 1973, 99.9th(us): 5615, 99.99th(us): 55007
TOTAL  - Takes(s): 11.9, Count: 166370, OPS: 13928.4, Avg(us): 1432, Min(us): 360, Max(us): 96319, 50th(us): 1533, 90th(us): 2289, 95th(us): 2563, 99th(us): 4187, 99.9th(us): 7843, 99.99th(us): 56447
UPDATE - Takes(s): 11.9, Count: 82949, OPS: 6954.9, Avg(us): 2126, Min(us): 1255, Max(us): 96319, 50th(us): 1981, 90th(us): 2531, 95th(us): 2883, 99th(us): 5715, 99.9th(us): 8399, 99.99th(us): 74815
UPDATE_ERROR - Takes(s): 10.9, Count: 8, OPS: 0.7, Avg(us): 2803, Min(us): 1452, Max(us): 7531, 50th(us): 2261, 90th(us): 3615, 95th(us): 7531, 99th(us): 7531, 99.9th(us): 7531, 99.99th(us): 7531
[2023/05/14 10:42:12.335 +08:00] [INFO] [region_cache.go:2524] ["[health check] check health error"] [store=127.0.0.1:20162] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:20162: connect: connection refused\""]
[2023/05/14 10:42:12.335 +08:00] [INFO] [region_request.go:794] ["mark store's regions need be refill"] [id=2] [addr=127.0.0.1:20162] [error="rpc error: code = Unavailable desc = error reading from server: read tcp 127.0.0.1:50720->127.0.0.1:20162: read: connection reset by peer"] [errorVerbose="rpc error: code = Unavailable desc = error reading from server: read tcp 127.0.0.1:50720->127.0.0.1:20162: read: connection reset by peer\ngithub.com/tikv/client-go/v2/tikvrpc.CallRPC\n\t/disk1/home/pingyu/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/tikvrpc/tikvrpc.go:1064\ngithub.com/tikv/client-go/v2/internal/client.(*RPCClient).sendRequest\n\t/disk1/home/pingyu/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/internal/client/client.go:524\ngithub.com/tikv/client-go/v2/internal/client.(*RPCClient).SendRequest\n\t/disk1/home/pingyu/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/internal/client/client.go:533\ngithub.com/tikv/client-go/v2/internal/client.interceptedClient.SendRequest\n\t/disk1/home/pingyu/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/internal/client/client_interceptor.go:42\ngithub.com/tikv/client-go/v2/internal/client.reqCollapse.SendRequest\n\t/disk1/home/pingyu/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/internal/client/client_collapse.go:74\ngithub.com/tikv/client-go/v2/internal/locate.(*RegionRequestSender).sendReqToRegion\n\t/disk1/home/pingyu/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/internal/locate/region_request.go:1184\ngithub.com/tikv/client-go/v2/internal/locate.(*RegionRequestSender).SendReqCtx\n\t/disk1/home/pingyu/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/internal/locate/region_request.go:1017\ngithub.com/tikv/client-go/v2/txnkv/txnsnapshot.(*ClientHelper).SendReqCtx\n\t/disk1/home/pingyu/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/txnkv/txnsnapshot/client_helper.go:146\ngithub.com/tikv/client-go/v2/txnkv/txnsnapshot.(*KVSnapshot).get\n\t/disk1/home/pingyu/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/txnkv/txnsnapshot/snapshot.go:620\ngithub.com/tikv/client-go/v2/txnkv/txnsnapshot.(*KVSnapshot).Get\n\t/disk1/home/pingyu/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/txnkv/txnsnapshot/snapshot.go:529\ngithub.com/tikv/client-go/v2/internal/unionstore.(*KVUnionStore).Get\n\t/disk1/home/pingyu/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/internal/unionstore/union_store.go:102\ngithub.com/tikv/client-go/v2/txnkv/transaction.(*KVTxn).Get\n\t/disk1/home/pingyu/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/txnkv/transaction/txn.go:172\ngithub.com/pingcap/go-ycsb/db/tikv.(*txnDB).Read\n\t/disk1/home/pingyu/workspace/go-ycsb/db/tikv/txn.go:103\ngithub.com/pingcap/go-ycsb/pkg/client.DbWrapper.Read\n\t/disk1/home/pingyu/workspace/go-ycsb/pkg/client/dbwrapper.go:59\ngithub.com/pingcap/go-ycsb/pkg/workload.(*core).doTransactionRead\n\t/disk1/home/pingyu/workspace/go-ycsb/pkg/workload/core.go:429\ngithub.com/pingcap/go-ycsb/pkg/workload.(*core).DoTransaction\n\t/disk1/home/pingyu/workspace/go-ycsb/pkg/workload/core.go:366\ngithub.com/pingcap/go-ycsb/pkg/client.(*worker).run\n\t/disk1/home/pingyu/workspace/go-ycsb/pkg/client/client.go:129\ngithub.com/pingcap/go-ycsb/pkg/client.(*Client).Run.func2\n\t/disk1/ho
me/pingyu/workspace/go-ycsb/pkg/client/client.go:215\nruntime.goexit\n\t/disk1/home/pingyu/opt/go-1.20.2/src/runtime/asm_amd64.s:1598"]
[2023/05/14 10:42:12.336 +08:00] [INFO] [region_cache.go:2524] ["[health check] check health error"] [store=127.0.0.1:20162] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:20162: connect: connection refused\""]
READ   - Takes(s): 12.9, Count: 85747, OPS: 6622.9, Avg(us): 740, Min(us): 360, Max(us): 92671, 50th(us): 673, 90th(us): 896, 95th(us): 1013, 99th(us): 1994, 99.9th(us): 5615, 99.99th(us): 54911
TOTAL  - Takes(s): 12.9, Count: 170985, OPS: 13207.7, Avg(us): 1430, Min(us): 360, Max(us): 96319, 50th(us): 1530, 90th(us): 2283, 95th(us): 2559, 99th(us): 4203, 99.9th(us): 7887, 99.99th(us): 56447
UPDATE - Takes(s): 12.9, Count: 85238, OPS: 6594.1, Avg(us): 2123, Min(us): 1103, Max(us): 96319, 50th(us): 1978, 90th(us): 2527, 95th(us): 2879, 99th(us): 5711, 99.9th(us): 8623, 99.99th(us): 74687
UPDATE_ERROR - Takes(s): 11.9, Count: 8, OPS: 0.7, Avg(us): 2803, Min(us): 1452, Max(us): 7531, 50th(us): 2261, 90th(us): 3615, 95th(us): 7531, 99th(us): 7531, 99.9th(us): 7531, 99.99th(us): 7531
[2023/05/14 10:42:13.336 +08:00] [INFO] [region_cache.go:2524] ["[health check] check health error"] [store=127.0.0.1:20162] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:20162: connect: connection refused\""]
READ   - Takes(s): 13.9, Count: 85747, OPS: 6148.1, Avg(us): 740, Min(us): 360, Max(us): 92671, 50th(us): 673, 90th(us): 896, 95th(us): 1013, 99th(us): 1994, 99.9th(us): 5615, 99.99th(us): 54911
TOTAL  - Takes(s): 13.9, Count: 170985, OPS: 12260.6, Avg(us): 1430, Min(us): 360, Max(us): 96319, 50th(us): 1530, 90th(us): 2283, 95th(us): 2559, 99th(us): 4203, 99.9th(us): 7887, 99.99th(us): 56447
UPDATE - Takes(s): 13.9, Count: 85238, OPS: 6120.8, Avg(us): 2123, Min(us): 1103, Max(us): 96319, 50th(us): 1978, 90th(us): 2527, 95th(us): 2879, 99th(us): 5711, 99.9th(us): 8623, 99.99th(us): 74687
UPDATE_ERROR - Takes(s): 12.9, Count: 8, OPS: 0.6, Avg(us): 2803, Min(us): 1452, Max(us): 753

| username: TiDBer_Lm1H3bCW | Original post link

After testing with these parameters, there is indeed a significant improvement. The operations link you provided is also very useful. Thank you, expert.

| username: zhanggame1 | Original post link

Got it. I'm also planning to run a similar test soon.

| username: wzf0072 | Original post link

TiKV uses Raft for data replication. Each data change is recorded as a Raft log entry, and through Raft's log replication the data is safely and reliably replicated to every node in the replication group. In practice, per the Raft protocol, a write is considered successfully (and safely) committed once it has been synchronously replicated to a majority of the nodes.

A new leader does not need to pull data from other nodes: a peer whose log lags behind will not be elected leader in the first place.
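To make the commit rule and the election restriction above concrete, here is a minimal, self-contained Go sketch. It is illustrative only: the function names are made up for this example, and this is not TiKV or raft-rs code.

package main

import (
	"fmt"
	"sort"
)

// majorityCommitIndex returns the highest log index that is present on a
// majority of the peers, i.e. the index up to which a Raft leader may
// consider writes committed.
func majorityCommitIndex(matchIndex []uint64) uint64 {
	sorted := append([]uint64(nil), matchIndex...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	// In descending order, the entry at position n/2 is replicated on at
	// least n/2+1 peers, which is a majority for any n.
	return sorted[len(sorted)/2]
}

// logIsUpToDate implements Raft's election restriction: a voter grants its
// vote only if the candidate's last log entry is at least as up-to-date as
// its own (higher last term, or same term and a last index that is not smaller).
func logIsUpToDate(candLastTerm, candLastIndex, voterLastTerm, voterLastIndex uint64) bool {
	if candLastTerm != voterLastTerm {
		return candLastTerm > voterLastTerm
	}
	return candLastIndex >= voterLastIndex
}

func main() {
	// Three replicas: a write acknowledged by two of them is committed,
	// even though the third replica is behind.
	fmt.Println(majorityCommitIndex([]uint64{8, 8, 5})) // 8

	// A lagging candidate cannot collect votes from peers with newer logs,
	// so it cannot become leader and cause committed data to be lost.
	fmt.Println(logIsUpToDate(3, 10, 4, 12)) // false
}

In the scenario in this thread, the killed TiKV node is simply the lagging peer: the two surviving replicas already hold everything that was committed, so the newly elected leader has nothing to catch up on before serving requests.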

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.