Impact of a TiKV Replica Failure (e.g., Server Malfunction) on Business SQL Latency

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIKV 挂了一个副本【例如服务器故障】对业务SQL耗时的影响

| username: residentevil

【TiDB Usage Environment】Production Environment
【TiDB Version】V6.5.8
【Encountered Problem: Problem Phenomenon and Impact】If one TiKV replica goes down (e.g., due to a server failure), what is the impact on business SQL latency, assuming there is sufficient capacity headroom? Will TiDB still send SQL requests to the failed TiKV node?

| username: lemonade010 | Original post link

Once TiDB has detected the faulty TiKV node, it will no longer send SQL requests to it. However, the affected Regions will switch their leaders to other nodes, which may cause some I/O impact.

| username: zhanggame1 | Original post link

Assuming reads and writes were originally distributed across 3 TiKV nodes, if one node goes down, they can only be served by the remaining 2 nodes. Those two nodes will carry a higher load, but if resources are sufficient there will be no significant impact. TiDB caches the Region distribution of each TiKV node; if it cannot reach a TiKV node or cannot find the corresponding Region there, it queries PD for the latest Region distribution and refreshes its cache.
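For illustration, a minimal sketch of querying PD directly for a Region's current peers and leader, which is the same distribution information TiDB refreshes into its cache (the PD address and Region ID are placeholders):

```shell
# Look up the peers and current leader of a specific Region in PD
# (replace the PD address and Region ID with real values)
tiup ctl:v6.5.8 pd -u http://<pd-host>:2379 region 1001
```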

| username: TiDBer_jYQINSnf | Original post link

If one node goes down, the other two will see higher read and write pressure. Regions that lose their leader will quickly elect a new one, so SQL execution is not noticeably affected.

| username: residentevil | Original post link

If there is still a backup strategy, then there is no problem.

| username: 路在何chu | Original post link

Overall performance will decrease, and the CPU load will be distributed to the remaining two nodes.

| username: residentevil | Original post link

Let's set aside the capacity-headroom question for now. Once PD has detected the abnormal TiKV instance, requests from the upper layer should no longer be directed to it.
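As a hedged illustration, PD's scheduling configuration controls how long it waits before treating a disconnected store as down and replenishing its replicas elsewhere (the PD address is a placeholder; `max-store-down-time` defaults to 30 minutes):

```shell
# Show PD's scheduling config; max-store-down-time is how long PD waits
# before marking a store as down and re-creating its replicas on other stores
tiup ctl:v6.5.8 pd -u http://<pd-host>:2379 config show | grep max-store-down-time
```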

| username: 小龙虾爱大龙虾 | Original post link

Yes. Once a Region's leader has moved away, TiDB will no longer send requests to the failed node.

| username: tidb菜鸟一只 | Original post link

The impact on the business should be minimal. A Region re-elects its leader very quickly, so the retried request succeeds after a backoff. It's just that the pressure previously spread over three machines is now carried by two.

| username: redgame | Original post link

I think the impact is still there, but it’s also true that it’s not significant.

| username: heiwandou | Original post link

There will be fluctuations.

| username: residentevil | Original post link

On the other hand, what I'm actually more concerned about is removing the failed node from the topology after such a failure, haha. Have you ever encountered that?

| username: lemonade010 | Original post link

So far, I haven’t encountered this issue. It can only be analyzed on a case-by-case basis.

| username: changpeng75 | Original post link

If a TiKV server goes down, the Regions whose leaders are on that server will re-elect leaders on other nodes, causing brief delays for reads and writes on those Regions. Regions whose leaders are not on that server are generally unaffected. Write operations must go through the Region peer that holds the leader, while read operations may be served by followers.
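For reference, a minimal sketch of enabling follower reads at the session level so that reads can be served by follower replicas; the host, port, and query are placeholders:

```shell
# Enable follower reads for one session and run a query in the same session
mysql -h 127.0.0.1 -P 4000 -u root <<'SQL'
SET SESSION tidb_replica_read = 'follower';  -- allow reads from follower replicas
SELECT COUNT(*) FROM test.t;                 -- placeholder query
SQL
```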

| username: residentevil | Original post link

The business does not require strongly consistent reads, so reading data from a TiKV follower is also fine. One more question: have you run into any unexpected issues when using the tiup tool to remove a faulty TiKV instance, such as the removal failing?

| username: changpeng75 | Original post link

For a TiKV node that has already crashed, you should check the cluster status and wait until all affected Regions have finished electing new leaders before using the --force option to forcibly scale in the node.
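As a hedged sketch of that procedure (the cluster name and node address are placeholders):

```shell
# 1. Confirm the cluster state and that leaders have migrated off the dead store
tiup cluster display <cluster-name>

# 2. Forcibly remove the crashed TiKV node from the topology
tiup cluster scale-in <cluster-name> --node 10.0.1.5:20160 --force

# 3. Clean up nodes left in Tombstone state
tiup cluster prune <cluster-name>
```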

| username: tidb菜鸟一只 | Original post link

If there are only 3 nodes with three replicas, a single node failure will not affect usage. However, the failed node cannot be removed directly: you must first add another node and only then remove the failed one, because with only two surviving nodes there is nowhere to place the third replica. A sketch of that order of operations is below.
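A minimal sketch of scaling out first and then scaling in, assuming a hypothetical topology file and node addresses:

```shell
# scale-out-tikv.yaml (hypothetical) declares the replacement TiKV node, e.g.:
#   tikv_servers:
#     - host: 10.0.1.8
tiup cluster scale-out <cluster-name> scale-out-tikv.yaml

# Once the new store is Up and PD has re-created the missing replicas on it,
# remove the failed node (forcing it out since the host is unreachable)
tiup cluster scale-in <cluster-name> --node 10.0.1.5:20160 --force
```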

| username: dba远航 | Original post link

The other TiKV nodes will certainly see an increased overall load, since the failed one no longer receives traffic.

| username: residentevil | Original post link

You mean scaling out first and then scaling in? I didn't see that sequence described in the official documentation, only the individual scale-in and scale-out operations. :sweat_smile:

| username: residentevil | Original post link

What status does a failed TiKV node show in the cluster topology? Have you encountered this before? I'd like to set aside some time to simulate this scenario.
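For reference, a hedged sketch of how to inspect that status (the cluster name and PD address are placeholders); a crashed TiKV typically shows as Down in the topology, and after a forced scale-in it becomes Tombstone until pruned:

```shell
# Node status as tiup sees it (Up / Down / Pending Offline / Tombstone)
tiup cluster display <cluster-name>

# Store state as PD sees it (state_name: Up / Disconnected / Down / Offline / Tombstone)
tiup ctl:v6.5.8 pd -u http://<pd-host>:2379 store
```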