Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: With three TiKV replicas and only three TiKV machines, what happens to the three replicas if one machine goes down?
[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version]
[Encountered Problem]
[Reproduction Path] Default three replicas, and only three TiKV instances. If one TiKV instance goes down, will the cluster still function normally?
[Problem Phenomenon and Impact]
[Attachment]
Please provide the version information of each component, such as cdc/tikv, which can be obtained by executing cdc version/tikv-server --version.
Running like that is not a normal state; you should add a TiKV node right away to get back to three instances. Once region replenishment and scheduling complete, the cluster returns to its normal state.
In general, production environments use a 5-node, 5-replica deployment.
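(For context: the replica count is PD's replication setting, max-replicas. Below is a minimal sketch that reads the replication config over PD's HTTP API; the PD address 127.0.0.1:2379 and the exact endpoint path are assumptions to check against your version, and in practice operators usually change max-replicas through pd-ctl.)

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Assumed PD address and endpoint; verify both against your deployment
	// and PD version before relying on this.
	resp, err := http.Get("http://127.0.0.1:2379/pd/api/v1/config/replicate")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	// The response includes the max-replicas setting, e.g. {"max-replicas":3,...}.
	fmt.Println(string(body))
}
```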
With 5 replicas, if one TiKV instance goes down, does the entire cluster become unusable?
That definitely won’t happen. With 5 replicas, it can tolerate 2 replica failures.
With three TiKV replicas on three TiKV machines, if one machine goes down, only 2 replicas are left. The cluster can still function normally at that point, but if another machine goes down it can no longer provide normal service, so you need to add another TiKV node promptly.
A triangle is the most stable.
The leaders that were on the crashed node are re-elected on the remaining two machines and then rebalanced, so the cluster can keep functioning normally.
That means as long as most of the KV nodes are alive, the cluster can still function normally.
Not necessarily; that holds for a three-node setup. With more nodes, a three-replica cluster that loses two TiKV nodes can also run into anomalies, because any region that happened to place two of its three replicas on the failed nodes loses its majority (see the sketch below).
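A minimal sketch of that point, using made-up store IDs and region placements (none of this is TiKV's real API; it just checks which regions keep a Raft majority when two of five stores fail):

```go
package main

import "fmt"

func main() {
	// Hypothetical placement of three regions across five stores (store IDs 1-5).
	// These IDs and placements are made up purely for illustration.
	regions := map[string][]int{
		"region-A": {1, 2, 3},
		"region-B": {2, 4, 5},
		"region-C": {1, 4, 5},
	}
	failed := map[int]bool{4: true, 5: true} // stores 4 and 5 are down

	for name, stores := range regions {
		alive := 0
		for _, s := range stores {
			if !failed[s] {
				alive++
			}
		}
		// A Raft group stays available only while a majority of its replicas are alive.
		if alive*2 > len(stores) {
			fmt.Printf("%s: %d/%d replicas alive -> still available\n", name, alive, len(stores))
		} else {
			fmt.Printf("%s: %d/%d replicas alive -> majority lost, unavailable\n", name, alive, len(stores))
		}
	}
}
```

With stores 4 and 5 down, region-A is still served while region-B and region-C lose their majority, which is why "two nodes down" can hurt even a cluster with more than three TiKVs.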
Under normal circumstances, with three replicas and three TiKV instances, if one TiKV instance goes down, the cluster remains operational.
The majority-rule principle: as long as more than half of a region's replicas are alive, it can keep providing service.
If one node goes down, TiKV can read normally, but writing will be abnormal.
Does anyone have production experience? Why does everyone say something different?
For 3 replicas, at most 1 TiKV is allowed to fail; for 5 replicas, at most 2 TiKVs are allowed to fail. Strictly speaking, what matters is that at most 1 (or 2) replicas of any given region fail: once the TiKVs hosting more than half of a region's replicas are down, that region can no longer provide service.
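The arithmetic behind that rule, as a trivial sketch (not anything from TiKV's codebase): a Raft group with n replicas keeps its majority as long as at most floor((n-1)/2) replicas are lost.

```go
package main

import "fmt"

// tolerableFailures returns how many replicas a Raft group with n replicas
// can lose while still keeping a majority: floor((n-1)/2).
func tolerableFailures(n int) int {
	return (n - 1) / 2
}

func main() {
	for _, n := range []int{3, 5} {
		fmt.Printf("%d replicas: majority = %d, tolerates %d failed replica(s)\n",
			n, n/2+1, tolerableFailures(n))
	}
}
```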
If one out of three instances goes down, it can still function normally.
It can work normally: as long as a majority of each region's replicas are alive, the cluster can still function. QPS and TPS will be affected briefly when the failure occurs. And as long as at least one replica of a region survives, its data can still be recovered.
So, does the TiDB database design require at least 3 replicas for each region?
This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.