Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: 缩容tikv快24小时了,还是下线中 (It's been almost 24 hours scaling in TiKV and it's still going offline)
【TiDB Usage Environment】Production Environment
【TiDB Version】5.0.1
【Reproduction Path】Operations performed that led to the issue
【Encountered Issue: Issue Phenomenon and Impact】
【Resource Configuration】
【Attachment: Screenshot/Log/Monitoring】
raftdb.info log
tiup or k8s?
Run pd-ctl store and check whether the offline store still has regions on it.
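For example, assuming the cluster was deployed with TiUP (version 5.0.1 as above) and using a placeholder PD address, something like this would show whether regions are still being drained off the store (the store ID is whatever the store list reports for the offline node):

    # list all stores; the one being scaled in should show state_name "Offline"
    tiup ctl:v5.0.1 pd -u http://127.0.0.1:2379 store

    # inspect that store's region_count / leader_count (store ID 4 is only an example)
    tiup ctl:v5.0.1 pd -u http://127.0.0.1:2379 store 4

The scale-in only finishes (the store turns Tombstone) once its region_count drops to 0.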
Scaling down from how many nodes to how many?
From three nodes down to two. Actually, this node had its operating system reinstalled, and after the cluster was restored it never came back online, so the plan was to scale it in and then scale it back out to see if that fixes things. That's the background.
You cannot scale down from 3 nodes to 2, as it does not meet the minimum requirement of 3 replicas. You need to scale up first and then scale down.
The internal mechanism of TiDB is three replicas. You should add a node first, and then proceed with the operation.
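For reference, a minimal scale-out sketch with TiUP, assuming the cluster name clustername used in the next reply and a hypothetical topology file scale-out.yaml describing the new TiKV host:

    # scale-out.yaml would contain just the new TiKV node, e.g.
    # tikv_servers:
    #   - host: xx.xx.xx.xx
    tiup cluster scale-out clustername scale-out.yaml

Once the new store is Up, PD can place the third replica there and the stuck scale-in can finish draining.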
You can try this command for nodes that are already offline: tiup cluster scale-in --force clustername --node xx.xx.xx.xx:port
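Note that --force removes the node from the topology without waiting for its regions to be migrated, so it is normally only used for a node that is already permanently dead. A hedged sketch, reusing the placeholders from the reply above:

    # forcibly remove the dead TiKV from the topology (skips region migration)
    tiup cluster scale-in clustername --node xx.xx.xx.xx:port --force

    # confirm the node no longer appears in the topology
    tiup cluster display clustername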
Experts are indeed experts; they pinpointed the issue immediately. I didn't expect the problem to be scaling 3 down to 2…
Four nodes have been offline for over 12 hours and are still in the process of going offline.
Then that is not expected, unless the capacity added by scaling out is not enough to take over the data.
It shouldn’t be.
Please send the output of the pd-ctl store command.
The previously offline node is now mysteriously back online. Initially, I was considering scaling down and then scaling it back up because it had been offline for a while. This is quite a magical occurrence! Thanks to everyone for the series of suggestions.
If there are three nodes and one goes down, how long will it take for problems to arise if no new nodes are added?
There won’t be any issues for a long time, until another one goes down, at which point it won’t be able to provide service, and you’ll have to wait for the node to restart.
If one of the three nodes is down, the remaining two can still provide service normally. Why is it that when three nodes are scaled down to two, it doesn’t work?
With 3 TiKV instances, if you scale down to 2 there is nowhere left to place the required third replica, so the Raft groups can never get back to the configured 3 replicas and the store cannot finish going offline.
For example, if one of the three node instances is down, isn’t there only one left? Why can it still provide normal service? What is the difference between scaling down and being down?
The default on the PD side is 3 replicas, and a write is considered committed once 2 of the 3 replicas have acknowledged it. If one node goes down, the remaining 2 must both commit for the cluster to keep running, and in practice it does keep running. So why can't you scale down from 3 to 2? Because TiUP (or the operator on Kubernetes) restricts it: a node failure is an accident you cannot avoid, but deliberately scaling down into such a dangerous state is something the tooling refuses to do.
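To check (or, with care, change) that replica count, something like the following should work against a placeholder PD address:

    # show the current replication settings; max-replicas defaults to 3
    tiup ctl:v5.0.1 pd -u http://127.0.0.1:2379 config show replication

    # it can be changed, but lowering it is generally not recommended on a production cluster
    # tiup ctl:v5.0.1 pd -u http://127.0.0.1:2379 config set max-replicas <n>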
- The default value of replication.max-replicas in TiDB is 3. Therefore, if you want to take down 2 out of the 3 members of a region, TiDB will not accept such a request. You can check the TiDB log for more details.
- If you are taking one instance offline out of the normal 3 members and it is taking a long time to complete, check the following parameters for scheduling limits that may be throttling the migration (see the pd-ctl sketch after this reply):
  - region-schedule-limit
  - replica-schedule-limit
  - max-snapshot-count
  - store limit
Additionally, there have been cases where the TiKV nodes simply did not have enough disk space, or where the capacity-related settings in TiKV throttled the migration progress.
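A hedged sketch of how those limits can be inspected and loosened with pd-ctl (the PD address and the numbers are placeholders, not recommendations; tune them to your hardware):

    # current scheduling config, including region-schedule-limit, replica-schedule-limit, max-snapshot-count
    tiup ctl:v5.0.1 pd -u http://127.0.0.1:2379 config show

    # temporarily raise the scheduling limits so regions drain off the offline store faster
    tiup ctl:v5.0.1 pd -u http://127.0.0.1:2379 config set replica-schedule-limit 64
    tiup ctl:v5.0.1 pd -u http://127.0.0.1:2379 config set max-snapshot-count 64

    # raise the per-store rate limit for adding/removing peers
    tiup ctl:v5.0.1 pd -u http://127.0.0.1:2379 store limit all 200

Remember to restore the original values after the scale-in completes.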