Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: Three-node TiKV cluster, one of the TiKV node servers was formatted; how can it be recovered?
[TiDB Usage Environment] Quasi-production environment
[TiDB Version] 6.5
[Reproduction Path] Hello everyone! I need some advice on how to recover a TiKV node when one of the three TiKV node servers suddenly fails.
[Encountered Problem] Hello everyone! One of the three TiKV node servers was suddenly formatted, and I need advice on how to recover that node. I tried running `tiup cluster check /home/tidb/tidb/topology.yaml --user tidb -p -i /home/tidb/.ssh/id_rsa`, but the check could not complete because that TiKV node is unreachable. Now I'm not sure how to start repairing this node. Is it necessary to scale in and then scale out to recover?
[Attachment: Screenshot/Log/Monitoring]
TiKV only has three nodes, and one of them has gone down. It is recommended to quickly find a machine to add a new node and restore the cluster first.
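Before deciding on a repair path, it can help to confirm how the cluster currently sees the lost node. A minimal sketch, assuming the cluster is named `tidb-test` and a PD node listens on 192.168.1.1:2379 (both placeholders):

```shell
# List every component and its status; the formatted TiKV node should show
# up here as Down or Disconnected.
tiup cluster display tidb-test

# Inspect the store records kept by PD, including the store ID of the lost
# TiKV, which is useful later if the old store has to be deleted.
tiup ctl:v6.5.0 pd -u http://192.168.1.1:2379 store
```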
The cluster is still normal and the database is usable. The main issue is that there are no spare machines available for scaling out at the moment; the server provider has only re-initialized (wiped) the existing server.
Is there a more detailed procedure? From what I can see, your second step only removes the TiKV node. How do I recreate it and add it back to the cluster?
There are only 3 TiKV nodes, so the broken one cannot be forcibly scaled in. Try to scale out a new one first.
Didn't your server get formatted? Just scale out a new TiKV on that formatted server, and then scale in the old one.
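A rough sketch of this scale-out-then-scale-in approach. The cluster name `tidb-test`, the host 192.168.1.3, the ports, the deploy/data directories, and the SSH key path are all assumptions to be replaced with real values:

```shell
# scale-out.yaml: a new TiKV on the formatted machine, on non-default ports
# so it does not collide with the old store record that PD still remembers.
cat > scale-out.yaml <<'EOF'
tikv_servers:
  - host: 192.168.1.3
    port: 20161
    status_port: 20181
    deploy_dir: /data/tidb-deploy/tikv-20161
    data_dir: /data/tidb-data/tikv-20161
EOF

# Pre-check the new topology against the existing cluster, then scale out.
tiup cluster check tidb-test scale-out.yaml --cluster --user tidb -i /home/tidb/.ssh/id_rsa
tiup cluster scale-out tidb-test scale-out.yaml --user tidb -i /home/tidb/.ssh/id_rsa
```

Once the new store is Up and the Regions have re-replicated, the old instance can be scaled in, as described in the later replies.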
Specifically for the environment check on the new machine, removing the old node, and rejoining the cluster: could you be more specific or point to a guide post?
I don't think so. You still need to scale out a node first. Scaling out directly on the same machine as-is will probably result in an error.
Should I consider scaling in first, then adding the formatted machine back in? I'm mainly worried that IP conflicts or leftover configuration conflicts might trigger other bugs.
What I meant was to scale out one first, then scale in. The title says that one of the machines running TiKV was formatted, so just scale out on that same machine, change the port, and then remove the old one.
Losing one replica does not affect the cluster.
If you change the IP, the node can be added directly. If the IP is not changed, the original store record must be deleted before the new TiKV can start.
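If the same IP and port are reused, the old store that PD still tracks has to be removed first. A hedged sketch with pd-ctl (the PD address, ctl version, and store ID are placeholders; note that with only three TiKV stores and three replicas, the deleted store will usually stay Offline until a replacement store is available):

```shell
# Look up the store ID of the wiped TiKV (it should appear as Down/Offline).
tiup ctl:v6.5.0 pd -u http://192.168.1.1:2379 store

# Ask PD to delete that store so a new TiKV can register on the same address.
tiup ctl:v6.5.0 pd -u http://192.168.1.1:2379 store delete <store_id>
```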
First, scale out. With only 2 TiKV nodes left, it’s not possible to scale in this faulty node.
Changing the port should be possible.
Yes, adding one TiKV to the original machine is equivalent to having two TiKVs mixed on one machine. However, one of them is unusable. Just change the port for the added one.
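After the new TiKV on the changed port is Up and the Region replicas are healthy again, the old, dead instance can be removed. A minimal sketch, assuming the old TiKV was 192.168.1.3:20160 and the cluster is named `tidb-test`; since that instance no longer exists, `--force` skips the normal data migration:

```shell
# Remove the old TiKV instance that was wiped when the server was formatted.
tiup cluster scale-in tidb-test --node 192.168.1.3:20160 --force

# If the old store lingers as Tombstone, clean it up.
tiup cluster prune tidb-test
```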
Okay, I’ll give it a try. Thank you, everyone.
Boss, I have a question that I don’t quite understand. If the server is formatted, is the data lost? Can the cluster still be accessed normally? Or does the leader on the faulty server migrate to a normal server the moment the server becomes unavailable, allowing the cluster to still be accessed normally without data loss?
Here are a few key points:
- If the TiKV node that was formatted held a leader that was being accessed at that moment, those requests would be affected, because they would fail when the data could not be retrieved. If the client or TiDB has a retry mechanism, the requests generally recover and succeed after retrying. Requests that go to the other, healthy TiKV nodes proceed normally and are not affected.
- The cluster defaults to 3 replicas, so if only one TiKV node is formatted, there are still 2 replicas left and no data is lost. Within a fairly short time, the remaining two replicas elect a new leader and continue to provide service.
Therefore, the cluster as a whole remains accessible, and the data in the cluster will not be lost. There will only be some jitter in access at the moment the node goes down. TiDB’s self-healing capability is quite strong.
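To confirm this on a running cluster, the replica setting and Region health can be checked from PD. A minimal sketch (the PD address and tiup ctl version are assumptions):

```shell
# max-replicas should report 3, the default replica count mentioned above.
tiup ctl:v6.5.0 pd -u http://192.168.1.1:2379 config show replication

# List Regions that are temporarily missing a peer after the node loss;
# this should drain back to empty once a replacement TiKV is available.
tiup ctl:v6.5.0 pd -u http://192.168.1.1:2379 region check miss-peer
```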