Three-node TiKV cluster: one TiKV server was formatted, how to recover?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 三节点tikv,其中一个tikv节点服务器被格式化了,求教怎么恢复。

| username: 末0_0想

[TiDB Usage Environment] Quasi-production environment
[TiDB Version] 6.5
[Reproduction Path] Hello everyone! I need some advice on how to recover a TiKV node when one of the three TiKV node servers suddenly fails.
[Encountered Problem] I tried running tiup cluster check /home/tidb/tidb/topology.yaml --user tidb -p -i /home/tidb/.ssh/id_rsa, but the check cannot complete while the TiKV node is missing. I’m not sure where to start repairing this node. Do I need to scale in and then scale out to recover it?
[Attachment: Screenshot/Log/Monitoring]

| username: h5n1 | Original post link

  1. To scale out a TiKV, you need to have available machines.
  2. If an existing TiKV needs to be decommissioned, first use tiup cluster scale-in or pd-ctl store delete to set it to Offline status. It will eventually become Tombstone, after which you can run pd-ctl store remove-tombstone.
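The decommission steps above could be scripted roughly as follows. This is only a sketch: the cluster name, PD endpoint, node address, and the v6.5.0 patch version are placeholders that do not come from this thread, and the `run` helper is a hypothetical wrapper that prints each command unless DRY_RUN=0 is set. Other replies in this thread advise scaling out a replacement TiKV before running any of this.

```shell
#!/usr/bin/env bash
# Placeholders (assumptions, not from the thread): replace before use.
CLUSTER="my-cluster"          # hypothetical cluster name
PD="http://10.0.1.1:2379"     # hypothetical PD endpoint
OLD_NODE="10.0.1.5:20160"     # hypothetical address of the formatted TiKV
CTL="ctl:v6.5.0"              # assumed patch release of the 6.5 cluster

# Print each command instead of executing it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# 1. Mark the old store Offline. The pd-ctl alternative needs the store ID
#    shown by the "store" command:
#    run tiup "$CTL" pd -u "$PD" store delete <store_id>
run tiup cluster scale-in "$CLUSTER" --node "$OLD_NODE"

# 2. List the stores and watch state_name until the old store reads Tombstone.
run tiup "$CTL" pd -u "$PD" store

# 3. Clear the tombstone records from PD.
run tiup "$CTL" pd -u "$PD" store remove-tombstone
```

Run it once as-is to review the printed commands, then again with DRY_RUN=0 once the placeholders match your cluster.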
| username: CuteRay | Original post link

TiKV only has three nodes, and one of them has gone down. It is recommended to quickly find a machine to add a new node and restore the cluster first.

| username: 末0_0想 | Original post link

The cluster is healthy and the database is still usable. The main issue is that there are no additional machines available for scaling out at the moment. The server provider simply re-initialized the existing server.

| username: 末0_0想 | Original post link

Is there a more detailed procedure? Your second step only covers removing the TiKV node. How do I recreate it and add it back to the cluster?

| username: 啦啦啦啦啦 | Original post link

There are only 3 TiKVs, and the broken one cannot be forcibly scaled in. Try to scale out one node first.

| username: CuteRay | Original post link

Didn’t your server get formatted? Just scale out a new TiKV on this formatted server, and then scale in the old one.

| username: 末0_0想 | Original post link

Could you be more specific, or point me to a guide? I’m especially unsure about the environment check on the new machine, removing the old node, and rejoining the cluster.

| username: 啦啦啦啦啦 | Original post link

I don’t think that will work. You still need to scale out a node first. Scaling out directly on the same machine will probably result in an error.

| username: 末0_0想 | Original post link

Should I consider scaling in first, then adding the formatted machine back in? I’m mainly worried about IP conflicts or leftover configuration conflicts that might trigger other bugs.

| username: CuteRay | Original post link

What I meant was to scale out first, then scale in. The title says that one of the machines running TiKV was formatted, so scale out onto that same machine with a different port, and then remove the old node.

| username: TiDBer_jYQINSnf | Original post link

Losing one replica does not affect the cluster.
If you change the IP, the node can be added directly. If the IP stays the same, the original store must be deleted before the new TiKV can start.

| username: 啦啦啦啦啦 | Original post link

First, scale out. With only 2 TiKV nodes left, it’s not possible to scale in this faulty node.
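A rough way to see why: PD has to rebuild the lost replicas on some other store before the faulty store can leave the Offline state, and with 3 replicas and only 2 healthy stores there is no store left to hold the third copy. The sketch below assumes the default max-replicas of 3 (this thread does not confirm the cluster's actual setting); the commented commands show where the real numbers would come from, with a placeholder PD address and an assumed v6.5.0 ctl version.

```shell
#!/usr/bin/env bash
# Assumed values: read the real ones from the commented pd-ctl commands below.
max_replicas=3   # PD default, not confirmed for this cluster
up_stores=2      # healthy TiKV stores left after one server was formatted

# tiup ctl:v6.5.0 pd -u http://<pd-host>:2379 config show replication  # prints max-replicas
# tiup ctl:v6.5.0 pd -u http://<pd-host>:2379 store                    # prints state_name per store

# Each Region needs max_replicas distinct healthy stores to hold its copies.
if [ "$up_stores" -lt "$max_replicas" ]; then
  verdict="scale out first"
else
  verdict="scale-in can proceed"
fi
echo "up_stores=$up_stores max_replicas=$max_replicas -> $verdict"
```

With the assumed numbers the check reports that a scale-out is needed first, which matches the advice above.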

| username: 啦啦啦啦啦 | Original post link

:thinking: Changing the port should be possible.

| username: CuteRay | Original post link

Yes, adding one TiKV to the original machine is equivalent to having two TiKV instances co-located on one machine, except that one of them is unusable. Just use a different port for the new one.
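As an illustration of "same machine, different port", a scale-out topology might look like the sketch below. Every concrete value is an assumption and not from this thread: the cluster name, the host IP, the 20161/20181 ports (chosen only to avoid the default 20160/20180), and the directory paths. The SSH flags simply mirror the ones used in the original question, and the `run` helper is a hypothetical wrapper that prints commands unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# All values below are placeholders (assumptions), not taken from the thread.
CLUSTER="my-cluster"     # hypothetical cluster name
HOST="10.0.1.5"          # hypothetical IP of the formatted server
TOPO="scale-out.yaml"

# New TiKV instance on the same host, moved off the default ports and given
# its own directories so it does not clash with the old, unusable store.
cat > "$TOPO" <<EOF
tikv_servers:
  - host: ${HOST}
    port: 20161
    status_port: 20181
    deploy_dir: /tidb-deploy/tikv-20161
    data_dir: /tidb-data/tikv-20161
EOF

# Print each command instead of executing it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Check the host against the existing cluster, scale out, then confirm status.
run tiup cluster check "$CLUSTER" "$TOPO" --cluster --user tidb -p -i /home/tidb/.ssh/id_rsa
run tiup cluster scale-out "$CLUSTER" "$TOPO" --user tidb -p -i /home/tidb/.ssh/id_rsa
run tiup cluster display "$CLUSTER"
```

Once the new instance shows as Up and the Regions have been replenished, the old node can be scaled in as described earlier in the thread.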

| username: 末0_0想 | Original post link

Okay, I’ll give it a try. Thank you, everyone.

| username: h5n1 | Original post link

  1. For scaling out, if you are deploying TiKV on the original machine, you need to either scale in the original TiKV first or use a new port. When scaling out, specify the TiKV configuration. You can copy one from the original configuration and modify it.
  2. For scaling in, refer to the link below:
    专栏 - TiKV缩容下线异常处理的三板斧 | TiDB 社区 (Column: Three Tactics for Handling Abnormal TiKV Scale-In and Decommissioning | TiDB Community)
| username: 胡杨树旁 | Original post link

Boss, there is something I don’t quite understand. If the server is formatted, is the data lost? Can the cluster still be accessed normally? Or do the leaders on the faulty server migrate to a healthy server the moment the server becomes unavailable, so that the cluster can still be accessed normally without data loss?

| username: Jellybean | Original post link

Here are a few key points:

  1. If a request was being served by a leader on the formatted TiKV node at that moment, that request fails because the data cannot be read. If the client retries its requests to the TiDB cluster, it will generally succeed after retrying. Requests served by the other healthy TiKV nodes proceed normally and are not affected.

  2. The cluster defaults to 3 replicas, so if only one TiKV node is formatted, 2 replicas remain and no data is lost. Two out of three is still a majority, so within a relatively short period the remaining replicas can elect a new leader and continue to provide service.

Therefore, the cluster as a whole remains accessible, and the data in the cluster will not be lost. There will only be some jitter in access at the moment the node goes down. TiDB’s self-healing capability is quite strong.
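The arithmetic behind point 2 can be written out as a small sketch. It assumes the default of 3 replicas mentioned above, with each Region keeping one replica on each of the three TiKV nodes, and uses the usual Raft majority rule.

```shell
#!/usr/bin/env bash
replicas=3   # default replica count, as stated above
lost=1       # the formatted TiKV node

# A Raft group needs a strict majority of its replicas to elect a leader.
quorum=$(( replicas / 2 + 1 ))
alive=$(( replicas - lost ))

if [ "$alive" -ge "$quorum" ]; then
  echo "alive=$alive quorum=$quorum: Regions can still elect a leader"
else
  echo "alive=$alive quorum=$quorum: Regions lose quorum"
fi
```

Setting lost=2 flips the result, which is why losing a second node before the first is replaced is the case to avoid.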

| username: 胡杨树旁 | Original post link

Got it, thanks.