Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: tikv离线起不来 (TiKV went offline and won't start)
[TiDB Usage Environment] Production Environment
[TiDB Version] 6.2.0
[Encountered Problem: Symptoms and Impact]
A TiKV node in the production environment went offline and could not be started again with tiup cluster.
Rebuilding this node is the simplest and most reliable method.
Scale it in and then scale it out again; there's no better way.
Your version, 6.2, is a DMR (Development Milestone Release) version and is not suitable for production use.
Check whether the node's disk is full or whether the data directory lacks write permissions.
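A minimal Python sketch of the disk and permission check suggested above. The function name `check_data_dir` and the 5% free-space threshold are assumptions for illustration, not anything TiKV or TiUP defines. Run it as the same OS user that runs TiKV, otherwise the permission result is not meaningful.

```python
import os
import shutil


def check_data_dir(path, min_free_ratio=0.05):
    """Return a list of problems found for a data directory (empty if none).

    min_free_ratio is an illustrative threshold, not a TiKV setting.
    """
    if not os.path.isdir(path):
        return [f"{path} does not exist or is not a directory"]

    problems = []
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total and usage.free / usage.total < min_free_ratio:
        problems.append(
            f"low disk space: {usage.free} of {usage.total} bytes free"
        )
    # Creating files in a directory needs both write and execute permission.
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append(f"current user cannot write to {path}")
    return problems
```

Call it with the data directory from your own topology file; an empty list means neither of the two problems was detected.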
Doesn’t the appearance of “welcome” mean that it has started up?
“Welcome” is only the first log line; “ready to serve” is what indicates that it has truly started up.
In this case, it is continuously restarting.
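To tell a real startup from a restart loop, you can count the two markers mentioned above in the TiKV log. This is only a sketch: the marker strings “welcome” and “ready to serve” are taken from this thread, the matching is a plain case-insensitive substring test, and the function name `startup_summary` is made up for the example.

```python
def startup_summary(log_lines):
    """Summarize start attempts seen in an iterable of TiKV log lines."""
    starts = 0
    last_start_ready = False
    for line in log_lines:
        lowered = line.lower()
        if "welcome" in lowered:
            # A new start attempt begins; it has not reached "ready" yet.
            starts += 1
            last_start_ready = False
        elif "ready to serve" in lowered:
            last_start_ready = True

    if starts == 0:
        status = "no start attempt found"
    elif last_start_ready:
        status = "started"
    elif starts > 1:
        status = "restart loop suspected"
    else:
        status = "still starting or failed once"
    return {"starts": starts, "status": status}
```

Passing an open log file works, since a file object iterates line by line. Several “welcome” lines with no “ready to serve” after the last one is the pattern described in this thread.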
It doesn't have much impact; just scale the node in and then scale it out again.
Following this thread. Scaling in on a production cluster is really scary.
I hadn’t noticed this detail before~
No big deal, scale the node in and then scale it out again.
Was it offline for too long? Did that cause the logs to be overwritten?
Focus on solving the problem.
You can try this: first take a full backup of the machine at the operating-system level, then do the scale-in and scale-out on the backup copy. It's safer this way.
Is it really that fragile? Scaling in and out seems quite troublesome.
First scale in, then scale out.
I tried to scale it in, but it's been two hours and it's still not done. Is it too broken to be scaled in?
How many nodes do you have in total? Don’t tell me you only have three nodes and one of them is down?
No. Before you scale in, you need to check whether enough nodes will be left. If not, you need to scale out first and then scale in.
Even if it's broken, it can still be scaled in. If you run the store command in pd-ctl, you will see the region count on it gradually decrease as the other TiKV nodes replenish the replicas.
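The node-count check can be written down as simple arithmetic. This sketch assumes the PD default of 3 replicas per region (max-replicas); use your own value if you changed it. The function name and parameters are hypothetical.

```python
def enough_stores_after_scale_in(up_store_count, target_is_up, max_replicas=3):
    """Return True if the stores left after scale-in can hold all replicas.

    up_store_count: number of TiKV stores currently healthy (Up).
    target_is_up:   whether the store being removed is one of those.
    max_replicas:   replicas per region; 3 is assumed here.
    """
    remaining = up_store_count - (1 if target_is_up else 0)
    return remaining >= max_replicas
```

For a three-node cluster with one node already down, `enough_stores_after_scale_in(2, False)` is False, which matches the advice above: scale out first, then scale in.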
If there is still a leader on it, then your cluster has a problem.
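If you save the JSON printed by the pd-ctl store command, a few lines of Python can pull out the numbers discussed above. This is a sketch under assumptions: the field names (`stores`, `store.id`, `store.state_name`, `status.region_count`, `status.leader_count`) follow the output layout as I understand it, so check them against your own output. `SAMPLE` is made up for illustration, including the address and counts.

```python
import json

# Made-up sample in the assumed layout; not real cluster output.
SAMPLE = """
{"count": 1,
 "stores": [
   {"store": {"id": 4, "address": "10.0.1.4:20160", "state_name": "Offline"},
    "status": {"leader_count": 0, "region_count": 1200}}
 ]}
"""


def store_progress(store_json, store_id):
    """Return state, region count and leader count for one store, or None."""
    data = json.loads(store_json)
    for item in data.get("stores", []):
        if item["store"]["id"] == store_id:
            status = item.get("status", {})
            return {
                "state": item["store"].get("state_name", "unknown"),
                # Counts may be absent when zero, so default to 0.
                "regions": status.get("region_count", 0),
                "leaders": status.get("leader_count", 0),
            }
    return None


print(store_progress(SAMPLE, 4))
```

Running it periodically against fresh output shows whether `regions` is actually going down; a non-zero `leaders` value on the store being removed is the warning sign mentioned above.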