[Urgent] What to do when all TiKV nodes in the TiDB cluster are in offline status

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 【十万火急】tidb集群的tikv全部处于offline状态怎么办

| username: ablewang_xiaobo

【TiDB Usage Environment】Production Environment
【TiDB Version】v4.0.9
【Encountered Problem】All TiKV nodes in the TiDB cluster are in offline status, what should I do?
【Reproduction Path】Tried restarting the cluster, but the issue was not resolved
【Problem Phenomenon and Impact】
All TiKV nodes in the TiDB cluster are in Offline status; how can the cluster be repaired?

| username: tidb狂热爱好者 | Original post link

It’s okay, you can start the TiKV service on the machine with:
systemctl start tikv-20160.service

| username: ablewang_xiaobo | Original post link

My TiKV service is running normally, but its status is offline.

| username: songxuecheng | Original post link

Use pd-ctl to check the status.
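For example, something along these lines lists every store and its state (the PD address is a placeholder for one of your PD endpoints):

# Show all stores with their state (Up / Offline / Tombstone) and region/leader counts
tiup ctl:v4.0.9 pd -u http://<pd-ip>:2379 store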

| username: xfworld | Original post link

Have you confirmed that the network is connected?
Check the status of each node through the dashboard.

| username: db_user | Original post link

What operations led to this state? Check the PD logs, TiFlash logs, and TiKV logs.

| username: h5n1 | Original post link

He is showing offline in tiup cluster display, but the actual TiKV service is running normally, right? Try tiup cluster start xxx -R tikv and check again.
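A minimal sketch of that check (xxx stands for the cluster name):

# Restart only the TiKV role, then re-check the status that tiup reports
tiup cluster start xxx -R tikv
tiup cluster display xxx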

| username: ablewang_xiaobo | Original post link

I tried that as well, but it didn’t help. In the end, I solved it by adding other machines to the TiKV cluster and transferring the leaders and regions away.

| username: h5n1 | Original post link

After you added a few more TiKVs, what was done with the original TiKV nodes?

| username: Running | Original post link

Is the network connection between PD and TiKV working properly?

| username: xfworld | Original post link

If it can still migrate data, the status is basically OK, unless the store and region heartbeats have been lost…

| username: ablewang_xiaobo | Original post link

When TiKV is in the offline state, its status is shown as “Offline” in the dashboard. Performing a scale-in on a machine in the offline state will change its status to pending offline. After adding other machines to the TiKV cluster, the machines in the offline state immediately transfer all their regions to the newly added machines. Once the transfer is complete, the machines in the offline state change to the tombstone state. Later, I ran the prune command, and those machines were removed from the cluster.
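For reference, the scale-in and cleanup flow described above looks roughly like this (the cluster name xxx and the node address are placeholders):

# Scale in an offline TiKV node; once its regions are moved away, the store becomes Tombstone
tiup cluster scale-in xxx --node <tikv-ip>:20160
# After the store is Tombstone, remove the tombstone nodes from the topology
tiup cluster prune xxx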

As for why they were in the offline state, I suspect it might be because I didn’t successfully remove TiFlash. I manually executed a series of commands, here are a few:

tiup ctl:v4.0.9 pd -u http://*.*.*.*:2379 store delete 1
tiup ctl:v4.0.9 pd -u http://*.*.*.*:2379 store delete 4
tiup ctl:v4.0.9 pd -u http://*.*.*.*:2379 store delete 5
| username: ablewang_xiaobo | Original post link

I personally think that although those TiKV nodes are in an offline state, they are still providing services because there are no other nodes to take over their regions. Therefore, they remain in the “offline” state.

| username: ablewang_xiaobo | Original post link

At that time, the data could be accessed normally, and I even successfully performed a full backup of the database.
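For what it’s worth, it was just an ordinary full backup; with BR on this version it would look something like this (the PD address and storage path are placeholders):

# Full backup with BR; the Offline stores can still serve it because they still hold the leaders
br backup full --pd "<pd-ip>:2379" --storage "local:///data/backup" --log-file backup_full.log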

| username: ablewang_xiaobo | Original post link

Of course, in my case above I managed to back up the data while TiKV was in the offline state, so restoring from that backup might actually be faster: adding new nodes and letting the system migrate regions automatically takes a lot of time. If you do go the region-migration route, it is best to raise leader-schedule-limit and region-schedule-limit with pd-ctl to speed up the migration, as sketched below.
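Something along these lines (the values are only illustrative; check the current settings first):

# Check the current scheduling limits, then raise them to speed up region migration
tiup ctl:v4.0.9 pd -u http://<pd-ip>:2379 config show
tiup ctl:v4.0.9 pd -u http://<pd-ip>:2379 config set leader-schedule-limit 8
tiup ctl:v4.0.9 pd -u http://<pd-ip>:2379 config set region-schedule-limit 1024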

| username: h5n1 | Original post link

When TiKV is in the offline state, its status is already “offline” when viewed on the dashboard. If you perform a scale-in on a machine in the offline state, it will change to the pending offline state. After adding other machines to the TiKV cluster, the machines in the offline state immediately transfer all their regions to the newly added machines. Once the transfer is complete, the machines in the offline state change to the tombstone state. Later, I ran the prune command, and those machines were removed from the cluster.

As for why they were in the offline state, I guess it might be because I didn’t successfully remove TiFlash. I manually executed a series of commands, here are a few:

tiup ctl:v4.0.9 pd -u http://*.*.*.*:2379 store delete 1
tiup ctl:v4.0.9 pd -u http://*.*.*.*:2379 store delete 4
tiup ctl:v4.0.9 pd -u http://*.*.*.*:2379 store delete 5

Were these deletes issued against TiKV stores? pd-ctl store delete is exactly the offline (scale-in) process. You just ran it on all the TiKVs without any spare TiKVs to receive the transferred regions, so the stores stayed in the Offline state while their Leaders kept providing service.
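If a store was taken Offline by mistake and has not yet become Tombstone, as far as I know PD’s HTTP API can set it back to Up (the store ID and PD address below are placeholders):

# Ask PD to set a mistakenly offlined store back to Up (only possible before it turns Tombstone)
curl -X POST "http://<pd-ip>:2379/pd/api/v1/store/<store-id>/state?state=Up"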

| username: tidb狂热爱好者 | Original post link

This is due to network disconnection. Once the network is restored, it will automatically reconnect.

| username: ablewang_xiaobo | Original post link

The diagram attached in the original post clearly illustrates the state changes of TiKV in my environment.

| username: cs58_dba | Original post link

It feels like the network is down. As long as the cluster doesn’t split-brain, it’s fine.

| username: HACK | Original post link

It seems like there is a network issue, causing the status to be unreachable.