After expanding TiKV capacity, the 3 TiKV nodes remain in an election state, and TiDB cannot start up normally

[TiDB Usage Environment] Test
[TiDB Version] 7.1.0
[Reproduction Path] Alternately take down TiKV nodes with very short intervals, then bring them all back online.
[Encountered Problem] TiKV remains in an election state and TiDB cannot start normally; it reports that the wait time is too long. After taking down one node and restarting the cluster, the following cluster status appears.
[Resource Configuration]

After restarting the cluster, other monitoring plugins also failed to start.

According to the logs, the region_id involved in the election keeps changing.

:+1: How did it end up like this?

Is one of the TiKVs not up? Is any error reported? Check the TiKV logs and look for error-level entries.

You need to check the running logs of each component, identify anything abnormal, and post it here.
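
As a rough aid for the log check suggested above, here is a minimal Python sketch that pulls error-level lines out of a TiKV log. It assumes the bracketed log layout used by TiDB components (`[timestamp] [LEVEL] [source] [message]`); the function name and the sample lines are made up for illustration and are not taken from this cluster.

```python
import re

# Matches the level field in lines shaped like:
# [2023/06/01 10:00:00.000 +08:00] [ERROR] [file.rs:1] ["message"]
LEVEL_RE = re.compile(r"^\[[^\]]*\]\s+\[(?P<level>[A-Z]+)\]")


def filter_error_lines(lines, levels=("ERROR", "FATAL")):
    """Return only the lines whose level field is in `levels`."""
    picked = []
    for line in lines:
        m = LEVEL_RE.match(line)
        if m and m.group("level") in levels:
            picked.append(line.rstrip("\n"))
    return picked


# Hypothetical sample lines, for illustration only.
sample = [
    '[2023/06/01 10:00:00.000 +08:00] [INFO] [example.rs:1] ["starting election"]',
    '[2023/06/01 10:00:01.000 +08:00] [ERROR] [example.rs:2] ["connection failed"]',
    '[2023/06/01 10:00:02.000 +08:00] [WARN] [example.rs:3] ["slow request"]',
]

errors = filter_error_lines(sample)
for line in errors:
    print(line)
```

To run it against a real log, pass an open file object instead of `sample`; the log path depends on how the cluster was deployed.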

There is an election parameter that can be adjusted, but I can’t remember the exact name.

When scaling out from 3 nodes, how many nodes should you expand to?

Generally, you expand from 3 to 5; 4 might cause a split-brain situation :thinking:
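
For reference on the odd-versus-even question, here is a small sketch of general Raft majority arithmetic. Note that in TiKV the majority is counted over the replicas of each Region (3 by default), not over the number of TiKV nodes, so this illustrates the voting rule rather than a rule about node counts. The function names are illustrative.

```python
def quorum(replicas: int) -> int:
    """Smallest number of replicas that forms a Raft majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can be lost while a majority still remains."""
    return replicas - quorum(replicas)


for n in (3, 4, 5):
    print(f"replicas={n} quorum={quorum(n)} tolerated_failures={tolerated_failures(n)}")
```

By this arithmetic, a fourth voter raises the majority to 3 without tolerating any more failures than 3 voters do, which is why odd counts are the usual choice.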

My colleague has been continuously shutting down and starting nodes, and now it’s like this. After coming to work today, I found that TiDB is normal, but the node I used the scale-in command on is still in the pending offline state… It’s been a long wait. Now I’m checking if the database is available.

The monitoring plugin is still there, and the node I tried to scale down hasn’t successfully gone offline yet… I’m now trying to forcibly shut down the process to see what happens.

They just kept shutting nodes down and starting them up, that’s all… It seems that all three nodes have gone through shutdown and startup. The developers asked for this, and they operated this way during the data compression process. The main issue is that rebalancing had not yet completed while they kept doing it.

There is an error reported on PD.

What does tiup cluster display show now?

I see that quite a few of your components are down, including the monitoring-related ones. Normally, those components do not depend on the state of the cluster, so you should check their logs to see why they are not starting up. Also, elections do not happen at the TiKV node level: the TiDB cluster performs Raft replication at the Region level, and a Region election failure should not cause the TiKV process to fail. So you still need to find out why TiKV is not starting up.

It seems that one of your TiKV nodes has an issue. Try scaling out by adding a TiKV node. Additionally, the monitoring nodes should be fine; try restarting them individually and see.

You only have 3 TiKVs, and one of them is pending offline.
With 3 replicas, there will definitely be constant elections.
You need to add one more TiKV before you can scale down the pending offline one.
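
The reasoning above can be sketched as a simple count. This is a deliberately simplified model, assuming the default of 3 replicas per Region and ignoring capacity, labels, and placement rules; the function names are illustrative, not part of any TiDB tool.

```python
def remaining_stores(total_stores: int, offline_stores: int) -> int:
    """Stores that will still be in service once the offline ones are removed."""
    return total_stores - offline_stores


def scale_in_can_finish(total_stores: int, offline_stores: int,
                        max_replicas: int = 3) -> bool:
    """True if the remaining stores can each hold one replica of every Region.

    Simplified assumption: a store leaves the pending offline state only
    after its replicas have been moved to other stores, which requires at
    least `max_replicas` stores to remain.
    """
    return remaining_stores(total_stores, offline_stores) >= max_replicas


# Situation in this thread: 3 TiKV stores, 1 pending offline.
print(scale_in_can_finish(total_stores=3, offline_stores=1))

# After adding one more TiKV store: 4 stores, 1 pending offline.
print(scale_in_can_finish(total_stores=4, offline_stores=1))
```

With 3 stores and one pending offline, only 2 stores remain for 3 replicas, so the replicas on the offline store have nowhere to go; adding a fourth store gives them a destination.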

Check in PD to see whether the partitions (Regions) are changing.

