After scaling TiKV out and in, the 3 TiKV nodes remain in an election state, and TiDB cannot come up normally

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv 扩所容以后,tikv 3个节点一直处于选举状态,且tidb 也无法正常up

| username: yxf7980

[TiDB Environment] Test
[TiDB Version] 7.1.0
[Reproduction Path] Take the TiKV nodes down one after another at very short intervals, then bring them all back online.
[Encountered Problem] TiKV remains in an election state, and TiDB cannot start normally; the start operation reports that the wait time is too long. After taking one node offline and restarting the cluster, the following cluster status appears.
[Resource Configuration]

After restarting the cluster, other monitoring plugins also failed to start.

| username: linnana | Original post link

The region_id value involved in the election keeps changing according to the logs.

| username: 小龙虾爱大龙虾 | Original post link

:+1: How did it end up like this?

| username: zhanggame1 | Original post link

Is one of the TiKVs not up? Is there an error reported? Check the TiKV logs and look at the error-level logs.
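
One way to pull the error-level entries out of a TiKV log is sketched below. It assumes the log uses the unified TiDB log format, where each line starts with a bracketed timestamp followed by a bracketed level such as `[ERROR]`. The function name and the sample lines are made up for illustration and are not from the cluster in this thread.

```python
import re

# Assumption: each log line looks like
#   [2024/01/01 10:00:00.000 +08:00] [ERROR] [file.rs:1] ["message"] ...
LEVEL_RE = re.compile(r"^\[[^\]]+\]\s+\[(?P<level>[A-Z]+)\]")


def error_lines(lines, levels=("ERROR", "FATAL")):
    """Return the log lines whose level is one of `levels`."""
    matched = []
    for line in lines:
        m = LEVEL_RE.match(line)
        if m and m.group("level") in levels:
            matched.append(line.rstrip("\n"))
    return matched


# Hypothetical sample lines, only here to show the expected shape.
SAMPLE = [
    '[2024/01/01 10:00:00.000 +08:00] [INFO] [a.rs:1] ["starting"]',
    '[2024/01/01 10:00:01.000 +08:00] [ERROR] [b.rs:2] ["connection failed"]',
    '[2024/01/01 10:00:02.000 +08:00] [WARN] [c.rs:3] ["slow"]',
]

if __name__ == "__main__":
    for line in error_lines(SAMPLE):
        print(line)
```

To scan a real log, pass an open file object instead of `SAMPLE`; the log path depends on how the cluster was deployed.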

| username: Jellybean | Original post link

You need to check the running logs of each component, identify any abnormal areas, and post them here.

| username: dba远航 | Original post link

There is an election parameter that can be adjusted, but I can’t remember the exact name.

| username: 江湖故人 | Original post link

How many nodes did you scale out to from the original 3?

| username: 江湖故人 | Original post link

Generally you scale from 3 to 5; with 4 you might get a split-brain situation :thinking:

| username: yxf7980 | Original post link

My colleague kept shutting down and starting nodes, and this is the result. When I came to work today, I found that TiDB is normal, but the node I ran the scale-in command on is still in the Pending Offline state… It has been a long wait. Now I'm checking whether the database is available.

| username: yxf7980 | Original post link

The monitoring plugin is still there, and the node I tried to scale down hasn’t successfully gone offline yet… I’m now trying to forcibly shut down the process to see what happens.
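
A store normally stays in the pending offline state until its Regions have been moved elsewhere, so it can help to check how many Regions it still holds. The sketch below assumes the JSON printed by the pd-ctl `store` command has a `stores` list whose items carry `store.state_name` and `status.region_count`, and that tiup's "Pending Offline" corresponds to the PD state name `Offline`. Both are assumptions to verify against your own output; the sample data and the function name are hypothetical.

```python
def pending_offline(stores_json):
    """Return (id, address, region_count) for stores PD reports as Offline."""
    result = []
    for item in stores_json.get("stores", []):
        store = item.get("store", {})
        status = item.get("status", {})
        if store.get("state_name") == "Offline":
            # region_count may be missing when it is zero, so default to 0.
            result.append(
                (store.get("id"), store.get("address"), status.get("region_count", 0))
            )
    return result


# Hypothetical data shaped like the assumed pd-ctl output.
SAMPLE_STORES = {
    "count": 3,
    "stores": [
        {"store": {"id": 1, "address": "10.0.0.1:20160", "state_name": "Up"},
         "status": {"region_count": 120}},
        {"store": {"id": 4, "address": "10.0.0.2:20160", "state_name": "Up"},
         "status": {"region_count": 120}},
        {"store": {"id": 5, "address": "10.0.0.3:20160", "state_name": "Offline"},
         "status": {"region_count": 120}},
    ],
}

if __name__ == "__main__":
    for store_id, address, regions in pending_offline(SAMPLE_STORES):
        print(f"store {store_id} at {address} still holds {regions} regions")
```

If the Region count of the offline store does not decrease over time, the Regions have nowhere to go, and force-killing the process will not change that.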

| username: yxf7980 | Original post link

They just kept shutting nodes down and starting them up, that's all… It seems that all three nodes were shut down and started again. The developers asked for this, and they did it during the data compression process. The main issue is that rebalancing had not finished yet, and they kept operating like this.

| username: yxf7980 | Original post link

There is an error reported on PD.

| username: zhanggame1 | Original post link

What does `tiup cluster display` currently show?

| username: 小龙虾爱大龙虾 | Original post link

I see that you have quite a few components down, including the monitoring-related ones. Normally, those components are independent of the cluster itself, so you should check their logs to see why they are not starting. Also, TiKV as a component does not hold elections. A TiDB cluster performs Raft replication at the Region level, and a failed Region election should not cause the TiKV process to fail. So you still need to find out why TiKV is not starting.
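
The arithmetic behind a Region election can be sketched as follows. This is plain Raft majority math, not TiKV code, and the function names are made up. With 3 replicas per Region, a Region can still elect a leader when one replica is unavailable but not when two are, which is what can happen when stores are stopped in quick succession.

```python
def quorum(replicas: int) -> int:
    """Smallest number of replicas that forms a Raft majority."""
    return replicas // 2 + 1


def can_elect_leader(replicas: int, unavailable: int) -> bool:
    """True if the replicas that are still reachable form a majority."""
    return replicas - unavailable >= quorum(replicas)


if __name__ == "__main__":
    for down in range(3):
        print(f"3 replicas, {down} unavailable:", can_elect_leader(3, down))
```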

| username: tidb菜鸟一只 | Original post link

It seems that one of your TiKV nodes has an issue. Try scaling out by adding a TiKV node. As for the monitoring nodes, they should be fine; try restarting them individually and see.

| username: 有猫万事足 | Original post link

You only have 3 TiKVs, and one of them is pending offline.
With 3 replicas, there will definitely be constant elections.
You need to add one more TiKV before you can scale down the pending offline one.
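
The reasoning above can be sketched as a small check. It assumes each replica of a Region must live on a different TiKV store and that the replica count is 3, as stated in this reply; the function name is hypothetical. A store that is going offline can only finish once the remaining stores are enough to hold every replica.

```python
def can_finish_offline(total_stores: int, offline_stores: int, max_replicas: int = 3) -> bool:
    """True if the stores that stay in service can hold all replicas.

    Assumption: replicas of one Region are never placed on the same store,
    so at least `max_replicas` stores must remain.
    """
    remaining = total_stores - offline_stores
    return remaining >= max_replicas


if __name__ == "__main__":
    # The situation in this thread: 3 stores, 1 of them pending offline.
    print("3 stores, 1 going offline:", can_finish_offline(3, 1))
    # After adding one more TiKV store.
    print("4 stores, 1 going offline:", can_finish_offline(4, 1))
```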

| username: 麻烦是朋友 | Original post link

Check in PD to see if there are any changes in the partitions.
