Error when running tikv-client: peer is not leader for region 9701, leader may be None

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv-client 运行时报错 peer is not leader for region 9701, leader may None

| username: Doslin

[TiDB Usage Environment] Production Environment

[TiDB Version]
v6.1.0

[Reproduction Path] Operations performed that led to the issue
Initially, there were three nodes: 10.29.0.158:20160, 10.29.0.20:20160, and 10.29.0.21:20160. Later, because the disk on 10.29.0.20 needed to be replaced, the cluster’s max-replicas was set to 2.

tiup ctl:v6.1.0 pd -i -u http://10.29.0.20:2379
config set max-replicas 2
tiup cluster scale-in cluster_name --node 10.29.0.20:20160
tiup cluster scale-in cluster_name --node 10.29.0.20:2379
tiup cluster scale-in cluster_name --node 10.29.0.20:9093
tiup cluster scale-in cluster_name --node 10.29.0.20:3000
tiup cluster scale-in cluster_name --node 10.29.0.20:9090
After 2 minutes, the disk was replaced.

Then, these processes were added back.
tiup cluster scale-out cluster_name --node 10.29.0.20:20160
tiup cluster scale-out cluster_name --node 10.29.0.20:2379
tiup cluster scale-out cluster_name --node 10.29.0.20:9093
tiup cluster scale-out cluster_name --node 10.29.0.20:3000
tiup cluster scale-out cluster_name --node 10.29.0.20:9090
tiup ctl:v6.1.0 pd -i -u http://10.29.0.20:2379
config set max-replicas 3
Replicas were changed to 3.
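
For reference, the documented form of tiup cluster scale-out takes a topology file rather than a --node flag, so the scale-out commands above are likely abridged. A minimal sketch, assuming a file named scale-out.yaml that only adds back the TiKV and PD instances on 10.29.0.20 (the monitoring components would be listed the same way):

# scale-out.yaml (hypothetical contents, based on the original topology)
pd_servers:
  - host: 10.29.0.20
    client_port: 2379
tikv_servers:
  - host: 10.29.0.20
    port: 20160

tiup cluster check cluster_name scale-out.yaml --cluster
tiup cluster scale-out cluster_name scale-out.yaml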

However, no matter what I did with 10.29.0.20:20160, it stayed Offline. Even scale-in --force returned errors. So I added an instance 10.29.0.20:20161 to get back to three replicas.

During this period, I scaled in and scaled out all three PD nodes 10.29.0.20:2379, 10.29.0.21:2379, and 10.29.0.158:2379, but it still didn't work.

Later, I found a suggestion to restart TiKV, but the reload commands also failed.
tiup cluster reload cluster_name --node 10.29.0.158:20160
tiup cluster reload cluster_name --node 10.29.0.21:20160

Reload cluster_name --node 10.29.0.158:20160 error


Detailed debug log

Checking the process status of 10.29.0.21:20160, I found that its --pd argument still lists an old set of PD endpoints. Could this be related?
ps -ef | grep tikv-server
--pd 10.29.0.19:2379,10.29.0.158:2379
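
For reference, the --pd flag only reflects the endpoint list the process was started with (TiKV normally refreshes PD membership at runtime), so the live state can be confirmed from PD itself. A minimal check, assuming the PD at 10.29.0.158:2379 is reachable:

tiup ctl:v6.1.0 pd -u http://10.29.0.158:2379 member   # current PD members
tiup ctl:v6.1.0 pd -u http://10.29.0.158:2379 store    # each store's id, state_name, region_count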

What should I do next?

[Encountered Issue: Symptoms and Impact]
tikv-rust client error

10.29.0.21:20160 process error log

10.29.0.20:20160 process error log

[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]

| username: Fly-bird | Original post link

Can you bring the service up by manually restarting the tikv-server process on that machine?

| username: Doslin | Original post link

@Fly-bird
How do I manually restart tikv-server? I haven’t seen any documentation on this. What should I be aware of when restarting?

| username: h5n1 | Original post link

There was an error in your scale-in process. After scaling in, you need to wait for the regions to migrate away and for the store to become Tombstone. You can refer to the following methods for handling this:
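
For example, the offline store can be watched with pd-ctl until its region count reaches 0 and its state changes from Offline to Tombstone. A sketch, assuming the stuck store's id is 4979 (the id that comes up later in this thread):

tiup ctl:v6.1.0 pd -u http://10.29.0.158:2379 store 4979
# watch status.region_count and store.state_name in the JSON output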

| username: TiDBer_小阿飞 | Original post link

What is the final state of the TiKV store after it went offline? Was the offline process left incomplete?

| username: Doslin | Original post link

It seems it hasn't been completely taken offline; the status shows Pending Offline. It might be caused by an incomplete tikv-server offline process. How can this be resolved? @TiDBer_小阿飞

| username: 像风一样的男子 | Original post link

Isn’t there a solution in the link above?

| username: Doslin | Original post link

Oh, I previously saw in the documentation that it requires restarting the tikv-server. @像风一样的男子

| username: 像风一样的男子 | Original post link

Isn't your problem still with TiKV? What does it have to do with PD?

| username: Doslin | Original post link

Thank you, master. I followed the operation documentation, but there are still issues. Seeking guidance.

The region migration has also completed, but the old store in PD has not disappeared. Seeking guidance @像风一样的男子

The cluster status display also shows it as Offline.

| username: 像风一样的男子 | Original post link

Your operation has significant issues, and the order is incorrect. First, try forcibly scaling in the downed node:

tiup cluster scale-in xx --node xx:20160 --force

| username: Doslin | Original post link

Will a force delete clear the information in INFORMATION_SCHEMA? @像风一样的男子

I used this command and saw there are still 100 regions on the store:

tiup ctl:v6.1.0 pd -u http://ip:2379 region store 4979 > cw.log

Looking at the above image, there are no regions.
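
For reference, a quicker way to track only the remaining region count from that same command, assuming jq is available on the control machine:

tiup ctl:v6.1.0 pd -u http://ip:2379 region store 4979 | jq '.count'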

| username: 像风一样的男子 | Original post link

Then wait for the region count to drop to 0 and see whether the store status changes to Tombstone.
After that, use tiup cluster prune xxx to clean up this TiKV node.
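
A minimal sketch of that cleanup step, reusing the placeholder cluster name from earlier in the thread:

tiup cluster display cluster_name   # confirm the store shows Tombstone
tiup cluster prune cluster_name     # remove Tombstone instances from the topology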

| username: Doslin | Original post link

From the store command, the region count is already 0; it dropped to 0 the day before yesterday.

| username: 像风一样的男子 | Original post link

At this time, it is best to scale out a new TiKV node.

| username: 像风一样的男子 | Original post link

Follow the operations in that article.

| username: Doslin | Original post link

A node has already been added; besides this Offline one, there are three Up nodes.

| username: h5n1 | Original post link

After finding the region_id, try adding a manual scheduling operator to remove its peer from this store. The link to the article is provided earlier.
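
A sketch of that kind of manual scheduling with pd-ctl, assuming region 9701 (from the client error) still has a peer on the stuck store 4979:

tiup ctl:v6.1.0 pd -u http://10.29.0.158:2379 operator add remove-peer 9701 4979
# repeat for each region_id still reported on store 4979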

| username: Doslin | Original post link

Finally, running unsafe remove-failed-stores 4979 in pd-ctl resolved the issue.
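
For anyone hitting the same situation: this is the Online Unsafe Recovery command in pd-ctl, and it forcibly drops the failed store's peers, so it should only be used when the store is confirmed unrecoverable. A sketch with the addresses from this thread:

tiup ctl:v6.1.0 pd -u http://10.29.0.158:2379 unsafe remove-failed-stores 4979
tiup ctl:v6.1.0 pd -u http://10.29.0.158:2379 unsafe remove-failed-stores show   # check progress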

| username: Doslin | Original post link

Thank you @h5n1 @像风一样的男子, you two experts :pray: