Unable to scale down after TiKV decommissioning

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv下线无法缩容掉

| username: 普罗米修斯

[TiDB Usage Environment] Production Environment
[TiDB Version] v3.0.3
[Encountered Problem] Two TiKV nodes have been in the offline (decommissioning) state for 2 days; they are stuck with 40 and 51 regions remaining respectively and cannot complete the offline process.
[Resource Configuration]
2 tidb-servers, 3 pd-servers, 8 tikv-servers


The two TiKV nodes that cannot be taken offline (170150, 170152)

After increasing region-schedule-limit and max-pending-peer-count, the nodes still could not be taken offline. I planned to transfer the regions off the offline TiKV nodes (170150, 170152) to force them offline, so I picked a region on one of them and ran:
operator remove 61557
operator add transfer-region 61557 170151 141001 141002
The commands reported success, but the region was not transferred. Filtering the operator list showed the operation sitting in the queue for a long time.

After waiting a while, the operator disappeared from the queue, but region 61557 still had not been transferred.
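For context, the limit tuning and operator checks described above map onto pd-ctl roughly as in the sketch below; the PD endpoint and the limit values are illustrative, not taken from this cluster:

```shell
# Raise scheduling throughput (values are examples, not recommendations)
pd-ctl -u http://127.0.0.1:2379 -d config set region-schedule-limit 64
pd-ctl -u http://127.0.0.1:2379 -d config set max-pending-peer-count 64

# Inspect the two stores stuck in Offline state
pd-ctl -u http://127.0.0.1:2379 -d store 170150
pd-ctl -u http://127.0.0.1:2379 -d store 170152

# List pending operators; a transfer-region that lingers here and then
# vanishes with no effect usually means the new peers cannot be created
pd-ctl -u http://127.0.0.1:2379 -d operator show

# Show which stores currently hold peers of the region in question
pd-ctl -u http://127.0.0.1:2379 -d region 61557
```

The `-d` flag selects pd-ctl's single-command mode, which is how the v3.0 docs invoke it; the same commands can be run from the interactive prompt.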

| username: 普罗米修斯 | Original post link

Both stores are in an offline state.

| username: tidb菜鸟一只 | Original post link

Is the cluster usable now?

| username: 普罗米修斯 | Original post link

The cluster still isn’t working; queries keep returning [Err] 9005 - Region is unavailable. A few days ago the community suggested taking the TiKV nodes offline, but the process got stuck partway.

| username: Jolyne | Original post link

You can check if this works.

| username: redgame | Original post link

What configuration is provided for version 3?

| username: 普罗米修斯 | Original post link

This doesn’t work. After the operation, it still hasn’t completed the offline process.

| username: Jolyne | Original post link

| username: 普罗米修斯 | Original post link

I had already tried those three approaches and ran unsafe recover many times, but the region count did not go down. These are the regions of the offline TiKV nodes 170150 and 170152:
170150.log (14.7 KB)
170152.log (15.6 KB)
Manually adding operators doesn’t work either. I checked the regions that haven’t been migrated: PD keeps trying to evict their leaders and the execution keeps timing out, and adding peers has never succeeded.
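One way to enumerate which regions still pin a peer to the offline stores is to filter the JSON that pd-ctl’s region command prints. A minimal sketch, assuming output shaped like pd-ctl’s (the sample regions below are made up):

```python
import json

# Made-up sample shaped like `pd-ctl region` output; in practice,
# load the JSON dumped from PD instead.
sample = json.loads("""
{"regions": [
  {"id": 61557, "peers": [{"store_id": 170150}, {"store_id": 141001},
                          {"store_id": 141002}]},
  {"id": 61999, "peers": [{"store_id": 141001}, {"store_id": 141002},
                          {"store_id": 141003}]}
]}
""")

OFFLINE_STORES = {170150, 170152}

def regions_on_offline_stores(data, offline):
    """Return IDs of regions that still keep a peer on an offline store."""
    return [region["id"] for region in data["regions"]
            if any(peer["store_id"] in offline for peer in region["peers"])]

print(regions_on_offline_stores(sample, OFFLINE_STORES))  # → [61557]
```

Regions that keep showing up in this list after scheduling limits are raised are the ones whose peers PD cannot move, which matches the evict-leader timeouts and failed add-peer operators seen here.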

| username: 普罗米修斯 | Original post link

Hybrid deployment

| username: Fly-bird | Original post link

Did you solve it?

| username: 普罗米修斯 | Original post link

Problem solved.

| username: 普罗米修斯 | Original post link

I cleared the PD region cache so the regions would regenerate their replicas, then repaired the indexes.

| username: h5n1 | Original post link

What about the regions that still record the store IDs removed by unsafe recover?

| username: 普罗米修斯 | Original post link

After clearing the cache, some of the regions were regenerated automatically. The ones that weren’t, I recreated manually on the corresponding store nodes. For the indexes, the repair was to drop and rebuild them.
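The index repair described here was a drop-and-rebuild; a minimal SQL sketch of that step, with placeholder table, index, and column names:

```sql
-- Check whether the table data and its indexes are consistent;
-- this errors out on a damaged index
ADMIN CHECK TABLE t;

-- Drop and recreate the damaged index (t / idx_c / c are placeholders)
ALTER TABLE t DROP INDEX idx_c;
ALTER TABLE t ADD INDEX idx_c (c);
```

ADMIN CHECK TABLE is TiDB’s built-in consistency check between row data and index data, so it is a reasonable way to confirm which tables need the rebuild before and after the repair.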

| username: h5n1 | Original post link

Add a third machine.

| username: 普罗米修斯 | Original post link

However, it’s the client’s machine. We can only make suggestions and see if they accept them.

| username: h5n1 | Original post link

Clearing the PD cache processes all regions, not just the remaining few.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.