Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: tikv下线无法缩容掉 (TiKV taken offline but cannot be scaled in)
[TiDB Usage Environment] Production Environment
[TiDB Version] v3.0.3
[Encountered Problem] Two TiKV nodes were taken offline 2 days ago. They are now stuck with 40 and 51 regions remaining, respectively, and cannot finish going offline.
[Resource Configuration]
2 tidb-servers, 3 pd-servers, 8 tikv-servers
The two TiKV stores that cannot be taken offline are 170150 and 170152.

After increasing region-schedule-limit and max-pending-peer-count, the nodes still could not be taken offline. I then planned to manually transfer the regions off the offline TiKV nodes (170150, 170152) so that they could finish going offline. I picked one region on an offline TiKV and ran the following in pd-ctl:
operator remove 61557
operator add transfer-region 61557 170151 141001 141002
The command reported success, but the region was not transferred. Filtering the operator list showed that the operator stayed in the queue for a long time.
After a while the operator disappeared from the queue, and region 61557 had still not been transferred.
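For anyone repeating the same two pd-ctl commands across many stuck regions, here is a minimal sketch that only generates the command text. The function name `transfer_region_commands` is hypothetical, and the region and store IDs are the ones quoted in the post above. It does not talk to PD and does not make the operators succeed; it just avoids typing each pair by hand.

```python
def transfer_region_commands(region_ids, target_stores):
    """Build pd-ctl command lines that move each region onto target_stores.

    Hypothetical helper: it only formats strings in the same shape as the
    commands quoted in the post ("operator remove" followed by
    "operator add transfer-region").
    """
    if not target_stores:
        raise ValueError("at least one target store is required")
    stores = " ".join(str(s) for s in target_stores)
    commands = []
    for region_id in region_ids:
        # Drop any operator already queued for this region, as the poster did.
        commands.append(f"operator remove {region_id}")
        # Ask PD to place the region's peers on the listed stores.
        commands.append(f"operator add transfer-region {region_id} {stores}")
    return commands


if __name__ == "__main__":
    for line in transfer_region_commands([61557], [170151, 141001, 141002]):
        print(line)
```

The printed lines can be pasted into an interactive pd-ctl session. If the operators keep timing out as described here, generating more of them is unlikely to help on its own.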
Both stores are in an offline state.
Is the cluster usable now?
The cluster is still not working; queries still return [Err] 9005 - Region is unavailable. A few days ago, the community suggested taking the TiKV nodes offline, but the process got stuck.
You can check if this works.
What configuration is provided for version 3?
This doesn’t work. After the operation, the stores still have not finished going offline.
I tried those three approaches before and ran the unsafe recovery many times, but the number of regions did not go down. These are the regions on the offline TiKV stores 170150 and 170152:
170150.log (14.7 KB)
170152.log (15.6 KB)
Manually adding operators doesn’t work either.
I checked the regions that haven’t been migrated.
PD keeps trying to evict the leader, and the operation keeps timing out.
Adding peers has also never succeeded.
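One thing worth checking for each unmigrated region is how many of its peers sit on the offline stores, since a Raft group needs a majority of its peers available to move its leader or add a peer. The sketch below works on a region description that has already been parsed from JSON. The field names (`id`, `peers`, `store_id`, `leader`) are an assumption about what `pd-ctl region <id>` prints and should be checked against real output, and the sample region is made up for illustration, not taken from this cluster.

```python
OFFLINE_STORES = {170150, 170152}  # store IDs from this thread


def peers_on_stores(region, store_ids):
    """Return the peers of `region` that live on any store in `store_ids`."""
    return [p for p in region.get("peers", []) if p.get("store_id") in store_ids]


def describe(region, store_ids):
    """Summarise where a region's peers and leader are placed."""
    stuck = peers_on_stores(region, store_ids)
    leader_store = (region.get("leader") or {}).get("store_id")
    return {
        "region_id": region.get("id"),
        "stuck_peer_stores": sorted(p["store_id"] for p in stuck),
        "leader_on_offline_store": leader_store in store_ids,
        "peers_elsewhere": len(region.get("peers", [])) - len(stuck),
    }


if __name__ == "__main__":
    # Hypothetical region, assumed JSON shape; not real data from the cluster.
    sample = {
        "id": 1000,
        "peers": [
            {"id": 1, "store_id": 170150},
            {"id": 2, "store_id": 170152},
            {"id": 3, "store_id": 141001},
        ],
        "leader": {"id": 1, "store_id": 170150},
    }
    print(describe(sample, OFFLINE_STORES))
```

A region whose `peers_elsewhere` count is below a majority of its peers would be a likely candidate for the timeouts described above, though that is a guess and not something the attached logs confirm.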
Clear the PD cache to allow regions to regenerate replicas and repair the index.
What about the region that records the unsafe storeid?
After clearing the cache, some regions are regenerated automatically. For those that are not, recreate them on the respective store node. Then repair the index by deleting and rebuilding it.
However, it’s the client’s machine. We can only make suggestions and see if they accept them.
Clearing the PD cache processes all regions, not just the remaining few.