After Taking a Node Offline, It Stays in Pending Offline and Won't Disappear Even After a Day

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 节点下线后,处于Pending Offline无法消失,等了1天

| username: LBX流鼻血

[TiDB Usage Environment] Production
[TiDB Version] 6.1.0
[Encountered Problem: Phenomenon and Impact] A node ran into an issue and was taken offline, but it has been stuck in Pending Offline status ever since. Checking the relevant information, the store's size is already zero, yet its region_count is still 3499. The Pending Offline status just won't disappear. How can I make it go away?
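
For reference, a minimal way to inspect the stuck store from pd-ctl (the PD address and store ID 4 below are placeholders; substitute your own):

```shell
# Query the offline store's state through tiup ctl / pd-ctl.
tiup ctl:v6.1.0 pd -u http://127.0.0.1:2379 store 4
# In the JSON output, "state_name" shows "Offline" while tiup cluster display
# reports Pending Offline; "region_count" / "leader_count" show how much data
# still has to be migrated away.
```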

| username: h5n1 | Original post link

Region migration is slow; as long as the count keeps decreasing, there's no problem.
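
For example, you can watch whether the count keeps dropping (store ID 4 and the PD address are placeholders):

```shell
# Re-check region_count/leader_count on the offline store every 60 seconds.
watch -n 60 "tiup ctl:v6.1.0 pd -u http://127.0.0.1:2379 store 4 | grep -E 'region_count|leader_count'"
```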

| username: LBX流鼻血 | Original post link

This 3499 has been stuck for a very long time.

| username: h5n1 | Original post link

Read the article and try manual scheduling first.
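
A sketch of what manual scheduling looks like with pd-ctl (store ID 4 and region ID 1000 are placeholders):

```shell
# List the regions that still have a peer on the offline store (store 4).
tiup ctl:v6.1.0 pd -u http://127.0.0.1:2379 region store 4
# Add an operator that moves one region's peer off that store.
tiup ctl:v6.1.0 pd -u http://127.0.0.1:2379 operator add remove-peer 1000 4
```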

| username: Kongdom | Original post link

You can open Grafana to check the progress of the pending-offline store. It may just be that the data volume is large and migration hasn't finished yet; I've seen it stay pending for a whole day. There are parameters that can be adjusted to speed it up.
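
For example, these are the usual knobs (the values and store ID are placeholders; tune them to your hardware):

```shell
# Allow more replica-scheduling operators to run concurrently in PD.
tiup ctl:v6.1.0 pd -u http://127.0.0.1:2379 config set replica-schedule-limit 128
# Raise the remove-peer rate limit on the offline store (store 4).
tiup ctl:v6.1.0 pd -u http://127.0.0.1:2379 store limit 4 200 remove-peer
```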

| username: tidb菜鸟一只 | Original post link

Since there's no leader left on it, can we be bold and use --force to take it offline directly?

| username: LBX流鼻血 | Original post link

@h5n1 @tidb菜鸟一只 Yes, but we don't dare to do that in production. The region_count is stuck at 3499, and manual scheduling doesn't move it; the remaining regions are all empty.

| username: Billmay表妹 | Original post link

Check out this article: 专栏 - tikv下线Pending Offline卡住排查思路 | TiDB 社区

| username: h5n1 | Original post link

Region migration and leader migration are two different things. Even if the leader count is 0, you still have to wait for region_count to reach 0. What needs to be done now is to manually trigger region migration; refer to the article mentioned earlier. As for --force, it only removes the node from tiup; it won't actually delete it from the real cluster, so don't use it lightly.
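
To illustrate the manual trigger, a rough sketch that batch-issues remove-peer operators for every region still counted on the store (assumes jq is installed; store 4 and the PD address are placeholders):

```shell
# For each region that still has a peer on store 4, ask PD to remove that peer.
for region_id in $(tiup ctl:v6.1.0 pd -u http://127.0.0.1:2379 region store 4 | jq -r '.regions[].id'); do
  tiup ctl:v6.1.0 pd -u http://127.0.0.1:2379 operator add remove-peer "$region_id" 4
done
```

And for clarity, this is the --force command being warned against; it only cleans up tiup's own metadata (the cluster name and node address are placeholders):

```shell
# Removes the node from tiup's topology only; the store still exists in PD.
tiup cluster scale-in mycluster --node 192.168.1.10:20160 --force
```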

| username: redgame | Original post link

There have been cases where an abnormal state in the PD cluster caused nodes to fail to go offline properly.

| username: LBX流鼻血 | Original post link

All the counts are 0 now, but the status is still Pending Offline. Also, these two TiKV server processes keep restarting in an endless loop: down, restart, down again, and the logs continuously output errors. Please help take a look, thank you very much.

| username: Jellybean | Original post link

Have you tried the method mentioned by h5n1 here? Check it out and see.

专栏 - TiKV缩容下线异常处理的三板斧 | TiDB 社区.

| username: h5n1 | Original post link

It seems there's a separate thread for your TiKV restart issue. You can stop these two stores and use the Online Unsafe Recovery feature available in version 6.1.
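
A sketch of how that is invoked in v6.1 (store IDs 4 and 5 stand in for the two failed TiKV stores; double-check the IDs first, since this is irreversible):

```shell
# Ask PD to forcibly recover regions whose peers were on the failed stores.
tiup ctl:v6.1.0 pd -u http://127.0.0.1:2379 unsafe remove-failed-stores 4,5
# Monitor progress until it reports finished.
tiup ctl:v6.1.0 pd -u http://127.0.0.1:2379 unsafe remove-failed-stores show
```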

| username: LBX流鼻血 | Original post link

Thank you, boss. Online Unsafe Recovery is very useful.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.