TiKV Scaling Down Stuck in Pending Offline State, Error 9005 Occurs When Querying Data from Certain Tables

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv缩容出现状态一直是Pending Offline,目前发现某些表数据查不到报9005错误

| username: 艾维iii

While scaling in TiKV, the node’s status stays at Pending Offline. Queries against some tables now fail with error 9005.

| username: dba远航 | Original post link

Check the relevant logs to see the details.

| username: xfworld | Original post link

Check whether all region leaders have been transferred off the node that is being taken offline, or whether some still remain on it.
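One way to do this check is to dump the store list with pd-ctl (the `store` command prints JSON) and look at the leader and region counts of any store that is not `Up`. The snippet below is only a rough sketch: the sample data and addresses are made up, and it assumes the JSON has the usual `stores[].store.state_name` and `stores[].status.leader_count` / `region_count` fields. A store that TiUP displays as Pending Offline normally shows up in PD with the state `Offline`.

```python
def stores_not_up(pd_store_output):
    """Return (address, state, leader_count, region_count) for every store
    whose state is not "Up", based on parsed `pd-ctl store` JSON output."""
    rows = []
    for entry in pd_store_output.get("stores", []):
        store = entry.get("store", {})
        status = entry.get("status", {})
        state = store.get("state_name", "")
        if state == "Up":
            continue
        rows.append((
            store.get("address"),
            state,
            status.get("leader_count", 0),
            status.get("region_count", 0),
        ))
    return rows


# Hypothetical sample data, shaped like `pd-ctl store` output (assumption).
sample = {
    "count": 2,
    "stores": [
        {
            "store": {"id": 1, "address": "10.0.0.1:20160", "state_name": "Up"},
            "status": {"leader_count": 800, "region_count": 2400},
        },
        {
            "store": {"id": 4, "address": "10.0.0.3:20160", "state_name": "Offline"},
            "status": {"leader_count": 0, "region_count": 1520},
        },
    ],
}

for address, state, leaders, regions in stores_not_up(sample):
    print(f"{address}: state={state}, leaders={leaders}, regions={regions}")
```

If the leader count is already 0 but the region count is still going down, the node is just waiting for its remaining region peers to be moved away.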

| username: 艾维iii | Original post link

How to check… I’m not very good at it…

| username: 艾维iii | Original post link

[Screenshot not included in the translated post]

| username: songxuecheng | Original post link

The node is still Pending Offline. Check the regions on it to see what is left.

| username: WalterWj | Original post link

:thinking: While a node is going offline, that node has to stay alive. Judging from its heartbeat, it looks like you have already deleted it?

| username: 艾维iii | Original post link

[Screenshot not included in the translated post]

| username: 艾维iii | Original post link

After restarting the cluster, querying the table now reports error 9002, and the TiKV node is still offline.

| username: WalterWj | Original post link

You can check the region health in the PD monitoring to confirm the data status first.
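If Grafana is not handy, roughly the same information can be collected with pd-ctl’s `region check` command (for example `region check miss-peer`, `down-peer`, `pending-peer`). Here is a small sketch that summarizes such results. It assumes each result is JSON with a `count` field, and the numbers in the sample are invented.

```python
def summarize_region_checks(checks):
    """Map each check name to the number of regions it reported.

    `checks` maps a check name (e.g. "miss-peer") to the parsed JSON
    returned by `pd-ctl region check <name>`; a missing or empty
    result is counted as 0.
    """
    return {name: (resp or {}).get("count", 0) for name, resp in checks.items()}


def unhealthy_checks(checks):
    """Return only the checks that reported at least one region."""
    return {name: n for name, n in summarize_region_checks(checks).items() if n > 0}


# Hypothetical results (assumption: shaped like pd-ctl output).
sample_checks = {
    "miss-peer": {"count": 12, "regions": []},
    "down-peer": {"count": 0, "regions": []},
    "pending-peer": None,
}

print(unhealthy_checks(sample_checks))
```

Non-zero miss-peer or down-peer counts while a store is going offline mean some regions do not yet have all of their replicas.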

| username: 艾维iii | Original post link

[Screenshot not included in the translated post]

| username: WalterWj | Original post link

Do you have monitoring set up? That looks like a lot of data.

| username: 艾维iii | Original post link

There is quite a lot. The database holds over 8 billion records, so there is bound to be a lot of data.

| username: 小龙虾爱大龙虾 | Original post link

How did you do it? Share the steps so we can take a look.

| username: tidb菜鸟一只 | Original post link

Normally, taking one TiKV node offline should not affect use of the cluster. Besides, I can see there are no region leaders left on that node, so it should not impact the business. Please share a screenshot of the Region health panel in Grafana.

| username: 艾维iii | Original post link

I found the cause. Because of the high communication frequency, the security group blocked the IP, so TiDB could not reach several of the TiKV nodes. Usage is no longer affected.
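For anyone who runs into something similar, a quick way to rule out this kind of network block is to test from the TiDB host whether the TiKV ports accept TCP connections. A minimal sketch using only the Python standard library follows. The addresses you pass in are your own, and 20160 is the default TiKV port, which may differ in your deployment.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def unreachable(addresses, timeout=3.0):
    """Return the (host, port) pairs from `addresses` that cannot be reached."""
    return [(h, p) for h, p in addresses if not can_connect(h, p, timeout)]
```

For example, `unreachable([("10.0.0.1", 20160), ("10.0.0.3", 20160)])` (made-up addresses) returns the TiKV endpoints the current host cannot reach. A successful TCP connection only shows that the port is open, not that TiKV itself is healthy.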

| username: 艾维iii | Original post link

[Screenshot not included in the translated post]

| username: tidb菜鸟一只 | Original post link

That’s fine then. Just wait for the node to finish going offline normally, or force it offline with --force.

| username: xfworld | Original post link

This issue is quite powerful… :see_no_evil: