bad-ssts reports that an SST file does not exist and prints no repair suggestions. How can this be recovered?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 在做bad-ssts的时候发现文件不存在,没有给出相关提示,需如何操作才能恢复呢?

| username: LBX流鼻血

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] 6.1
[Problem Encountered: Symptoms and Impact]
Queries against the table frequently fail with “ERROR 9005 (HY000): Region is unavailable”. When I ran bad-ssts, it reported that the SST file does not exist and printed no repair suggestions. How should I proceed to fix this?

start to print bad ssts; data_dir:/tidata/hdap/ti-kv/data-20173; db:/tidata/hdap/ti-kv/data-20173/db

corruption info:
/tidata/hdap/ti-kv/data-20173/db/181752.sst: IO error: No such file or directory While opening a file for random read: /tidata/hdap/ti-kv/data-20173/db/181752.sst: No such file or directory

sst meta:
sst 181752 is not found in manifest: Error in processing file /tidata/hdap/ti-kv/data-20173/db/MANIFEST-157104 NotFound: sst 181752 is not in the live files set of the manifest
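
To keep track of what the output above says for each file, the two sections can be summarized with a short script. This is only a sketch: the function name `summarize_bad_ssts` is hypothetical, and the regular expressions assume the line layout shown in this post, which may differ in other TiKV versions. It records what was reported and does not decide how to repair anything.

```python
import re

# Sample text copied from the bad-ssts output quoted in this post.
SAMPLE = """\
corruption info:
/tidata/hdap/ti-kv/data-20173/db/181752.sst: IO error: No such file or directory While opening a file for random read: /tidata/hdap/ti-kv/data-20173/db/181752.sst: No such file or directory

sst meta:
sst 181752 is not found in manifest: Error in processing file /tidata/hdap/ti-kv/data-20173/db/MANIFEST-157104 NotFound: sst 181752 is not in the live files set of the manifest
"""


def summarize_bad_ssts(text):
    """Map each SST number to what the output reported about it.

    Values stay None when the output says nothing about that aspect.
    """
    result = {}

    # "corruption info" lines: "<path>/<number>.sst: <message>"
    for m in re.finditer(r"^(\S+/(\d+)\.sst): (.*)$", text, re.MULTILINE):
        path, number, message = m.group(1), int(m.group(2)), m.group(3)
        entry = result.setdefault(
            number, {"path": None, "missing_on_disk": None, "in_manifest": None}
        )
        entry["path"] = path
        entry["missing_on_disk"] = "No such file or directory" in message

    # "sst meta" lines: "sst <number> is not found in manifest: ..."
    for m in re.finditer(r"^sst (\d+) is not found in manifest", text, re.MULTILINE):
        entry = result.setdefault(
            int(m.group(1)),
            {"path": None, "missing_on_disk": None, "in_manifest": None},
        )
        entry["in_manifest"] = False

    return result


if __name__ == "__main__":
    for number, info in summarize_bad_ssts(SAMPLE).items():
        print(number, info)
```

For the output in this post, the summary shows a single file, 181752, that is both absent from disk and absent from the manifest's live file set.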

| username: h5n1

Scale the node in and then scale it out again; there is no need for anything more complicated.

| username: LBX流鼻血

I have already scaled in and out once, but the problem is still there :cold_sweat:

| username: redgame

How did you do it? This is a big move.

| username: Fly-bird

First remove the newly added nodes, and then scale out again.

| username: 大飞哥online

Did scaling out make the file disappear? :joy:

| username: h5n1

Files are still missing even after scaling out? It may be worth checking whether there is a problem with the file system or the disk.

| username: 像风一样的男子

You can check out this column

| username: h5n1

Where did this error message come from? Check the region status:
pd-ctl region xxxx
tikv-ctl --host xxx:20160 raft region -r XXXX

| username: LBX流鼻血

This error occurs when querying a table or running a count(). Previously it reported “Region is unavailable”; now it reports a long message like this:
/* SQL Error (1105): no available peers, region: {id:733765 start_key:"t\200\000\000\000\000\000\005j_i\200\000\000\000\000\000\000\001\003\200\000\000\000\004BD2" end_key:"t\200\000\000\000\000\000\005j_i\200\000\000\000\000\000\000\001\003\200\000\000\000\004M_\232" region_epoch:<conf_ver:557 version:7266 > peers:<id:733768 store_id:18 > peers:<id:3086601 store_id:8 > peers:<id:14470098 store_id:7 > } */

I am now scaling the problematic nodes in and out one by one, and will then check again.
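
The region id and the store ids in the error above are the values needed for the pd-ctl and tikv-ctl region checks suggested earlier in the thread. A minimal sketch for pulling them out of the message follows; the function name `parse_region_error` is hypothetical, the regular expressions assume the message layout shown in this post, and the start and end keys are shortened to "..." in the sample for readability.

```python
import re

# Error text from this post; start_key and end_key are shortened to "...".
ERROR = (
    'no available peers, region: {id:733765 start_key:"..." end_key:"..." '
    "region_epoch:<conf_ver:557 version:7266 > "
    "peers:<id:733768 store_id:18 > "
    "peers:<id:3086601 store_id:8 > "
    "peers:<id:14470098 store_id:7 > }"
)


def parse_region_error(message):
    """Extract the region id, epoch, and (peer_id, store_id) pairs."""
    region = re.search(r"region: \{id:(\d+)", message)
    if region is None:
        raise ValueError("no region id found in message")

    epoch = re.search(r"conf_ver:(\d+) version:(\d+)", message)
    peers = re.findall(r"peers:<id:(\d+) store_id:(\d+)", message)

    return {
        "region_id": int(region.group(1)),
        "conf_ver": int(epoch.group(1)) if epoch else None,
        "version": int(epoch.group(2)) if epoch else None,
        "peers": [(int(peer_id), int(store_id)) for peer_id, store_id in peers],
    }


if __name__ == "__main__":
    info = parse_region_error(ERROR)
    print("region:", info["region_id"])
    for peer_id, store_id in info["peers"]:
        print("peer", peer_id, "on store", store_id)
```

For this message the region is 733765, with peers on stores 18, 8, and 7, so those are the stores whose replicas would need to be inspected.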

| username: 像风一样的男子

Generally, this error indicates a disk issue. You can try to repair it. If it can’t be repaired, locate the problematic replica of the affected region and remove all of that replica's KV data.