Abnormal TiKV Startup

One TiKV node fails to start with the following error:

[2024/05/27 16:07:51.797 +08:00] [INFO] [store.rs:925] ["region is applying snapshot"] [store_id=4] [region="id: 25193 start_key: 7480000000000000FF365F698000000000FF0000040380000000FF0000000004000000FF006082F5FF038000FF000004EEBEC50000FD end_key: 7480000000000000FF365F698000000000FF0000040380000000FF0000000004000000FF0060830014038000FF00000523062B0000FD region_epoch { conf_ver: 1325 version: 522 } peers { id: 12589504 store_id: 5 } peers { id: 12634220 store_id: 4 } peers { id: 13778883 store_id: 2134051 }"]
[2024/05/27 16:07:51.797 +08:00] [INFO] [peer.rs:180] ["create peer"] [peer_id=12634220] [region_id=25193]
[2024/05/27 16:07:51.799 +08:00] [FATAL] [server.rs:590] ["failed to start node: Other(\"[components/raftstore/src/store/peer_storage.rs:504]: [region 25193] entry at 33014 doesn\\'t exist, may lose data.\")"]

Region information:

» region 25193
{
  "id": 25193,
  "start_key": "7480000000000000FF365F698000000000FF0000040380000000FF0000000004000000FF006082F5FF038000FF000004EEBEC50000FD",
  "end_key": "7480000000000000FF365F698000000000FF0000040380000000FF0000000004000000FF0060830014038000FF00000523062B0000FD",
  "epoch": {
    "conf_ver": 1338,
    "version": 522
  },
  "peers": [
    {
      "id": 13778883,
      "store_id": 2134051
    },
    {
      "id": 17455504,
      "store_id": 5
    }
  ],
  "leader": {
    "id": 13778883,
    "store_id": 2134051
  },
  "written_bytes": 0,
  "read_bytes": 0,
  "written_keys": 0,
  "read_keys": 0,
  "approximate_size": 47,
  "approximate_keys": 661611
}
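
It can help to put the region description from the startup log next to the one PD returns. The sketch below is only an illustration: it copies the epoch and peer values from the two outputs above (keys omitted) and lists the peers that the faulty node still records but PD no longer does. The function names are made up for this example.

```python
import re

# Region description copied from the "region is applying snapshot" log line
# on the faulty node (keys omitted; only epoch and peers matter here).
LOG_REGION = (
    "region_epoch { conf_ver: 1325 version: 522 } "
    "peers { id: 12589504 store_id: 5 } "
    "peers { id: 12634220 store_id: 4 } "
    "peers { id: 13778883 store_id: 2134051 }"
)

# Epoch and peers as PD reports them in the region output above.
PD_REGION = {
    "epoch": {"conf_ver": 1338, "version": 522},
    "peers": [
        {"id": 13778883, "store_id": 2134051},
        {"id": 17455504, "store_id": 5},
    ],
}


def peers_from_log(text):
    """Return {store_id: peer_id} parsed from a TiKV log region description."""
    pairs = re.findall(r"peers \{ id: (\d+) store_id: (\d+) \}", text)
    return {int(store): int(peer) for peer, store in pairs}


def conf_ver_from_log(text):
    """Return the conf_ver recorded in the log's region_epoch."""
    return int(re.search(r"conf_ver: (\d+)", text).group(1))


def stale_peers(log_text, pd_region):
    """Peers the local node still records but PD no longer lists."""
    local = peers_from_log(log_text)
    current = {p["store_id"]: p["id"] for p in pd_region["peers"]}
    return {s: p for s, p in local.items() if current.get(s) != p}


if __name__ == "__main__":
    print("local conf_ver:", conf_ver_from_log(LOG_REGION))
    print("PD conf_ver:   ", PD_REGION["epoch"]["conf_ver"])
    print("stale peers (store_id -> peer_id):", stale_peers(LOG_REGION, PD_REGION))
```

With the values above, the local copy is at conf_ver 1325 while PD is at 1338, and PD lists no peer for this region on store 4. That suggests the copy on the faulty node is outdated, although it does not by itself explain the missing raft entry.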

Does remove-peer need to be run for every store_id?
Also, running tikv-ctl --db /path/to/tikv/db tombstone -p -r <region_id> on the faulty TiKV machine fails with the following error:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: RocksDb("IO error: No such file or directoryWhile opening a file for sequentially reading: /cloud/data5/tikv-20160/CURRENT: No such file or directory")', src/libcore/result.rs:1188:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
Error: exit status 101
Error: run `/root/.tiup/components/ctl/v4.0.10/ctl` (wd:/root/.tiup/data/UDxZpT0) failed: exit status 1

I could not find any CURRENT-related directory on the healthy nodes either.
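
One possible reading of the panic: RocksDB keeps a file named CURRENT at the top level of each database directory, and TiKV by default places its KV RocksDB instance in a db subdirectory of the data directory. The error shows the lookup went to /cloud/data5/tikv-20160/CURRENT, so, assuming /cloud/data5/tikv-20160 is the TiKV data directory, --db may have been pointed one level too high. A minimal sketch for checking this before running tikv-ctl (the helper name is hypothetical):

```python
import os
import tempfile


def find_rocksdb_dir(path):
    """Return the directory under `path` that holds a RocksDB CURRENT file.

    Checks `path` itself first, then `path/db`, which is where TiKV keeps its
    KV RocksDB instance by default. Returns None if neither has a CURRENT file.
    """
    for candidate in (path, os.path.join(path, "db")):
        if os.path.isfile(os.path.join(candidate, "CURRENT")):
            return candidate
    return None


if __name__ == "__main__":
    # Build a throwaway directory that mimics the assumed layout:
    # <data-dir>/db/CURRENT
    with tempfile.TemporaryDirectory() as data_dir:
        os.makedirs(os.path.join(data_dir, "db"))
        open(os.path.join(data_dir, "db", "CURRENT"), "w").close()
        print("pass this to --db:", find_rocksdb_dir(data_dir))
```

On a real node, CURRENT is a regular file rather than a directory, which may be why a search for a CURRENT directory finds nothing.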

Why is this happening? What is the current state of the entire cluster? Is it just one TiKV that is not working?

I don’t know why this one TiKV failed to start; the error is shown above. All other nodes and components in the cluster are normal.

The cluster has 5 TiKV nodes. Now one of them cannot start. Can I directly scale it down and then scale it back up? I tried moving the directory directly, but it still doesn’t start.

Sure, scale in first and then scale out. It’s best to evict the region leaders beforehand: manually evict the leaders on the bad node, then scale in.
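
The order suggested above could be sketched as follows. This only builds the command lines and runs nothing. The cluster name, PD address, and node address are placeholders; store ID 4 and version v4.0.10 are taken from the log and the tiup path earlier in the thread.

```python
def scale_in_plan(cluster, pd_addr, store_id, node, ctl_version="v4.0.10"):
    """Return the shell commands for evicting leaders, then scaling in a node."""
    pd_ctl = f"tiup ctl:{ctl_version} pd -u {pd_addr}"
    return [
        # 1. Ask PD to move all region leaders off the bad store.
        f"{pd_ctl} scheduler add evict-leader-scheduler {store_id}",
        # 2. Inspect the store; leader_count should drop to 0.
        f"{pd_ctl} store {store_id}",
        # 3. Remove the node from the cluster.
        f"tiup cluster scale-in {cluster} --node {node}",
        # 4. Check the node's status until the scale-in has finished.
        f"tiup cluster display {cluster}",
    ]


if __name__ == "__main__":
    # All four argument values except the store ID are hypothetical.
    for cmd in scale_in_plan("my-cluster", "http://127.0.0.1:2379", 4, "10.0.0.4:20160"):
        print(cmd)
```

If the TiKV process is already down, it no longer holds leaders, so the first two steps may be a no-op in that case.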

The directory you moved with “mv” probably doesn’t have the right permissions.

If the cluster is functioning normally, you can scale down this node and then scale it back up.

Bookmark this; you can refer to it the next time you encounter the same problem.

Awkward. Now there’s this issue when exporting data with Dumpling. Is there any solution?

Has the scaling up and down completely finished?

I feel like your business data is lost.

The original process was already down. Scaling down was completed quickly, and scaling up was also finished quickly, but the data appears to be very unbalanced.

Additionally, TiKV frequently becomes inaccessible, which seems to be causing the errors. This doesn’t happen every time.

You have only executed the command, but the regions have not been fully migrated. The expansion is considered complete only after all regions have been fully migrated and data balance has been achieved.
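
One rough way to watch the migration is to compare region counts per store in the JSON printed by pd-ctl's store command. The sketch below assumes that output has been saved as a string; the sample data is made up (only store IDs 1, 5, and 2134051 resemble the ones in this thread, and 9000001 stands in for the newly added node).

```python
import json

# Made-up sample in the shape printed by pd-ctl's `store` command,
# reduced to the fields this sketch reads.
SAMPLE_STORES = """
{
  "count": 4,
  "stores": [
    {"store": {"id": 1, "state_name": "Up"},
     "status": {"leader_count": 900, "region_count": 2700}},
    {"store": {"id": 5, "state_name": "Up"},
     "status": {"leader_count": 880, "region_count": 2650}},
    {"store": {"id": 2134051, "state_name": "Up"},
     "status": {"leader_count": 910, "region_count": 2720}},
    {"store": {"id": 9000001, "state_name": "Up"},
     "status": {"leader_count": 40, "region_count": 310}}
  ]
}
"""


def region_counts(stores_json):
    """Return {store_id: region_count} for every store reported as Up."""
    data = json.loads(stores_json)
    return {
        item["store"]["id"]: item["status"].get("region_count", 0)
        for item in data["stores"]
        if item["store"].get("state_name") == "Up"
    }


def region_spread(counts):
    """Difference between the fullest and emptiest store, in regions."""
    return max(counts.values()) - min(counts.values())


if __name__ == "__main__":
    counts = region_counts(SAMPLE_STORES)
    for store_id, count in sorted(counts.items()):
        print(f"store {store_id}: {count} regions")
    print("spread:", region_spread(counts))
```

Region count is only a proxy, since PD balances on a score that also reflects region size and store capacity, but a spread that keeps shrinking between runs indicates the migration is still in progress.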

Is there any way to speed up this region migration? It seems to be running very slowly automatically.

