Abnormal TiKV Startup

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv 启动异常

| username: rebelsre

【TiDB Usage Environment】Production Environment / Testing / PoC
【TiDB Version】4.0.10
【Reproduction Path】
【Encountered Problem: Problem Phenomenon and Impact】
TiKV fails to start, with the following error:

[2024/05/27 16:07:51.797 +08:00] [INFO] [store.rs:925] ["region is applying snapshot"] [store_id=4] [region="id: 25193 start_key: 7480000000000000FF365F698000000000FF0000040380000000FF0000000004000000FF006082F5FF038000FF000004EEBEC50000FD end_key: 7480000000000000FF365F698000000000FF0000040380000000FF0000000004000000FF0060830014038000FF00000523062B0000FD region_epoch { conf_ver: 1325 version: 522 } peers { id: 12589504 store_id: 5 } peers { id: 12634220 store_id: 4 } peers { id: 13778883 store_id: 2134051 }"]
[2024/05/27 16:07:51.797 +08:00] [INFO] [peer.rs:180] ["create peer"] [peer_id=12634220] [region_id=25193]
[2024/05/27 16:07:51.799 +08:00] [FATAL] [server.rs:590] ["failed to start node: Other(\"[components/raftstore/src/store/peer_storage.rs:504]: [region 25193] entry at 33014 doesn\\'t exist, may lose data.\")"]

Region information:

» region 25193
{
  "id": 25193,
  "start_key": "7480000000000000FF365F698000000000FF0000040380000000FF0000000004000000FF006082F5FF038000FF000004EEBEC50000FD",
  "end_key": "7480000000000000FF365F698000000000FF0000040380000000FF0000000004000000FF0060830014038000FF00000523062B0000FD",
  "epoch": {
    "conf_ver": 1338,
    "version": 522
  },
  "peers": [
    {
      "id": 13778883,
      "store_id": 2134051
    },
    {
      "id": 17455504,
      "store_id": 5
    }
  ],
  "leader": {
    "id": 13778883,
    "store_id": 2134051
  },
  "written_bytes": 0,
  "read_bytes": 0,
  "written_keys": 0,
  "read_keys": 0,
  "approximate_size": 47,
  "approximate_keys": 661611
}

【Resource Configuration】
【Attachments: Screenshots/Logs/Monitoring】
Referring to the document TiKV Control Usage Instructions (PingCAP archived documentation site):
Does remove-peer need to be executed for every store_id?
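(For reference, pd-ctl's remove-peer operator acts on one Region/store pair per invocation, so it is issued once per store that should drop the peer; a minimal sketch, with the PD address assumed and the Region/store IDs taken from the log above:)

pd-ctl -u http://127.0.0.1:2379 operator add remove-peer 25193 4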
Secondly, executing tikv-ctl --db /path/to/tikv/db tombstone -p 127.0.0.1:2379 -r <region_id> on the faulty TiKV machine results in the following error:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: RocksDb("IO error: No such file or directoryWhile opening a file for sequentially reading: /cloud/data5/tikv-20160/CURRENT: No such file or directory")', src/libcore/result.rs:1188:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
Error: exit status 101
Error: run `/root/.tiup/components/ctl/v4.0.10/ctl` (wd:/root/.tiup/data/UDxZpT0) failed: exit status 1

I can't find any CURRENT file or related directory on the healthy nodes either.
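(One possible explanation, offered as an assumption: CURRENT is a RocksDB bookkeeping file that lives inside the RocksDB directory itself, and TiKV normally keeps its RocksDB in a db subdirectory of the data directory, so --db may be pointed one level too high. A sketch, where the db suffix is the assumed part:)

# Check whether the RocksDB directory is one level down
ls /cloud/data5/tikv-20160/db/CURRENT

# If it is, point --db at that directory (with the TiKV process stopped)
tikv-ctl --db /cloud/data5/tikv-20160/db tombstone -p 127.0.0.1:2379 -r 25193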

| username: 小龙虾爱大龙虾 | Original post link

Why is this happening? What is the current state of the entire cluster? Is it just one TiKV that is not working?

| username: rebelsre | Original post link

I don’t know why this one TiKV fails to start; the error is shown above. All other nodes and components in the cluster are normal.

| username: rebelsre | Original post link

The cluster has 5 TiKV nodes, and now one of them cannot start. Can I just scale it in and then scale it back out? I tried moving the data directory aside, but it still doesn’t start.

| username: xfworld | Original post link

Sure, scale in first and then scale out. It’s best to evict the Region leaders first: manually evict the leaders from the bad node, and then scale it in.
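(A minimal sketch of that order with pd-ctl and tiup; the PD address, cluster name, and node address are placeholders, and since the process is already down its leaders have in practice already moved elsewhere:)

# Evict Region leaders from the bad store first (store_id 4, per the log)
pd-ctl -u http://127.0.0.1:2379 scheduler add evict-leader-scheduler 4

# Then scale the node in, and scale a replacement back out
tiup cluster scale-in <cluster-name> --node <bad-tikv-host>:20160
tiup cluster scale-out <cluster-name> scale-out.yaml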

| username: Kongdom | Original post link

The directory you moved it to with mv probably doesn’t have sufficient permissions.

| username: tidb菜鸟一只 | Original post link

If the cluster is otherwise functioning normally, you can scale this node in and then scale it back out.

| username: zhaokede | Original post link

Bookmarking this; it can be referred to the next time the same problem comes up.

| username: rebelsre | Original post link

This is awkward: now Dumpling hits this error when exporting data. Is there any solution?

| username: 像风一样的男子 | Original post link

Has the scale-in and scale-out fully completed?

| username: tidb狂热爱好者 | Original post link

I feel like your business data is lost.

| username: rebelsre | Original post link

The original process was already down. The scale-in completed quickly and the scale-out also finished quickly, but the data now looks very unbalanced.


Additionally, TiKV frequently becomes inaccessible, which seems to be causing the errors. This doesn’t happen every time.

| username: 像风一样的男子 | Original post link

You have only executed the command; the Regions have not finished migrating yet. The scale-out is only considered complete once all Regions have been migrated and the data is balanced.
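(One way to check progress is to compare per-store counts in pd-ctl; PD address assumed:)

pd-ctl -u http://127.0.0.1:2379 store
# Compare region_count and leader_count across stores in the output;
# the scale-out is done when the new store has caught up with the rest.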

| username: rebelsre | Original post link

Is there any way to speed up this Region migration? The automatic scheduling seems to be running very slowly.

| username: Kongdom | Original post link

You can refer to the method in the column article.
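(For reference, the usual levers in v4.0 are PD’s scheduling limits; the values below are examples only and should be raised gradually while watching cluster load:)

pd-ctl -u http://127.0.0.1:2379 config set region-schedule-limit 64
pd-ctl -u http://127.0.0.1:2379 config set replica-schedule-limit 64
pd-ctl -u http://127.0.0.1:2379 store limit all 30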