TiKV cannot find store

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv找不到store

| username: TiDBer_B3GcBY89

【TiDB Usage Environment】Production Environment
【TiDB Version】
【Reproduction Path】
【Encountered Problem: Problem Phenomenon and Impact】
【Resource Configuration】
【Attachments: Screenshots/Logs/Monitoring】
A power outage in the data center caused TiKV errors:
[2023/11/21 13:59:44.969 +08:00] [WARN] [endpoint.rs:780] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 9942146, leader may None\" not_leader { region_id: 9942146 }"]
[2023/11/21 13:59:45.447 +08:00] [ERROR] [raft_client.rs:796] ["resolve store address failed"] [err_code=KV:Unknown] [err="Other(\"[src/server/resolve.rs:100]: unknown error \"[components/pd_client/src/util.rs:878]: invalid store ID 49005, not found\"\")"] [store_id=49005]
The TiDB service cannot start, so the database is unavailable.
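
The second log line is the key one: TiKV asks PD to resolve the address of store 49005, but PD has no record of that store ID, which usually means some region peer still references a store that PD has already removed. A minimal check with pd-ctl, assuming PD is reachable at <pd-host>:2379:

# Ask PD whether it still has a record of store 49005 (the ID from the error above)
pd-ctl -u http://<pd-host>:2379 store 49005

# List the stores PD currently knows about, to compare against the ID TiKV is resolving
pd-ctl -u http://<pd-host>:2379 store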

| username: 像风一样的男子 | Original post link

What is the cluster status? Check it with tiup display; it looks like there is a disk issue.
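
For a tiup-managed deployment that would be the command below (the cluster name is a placeholder); for a TiDB Operator deployment, the equivalent is inspecting the TidbCluster object and pods with kubectl:

# tiup-managed cluster: show topology and per-node status (cluster name is a placeholder)
tiup cluster display <cluster-name>

# TiDB Operator cluster: the equivalent overview
kubectl get tidbcluster -n advanced-tidb advanced-tidb
kubectl get pods -n advanced-tidb -o wide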

| username: zhanggame1 | Original post link

Let’s see how many nodes are displayed.

| username: dba远航 | Original post link

It looks like stale TiKV (region) information in PD caused this.
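
If that is the case, the region from the log is a good place to look: its peer list shows which store IDs PD still expects it to live on. A sketch with pd-ctl, assuming the same PD endpoint as above:

# Show region 9942146 and check whether its peers reference store IDs PD no longer tracks
pd-ctl -u http://<pd-host>:2379 region 9942146

# List regions that have peers reported as down
pd-ctl -u http://<pd-host>:2379 region check down-peer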

| username: TiDBer_B3GcBY89 | Original post link

I tried unsafe remove-failed-stores with the IDs of the 2 stores; the operations were interrupted at 5/6 and 6/7 respectively. I also tried unsafe-recover remove-fail-stores and recreate-region, but the cluster still could not start :face_with_head_bandage:
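
For reference, a sketch of the two recovery paths mentioned above, assuming pd-ctl can reach PD at <pd-host>:2379; the store IDs and data directory are placeholders, and the tikv-ctl local-mode command requires the TiKV process on that node to be stopped first:

# Online unsafe recovery through PD (v5.3+): remove the failed stores and watch the per-stage progress
pd-ctl -u http://<pd-host>:2379 unsafe remove-failed-stores <store-id-1>,<store-id-2>
pd-ctl -u http://<pd-host>:2379 unsafe remove-failed-stores show

# Older local-mode path: run on each surviving TiKV node with the TiKV process stopped
tikv-ctl --data-dir /path/to/tikv-data unsafe-recover remove-fail-stores -s <store-id-1>,<store-id-2> --all-regions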

| username: TiDBer_B3GcBY89 | Original post link

Originally there were 5 nodes, including 2 offline nodes. When the power went out, I scaled the cluster down to 3 nodes.

kubectl get -n advanced-tidb tidbcluster advanced-tidb -ojson | jq '.status.tikv.stores'
{
  "1": {
    "id": "1",
    "ip": "advanced-tidb-tikv-0.advanced-tidb-tikv-peer.advanced-tidb.svc",
    "lastTransitionTime": "2023-11-21T05:45:05Z",
    "leaderCount": 113,
    "podName": "advanced-tidb-tikv-0",
    "state": "Up"
  },
  "4": {
    "id": "4",
    "ip": "advanced-tidb-tikv-1.advanced-tidb-tikv-peer.advanced-tidb.svc",
    "lastTransitionTime": "2023-11-21T05:48:04Z",
    "leaderCount": 93,
    "podName": "advanced-tidb-tikv-1",
    "state": "Up"
  },
  "5": {
    "id": "5",
    "ip": "advanced-tidb-tikv-2.advanced-tidb-tikv-peer.advanced-tidb.svc",
    "lastTransitionTime": "2023-11-21T05:46:50Z",
    "leaderCount": 112,
    "podName": "advanced-tidb-tikv-2",
    "state": "Up"
  }
}
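
The TidbCluster status above only lists the three Up stores (IDs 1, 4, 5), while the TiKV error references store 49005, so it is worth asking PD directly which store IDs it still tracks. A sketch via the PD HTTP API; the service name and the port-forward approach are assumptions:

# Forward the PD client port locally (service name is an assumption)
kubectl port-forward -n advanced-tidb svc/advanced-tidb-pd 2379:2379 &

# List the stores PD currently tracks, with their states
curl -s http://127.0.0.1:2379/pd/api/v1/stores | \
  jq '.stores[] | {id: .store.id, address: .store.address, state: .store.state_name}'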

| username: andone | Original post link

Take a look at the display output.

| username: Fly-bird | Original post link

Has the scale-down operation completed? Check the time on each node again.
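
One way to check whether the scale-in of the two removed nodes actually completed is to query their store IDs directly: a store that has finished scaling in shows as Tombstone rather than Offline. A sketch reusing the port-forward above; the pd-ctl binary path inside the PD pod is an assumption:

# Per-store view of one of the removed store IDs (49005 is taken from the error above)
curl -s http://127.0.0.1:2379/pd/api/v1/store/49005

# Alternative from inside the PD pod (path to pd-ctl is an assumption)
kubectl exec -n advanced-tidb advanced-tidb-pd-0 -- /pd-ctl store 49005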

| username: dba远航 | Original post link

What are the final inspection results?

| username: Kongdom | Original post link

:thinking: I strongly recommend having a UPS in the server room; we’ve suffered twice from not having one~