Help Needed: Issues Encountered During Disaster Drills for PD and TiKV

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 求助:对PD,TIKV灾难演练碰到的一些问题

| username: Hacker_8fNHonIE

[TiDB Usage Environment]
Test Environment
[TiDB Version]
v4.0.13
[Reproduction Path]
I use PD and TiKV to store some cluster ID information and KV data, respectively. I recently ran a disaster recovery drill, using pd-recover and tikv-br for backup and recovery. I ran into some issues during testing and could not find a solution in the parts of the documentation I have read. I hope someone can help, thank you.
Disaster Recovery Scenario 1: Complete data loss of TiKV
Solution: Use tikv-br for remote data backup and recovery
Testing Process:

  1. Create a three-node cluster and write 5 KV pairs. The TiKV store IDs are 1, 4, and 5.
  2. Stop all TiKV services and delete the TiKV data directories.
  3. Restart the TiKV services. The restart fails. Judging from the logs, the old TiKV stores are down but were not converted to Tombstone before being removed.
  4. Stop all TiKV services and set each old store to Tombstone with curl -X POST "http://192.168.3.40:16000/pd/api/v1/store/$storeid/state?state=Tombstone"
  5. Restart the PD services, then start the TiKV services. TiKV starts successfully, but the store IDs are now 8, 9, and 10.
  6. Because the store IDs are now 8, 9, and 10, tikv-br restore cannot be performed on these nodes. The restore requires the original store IDs 1, 4, and 5.
  7. How can I start a new TiKV with the original store ID? Or can tikv-br restore to the new store IDs?
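
The Tombstone step above can be scripted as a loop over the old store IDs. This is a minimal sketch, assuming the PD address and the store IDs 1, 4, and 5 from this post. The names PD_ADDR, DRY_RUN, tombstone_url, and mark_tombstone are hypothetical helpers, not part of PD. By default the script only prints the curl commands instead of sending them.

```shell
#!/usr/bin/env bash
# PD address taken from the post; override PD_ADDR for another cluster.
PD_ADDR="${PD_ADDR:-http://192.168.3.40:16000}"
# DRY_RUN=1 (default) only prints the commands; set DRY_RUN=0 to send them.
DRY_RUN="${DRY_RUN:-1}"

# Build the PD API URL that sets one store's state to Tombstone.
tombstone_url() {
  printf '%s/pd/api/v1/store/%s/state?state=Tombstone' "$PD_ADDR" "$1"
}

# Mark every store ID passed as an argument as Tombstone.
mark_tombstone() {
  for id in "$@"; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "curl -X POST '$(tombstone_url "$id")'"
    else
      curl -sS -X POST "$(tombstone_url "$id")"
    fi
  done
}

# Old store IDs from the drill (assumption: these are the lost stores).
mark_tombstone 1 4 5
```

Reviewing the printed commands before setting DRY_RUN=0 is a deliberate choice here, since forcing a store to Tombstone is hard to undo.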

Disaster Recovery Scenario 2: Complete data loss of PD
Solution: Use pd-recover to repair and restore the original cluster ID

  1. Stop the PD services and delete the PD data directory.
  2. Start the PD services. PD starts successfully and initializes a new cluster ID.
  3. Execute ./pd-recover --endpoints http://192.168.3.40:16000 --cluster-id <original cluster ID> --alloc-id=01
  4. The cluster ID is restored. However, I had also written some cluster ID information into PD's data. Can that data be backed up and restored?
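
Before running pd-recover, the original cluster ID and a suitable alloc ID have to be known. One way to find them, assuming the PD log files survived the loss of the data directory, is to search the logs for the "init cluster id" and "idAllocator allocates a new id" messages. The sketch below does that. The function names and the sample log lines are illustrative, and the exact log format may differ between versions. As far as I know, the alloc ID should be larger than the largest ID already allocated, so a small value such as 01 may be too low. The +1000 margin used here is an arbitrary assumption.

```shell
#!/usr/bin/env bash
# Print the most recent cluster ID found in PD log lines read from stdin.
extract_cluster_id() {
  grep 'init cluster id' \
    | sed -n 's/.*cluster-id=\([0-9][0-9]*\).*/\1/p' \
    | tail -n 1
}

# Print the largest allocated ID found in PD log lines read from stdin.
extract_max_alloc_id() {
  grep 'idAllocator allocates a new id' \
    | sed -n 's/.*=\([0-9][0-9]*\)].*/\1/p' \
    | sort -n \
    | tail -n 1
}

# Illustrative log lines in an assumed format.
# For a real cluster, pipe in the actual logs instead, e.g. cat /path/to/pd*.log
sample_log='[2019/10/14 10:35:38.880 +00:00] [INFO] [server.go:212] ["init cluster id"] [cluster-id=6747551640615446306]
[2019/10/15 03:15:05.824 +00:00] [INFO] [id.go:91] ["idAllocator allocates a new id"] [alloc-id=1000]
[2019/10/15 08:55:38.486 +00:00] [INFO] [id.go:91] ["idAllocator allocates a new id"] [alloc-id=2000]'

cluster_id="$(printf '%s\n' "$sample_log" | extract_cluster_id)"
max_alloc_id="$(printf '%s\n' "$sample_log" | extract_max_alloc_id)"

# Print (do not run) a pd-recover command with an alloc ID above the largest one seen.
echo "./pd-recover --endpoints http://192.168.3.40:16000 --cluster-id ${cluster_id} --alloc-id $((max_alloc_id + 1000))"
```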

[Encountered Issues: Problem Phenomenon and Impact]

  1. After TiKV's store IDs have changed, how can tikv-br be used for recovery?
  2. When PD's data is lost, can the PD data itself be backed up and restored, in addition to restoring the PD cluster ID?
| username: Kongdom

:thinking: For scenario one, it looks like you can create a new cluster and restore into it. BR restore is meant for restoring into a brand-new cluster.

| username: Hacker_8fNHonIE

Alright, thank you Dagu. I’ll go take a look.