Ask a question: If I have a three-replica TiDB and the PD node is lost, and TiKV is damaged, how do I recover the data?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 提一个问题 如果我三副本的tidb,pd节点丢失。tikv有损坏。怎么恢复数据?

| username: tidb狂热爱好者

If my TiDB with three replicas loses a PD node and a TiKV node is damaged, how can I recover the data? In other words, can the data from a three-replica TiKV be recovered from a single node?

| username: TiDB_C罗 | Original post link

With three replicas and three TiKV nodes, each node holds one replica of every Region, i.e. each node carries the same data as the other two, so theoretically it can be restored.
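
If you are not sure how many replicas the cluster keeps, you can check PD's replication settings first. A minimal sketch, assuming pd-ctl is available (it can also be invoked via tiup ctl; the PD address is a placeholder):

# Look for "max-replicas": 3 in the output
pd-ctl -u http://10.10.10.1:2379 config show replication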

| username: tidb狂热爱好者 | Original post link

Are there specific steps?

| username: Demo二棉裤 | Original post link

Normally, each node holds at most one replica of any given Region.

SELECT *
FROM (
    SELECT rp.region_id, COUNT(*) AS rc
    FROM INFORMATION_SCHEMA.TIKV_REGION_PEERS rp
        LEFT JOIN INFORMATION_SCHEMA.TIKV_STORE_STATUS ss ON rp.store_id = ss.store_id
    WHERE ss.address LIKE '10.10.10.1:%' -- Remember to replace this with the corresponding IP address
    GROUP BY rp.region_id
) tmp
WHERE rc > 1;

You can check with this query. If it returns no rows (that is, no Region has more than one replica on that node), then losing the node costs at most one replica per Region, and the data can theoretically be recovered. It also depends on how your store labels are configured.

| username: Jellybean | Original post link

You can refer to these articles:

| username: Kongdom | Original post link

:yum: With these three articles, you can navigate the entire community.

Personally, I think you should rebuild PD first, then repair TiKV.

The PD cluster only stores the cluster's metadata, which TiKV reports to it through heartbeats. Therefore, even after all data in the PD cluster is lost, the entire TiDB cluster can be repaired by rebuilding the PD cluster.
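
For the rebuild itself, pd-recover is the usual tool. A rough sketch, assuming a brand-new PD node has already been deployed (the endpoint, cluster ID, and alloc ID below are placeholders; the real cluster ID has to be recovered from the old PD logs or from TiKV):

# cluster-id: find it in the old logs, e.g. grep "init cluster id" pd.log
# alloc-id:   any value safely larger than the largest ID the old cluster ever allocated
tiup pd-recover -endpoints http://10.10.10.1:2379 \
    -cluster-id 6747551640615446306 \
    -alloc-id 100000000

# Then restart PD and bring TiKV/TiDB back up.

This follows the standard pd-recover workflow; the exact flags may differ slightly between versions.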

| username: zhanggame1 | Original post link

Theoretically feasible, but in practice you will probably need to contact the vendor's official support.

| username: FutureDB | Original post link

Issues with TiDB and PD are not a major concern; the key is the three TiKV replicas. If a Region has only one damaged replica it's manageable, but if two replicas are damaged it becomes problematic.
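
To see which case you are in, you can list the Regions that have lost at least half of their replicas on the failed stores. A sketch based on the pd-ctl --jq usage (store IDs 4 and 5 stand in for the damaged TiKV stores; run it inside an interactive pd-ctl session so the $ signs are not eaten by the shell):

pd-ctl -u http://10.10.10.1:2379 -i
>> region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(4,5) then . else empty end) | length>=$total-length)}"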

| username: WinterLiu | Original post link

Theoretically feasible, but in practice you will probably need to contact the vendor's official support.

| username: caiyfc | Original post link

Your question is essentially about how to recover if TiKV is damaged. Even if all PD nodes are down and cannot be started, you can use pd-recover to rebuild the PD cluster. Once PD is up, the other components can be brought up as well. Then you only need to assess the extent of the damage to TiKV and whether the data can be recovered. As long as only a few nodes are lost, there is a high probability that the data can be recovered normally.

| username: TIDB-Learner | Original post link

Your question is essentially about how to recover if TiKV is damaged. Even if all PD nodes are down and can't be brought back up, you can use pd-recover to rebuild the PD cluster. Once PD is up, the other components can be started. Then you only need to assess the extent of the damage to TiKV and whether data can be recovered. As long as only a few nodes are lost, there is a high probability that data can be recovered normally.
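
On newer versions (v6.1 and later), PD also offers Online Unsafe Recovery, which automates most of this. A rough sketch, with 4 and 5 as placeholder IDs of the permanently lost stores:

# Forcibly drop the dead stores from all affected Regions
pd-ctl -u http://10.10.10.1:2379 unsafe remove-failed-stores 4,5

# Check the progress of the recovery
pd-ctl -u http://10.10.10.1:2379 unsafe remove-failed-stores show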

| username: Jack-li | Original post link

You will run into all kinds of errors during the actual operations.

| username: jiayou64 | Original post link

Read with admiration :clap: :slightly_smiling_face:

| username: zhh_912 | Original post link

The expert is very professional.

| username: zhh_912 | Original post link

Roughly the steps would be (a re-import sketch follows below):

1. Start the PD process on a new node based on the backed-up PD configuration and topology.
2. Start the missing TiKV nodes one by one and add them back to the cluster.
3. For nodes with corrupted data, try to repair or replace the disk and then re-add them to the cluster.
4. Use the backup data and a tool such as TiDB Lightning (importer) to re-import the data into the cluster.
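
A minimal sketch of the re-import step, assuming the backup is a set of Dumpling files described by an existing tidb-lightning.toml, or a BR full backup under /data/br-backup (all paths and addresses are placeholders):

# Re-import logical backup files with TiDB Lightning
tiup tidb-lightning -config tidb-lightning.toml

# Or, if the backup was taken with BR, restore it directly
tiup br restore full --pd "10.10.10.1:2379" --storage "local:///data/br-backup"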

| username: lemonade010 | Original post link

Take full backups and keep real-time log backups; that is the hard truth. If you hit this level of damage, recovering from backups is likely much faster than rebuilding.
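
For reference, a sketch of such a backup routine with BR (the PD address and S3 bucket are placeholders):

# Periodic full backup
tiup br backup full --pd "10.10.10.1:2379" --storage "s3://backup-bucket/full"

# Continuous log backup for point-in-time recovery
tiup br log start --task-name=pitr --pd "10.10.10.1:2379" --storage "s3://backup-bucket/log"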

| username: 友利奈绪 | Original post link

Bookmark it.

| username: TiDBer_小阿飞 | Original post link

Stop all TiKV nodes and run the following command on each remaining node. It removes the failed stores from the peer list of all Regions, so that after TiKV restarts those Regions can continue serving with the remaining healthy replicas.

tikv-ctl --db /path/to/tikv/db unsafe-recover remove-fail-stores -s 1 --all-regions

(Here 1 is the ID of the failed store; newer versions of tikv-ctl take --data-dir instead of --db.)

After executing the above command, check the store information with pd-ctl:
"state_name": "Tombstone" indicates that the TiKV node has been successfully decommissioned.

| username: TiDBer_QYr0vohO | Original post link

Refer to this column article: TiKV Multi-Replica Loss and Repair Practice (专栏 - TiKV 多副本丢失以及修复实践 | TiDB 社区).

| username: 鱼跃龙门 | Original post link

It depends on how many TiKV nodes are damaged. If the number of damaged peers in a region is less than half, the cluster will not lose data. If it is more than half but sync-log=true is enabled, data will also not be lost.