Note:
This topic has been translated from a Chinese forum by GPT and might contain errors. Original topic: On Sunday I helped a forum member recover a TiDB server; the process is recorded below. I have found that many people do not know how to use pd-recover.

Since this is a test machine, everything is simplified.
The original cluster layout is shown in the topology below.
Steps
- Edit cluster topology
cat > topology.yaml << EOF
global:
  user: "tidb"
  ssh_port: 22
  deploy_dir: "/tidb-deploy"
  data_dir: "/home/tidb-data"
server_configs:
  tidb:
    log.slow-threshold: 300
  # tikv:
  #   storage.engine: partitioned-raft-kv
pd_servers:
  - host: 192.168.137.2
tidb_servers:
  - host: 192.168.137.3
tikv_servers:
  - host: 192.168.137.4
  - host: 192.168.137.5
  - host: 192.168.137.6
monitoring_servers:
  - host: 192.168.137.5
grafana_servers:
  - host: 192.168.137.5
alertmanager_servers:
  - host: 192.168.137.5
EOF
- Download & Unzip & Configure Passwordless SSH
curl -O https://download.pingcap.org/tidb-community-server-v6.6.0-linux-amd64.tar.gz
tar zxf tidb-community-server-v6.6.0-linux-amd64.tar.gz
cd tidb-community-server-v6.6.0-linux-amd64
- Passwordless SSH
ssh-keygen
ssh-copy-id root@127.0.0.1
- Install TiDB
bash local_install.sh
source /root/.bash_profile
tiup cluster check ./topology.yaml --apply
tiup cluster deploy tidb-test v6.6.0 ./topology.yaml
tiup cluster start tidb-test
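To confirm the deployment actually came up, a quick status check can be run; tiup cluster display is a standard TiUP subcommand, and the cluster name is the one used in the deploy step:
tiup cluster display tidb-test
# every component should report status "Up", with PD listening on 192.168.137.2:2379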
At this point, the entire TiDB cluster starts normally. The steps above are fully scripted and need no manual intervention.
Next, we write some data into the database:
create database oscardba;
use oscardba;
create table oscardbatable (id int);
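For reference, here is one way to run those statements and load a few rows from the shell. This is only a sketch: it assumes a MySQL client is installed and uses the TiDB defaults (port 4000, passwordless root) on the tidb_server host from the topology.
mysql -h 192.168.137.3 -P 4000 -u root oscardba \
  -e "insert into oscardbatable values (1),(2),(3); select count(*) from oscardbatable;"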
At this point, we simulate a PD failure:
Start a stress test, then cut power to the PD server.
Because this is a hard power-off, the PD data gets corrupted and the entire cluster becomes unavailable.
When the PD server is powered back on, the PD service will not come up and the whole TiDB cluster stays down.
After logging into the system, we confirm that the PD service is unavailable.
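One way to confirm that PD itself is the component that is down; /pd/api/v1/health is part of PD's HTTP API, and tiup cluster display will mark the PD node as Down:
curl http://192.168.137.2:2379/pd/api/v1/health   # fails or hangs while PD is unavailable
tiup cluster display tidb-test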
Now let's start the PD recovery process.
To recover PD, we need two values: the cluster ID and the largest ID that PD has already allocated.
cat {{/path/to}}/pd.log | grep "init cluster id"
cat {{/path/to}}/pd*.log | grep "idAllocator allocates a new id" | awk -F'=' '{print $2}' | awk -F']' '{print $1}' | sort -r -n | head -n 1
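As a sketch of what those commands extract, assuming the default TiUP log location under the deploy_dir from the topology (/tidb-deploy/pd-2379/log/pd.log) and PD's usual cluster-id=... log field:
grep "init cluster id" /tidb-deploy/pd-2379/log/pd.log | tail -n 1 \
  | sed 's/.*cluster-id=\([0-9]*\).*/\1/'   # prints the numeric cluster ID, e.g. 6747551640615446306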
Since my PD had no data left and the logs had already been deleted once, I could not get the largest allocated ID using the official method, so I came up with my own approach.
I located the commit ID and last index, which should be the largest ID the system had allocated: 14015995. Because pd-recover only needs a value larger than anything already allocated, I added 1,000,000 to it as a safety margin, giving 15,015,995.
With those two values in hand, first kill the PD process and back up its data directory:
ps -ef | grep pd-server | grep -v grep | awk '{print $2}' | xargs kill -9
mv /data/tidb-data/pd-2379 /data/tidb-data/pd-2379bak
After backing up the data directory, start the PD service again:
service pd-2379 start
At this point PD comes up empty, and the cluster metadata needs to be restored with the pd-recover tool.
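Before running pd-recover, it is worth confirming that the empty PD instance answers on its client port; /pd/api/v1/members is part of PD's HTTP API:
curl http://192.168.137.2:2379/pd/api/v1/members   # a JSON reply means the new, empty PD is reachable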
curl -O https://download.pingcap.org/tidb-community-toolkit-v6.6.0-linux-amd64.tar.gz
tar zxf tidb-community-toolkit-v6.6.0-linux-amd64.tar.gz
cd tidb-community-toolkit-v6.6.0-linux-amd64
tar zxf pd-recover-v6.6.0-linux-amd64.tar.gz
./pd-recover -endpoints http://192.168.137.2:2379 -cluster-id 6747551640615446306 -alloc-id 15015995
If the recovery succeeds, pd-recover reports success.
Then stop and restart the PD service:
service pd-2379 stop
service pd-2379 restart
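After the restart, the recovered cluster ID can be double-checked with pd-ctl; tiup ctl wraps the same pd-ctl binary, and cluster is a standard pd-ctl subcommand:
tiup ctl:v6.6.0 pd -u http://192.168.137.2:2379 cluster   # the "id" field should match the cluster-id passed to pd-recover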
At this point, the PD server has been recovered successfully, and the TiDB cluster works normally again.
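As a final check, the cluster status and the test data written earlier can be verified, using the same host and port assumptions as above:
tiup cluster display tidb-test
mysql -h 192.168.137.3 -P 4000 -u root -e "select count(*) from oscardba.oscardbatable;"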