Note:
This topic has been translated from a Chinese forum by GPT and might contain errors. Original topic: On Sunday I helped a forum member recover a TiDB server; the process is recorded below. I have found that many people do not know how to use pd-recover.

Since this is a test machine, everything is simplified.
The original cluster layout is shown in the topology below.
Steps
- Edit cluster topology
cat > topology.yaml << EOF
global:
  user: "tidb"
  ssh_port: 22
  deploy_dir: "/tidb-deploy"
  data_dir: "/home/tidb-data"
server_configs:
  tidb:
    log.slow-threshold: 300
  # tikv:
  #   storage.engine: partitioned-raft-kv
pd_servers:
  - host: 192.168.137.2
tidb_servers:
  - host: 192.168.137.3
tikv_servers:
  - host: 192.168.137.4
  - host: 192.168.137.5
  - host: 192.168.137.6
monitoring_servers:
  - host: 192.168.137.5
grafana_servers:
  - host: 192.168.137.5
alertmanager_servers:
  - host: 192.168.137.5
EOF
- Download & Unzip & Configure Passwordless SSH
curl -O https://download.pingcap.org/tidb-community-server-v6.6.0-linux-amd64.tar.gz
tar zxf tidb-community-server-v6.6.0-linux-amd64.tar.gz
cd tidb-community-server-v6.6.0-linux-amd64
- Passwordless SSH
ssh-keygen
ssh-copy-id root@127.0.0.1
- Install TiDB
bash local_install.sh
source /root/.bash_profile
tiup cluster check ./topology.yaml --apply
tiup cluster deploy tidb-test v6.6.0 ./topology.yaml
tiup cluster start tidb-test
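To confirm the deployment actually came up, a quick status check can be run; tiup cluster display is a standard TiUP subcommand, and the cluster name is the one used in the deploy step:
tiup cluster display tidb-test
# every component should report status "Up", with PD listening on 192.168.137.2:2379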
At this point, the entire TiDB cluster starts normally. The steps above are fully scripted and need no manual intervention.
Next, we write some data into the database:
create database oscardba;
use oscardba;
create table oscardbatable (id int);
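For reference, here is one way to run those statements and load a few rows from the shell. This is only a sketch: it assumes a MySQL client is installed and uses the TiDB defaults (port 4000, passwordless root) on the tidb_server host from the topology.
mysql -h 192.168.137.3 -P 4000 -u root oscardba \
  -e "insert into oscardbatable values (1),(2),(3); select count(*) from oscardbatable;"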
At this point, we simulate a PD failure:
Start a stress test, then cut power to the PD server.
Because this is a hard power-off, the PD data gets corrupted and the entire cluster becomes unavailable.
When the PD server is powered back on, the PD service will not come up and the whole TiDB cluster stays down.
After logging into the system, we confirm that the PD service is unavailable.
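One way to confirm that PD itself is the component that is down; /pd/api/v1/health is part of PD's HTTP API, and tiup cluster display will mark the PD node as Down:
curl http://192.168.137.2:2379/pd/api/v1/health   # fails or hangs while PD is unavailable
tiup cluster display tidb-test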
Now let's start the PD recovery process.
To recover PD, we need two values: the cluster ID and the largest ID that PD has already allocated.
cat {{/path/to}}/pd.log | grep "init cluster id"
cat {{/path/to}}/pd*.log | grep "idAllocator allocates a new id" | awk -F'=' '{print $2}' | awk -F']' '{print $1}' | sort -r -n | head -n 1
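As a sketch of what those commands extract, assuming the default TiUP log location under the deploy_dir from the topology (/tidb-deploy/pd-2379/log/pd.log) and PD's usual cluster-id=... log field:
grep "init cluster id" /tidb-deploy/pd-2379/log/pd.log | tail -n 1 \
  | sed 's/.*cluster-id=\([0-9]*\).*/\1/'   # prints the numeric cluster ID, e.g. 6747551640615446306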
Since my PD had no data left and the logs had already been deleted once, I could not get the largest allocated ID using the official method, so I came up with my own approach.
I located the commit ID and last index, which should be the largest ID the system had allocated: 14015995. Because pd-recover only needs a value larger than anything already allocated, I added 1,000,000 to it as a safety margin, giving 15,015,995.
With those two values in hand, first kill the PD process and back up its data directory:
ps -ef | grep pd-server | grep -v grep | awk '{print $2}' | xargs kill -9
mv /data/tidb-data/pd-2379 /data/tidb-data/pd-2379bak
After backing up the data directory, start the PD service again:
service pd-2379 start
At this point PD comes up empty, and the cluster metadata needs to be restored with the pd-recover tool.
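Before running pd-recover, it is worth confirming that the empty PD instance answers on its client port; /pd/api/v1/members is part of PD's HTTP API:
curl http://192.168.137.2:2379/pd/api/v1/members   # a JSON reply means the new, empty PD is reachable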
curl -O https://download.pingcap.org/tidb-community-toolkit-v6.6.0-linux-amd64.tar.gz
tar zxf tidb-community-toolkit-v6.6.0-linux-amd64.tar.gz
cd tidb-community-toolkit-v6.6.0-linux-amd64
tar zxf pd-recover-v6.6.0-linux-amd64.tar.gz
./pd-recover -endpoints http://192.168.137.2:2379 -cluster-id 6747551640615446306 -alloc-id 15015995
If the recovery succeeds, pd-recover reports success.
Then stop and restart the PD service:
service pd-2379 stop
service pd-2379 restart
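After the restart, the recovered cluster ID can be double-checked with pd-ctl; tiup ctl wraps the same pd-ctl binary, and cluster is a standard pd-ctl subcommand:
tiup ctl:v6.6.0 pd -u http://192.168.137.2:2379 cluster   # the "id" field should match the cluster-id passed to pd-recover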
At this point, the PD server has been recovered successfully, and the TiDB cluster works normally again.
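As a final check, the cluster status and the test data written earlier can be verified, using the same host and port assumptions as above:
tiup cluster display tidb-test
mysql -h 192.168.137.3 -P 4000 -u root -e "select count(*) from oscardba.oscardbatable;"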