Backup v4.0.9 Failed

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: br 备份v4.0.9 失败

| username: 孤独的狼

v4.0.9

【TiDB Usage Environment】Production environment or Test environment or POC
【TiDB Version】
【Encountered Problem】
【Reproduction Path】What operations were performed when the problem occurred
【Problem Phenomenon and Impact】
Cluster type: tidb
Cluster name: tidb-test
Cluster version: v4.0.9
Deploy user: tidb
SSH type: builtin
Dashboard URL: http://172.17.30.118:2379/dashboard
| ID | Role | Host | Ports | OS/Arch | Status | Data Dir | Deploy Dir |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 172.17.30.116:9093 | alertmanager | 172.17.30.116 | 9093/9094 | linux/x86_64 | Up | /tidb-data/alertmanager-9093 | /tidb-deploy/alertmanager-9093 |
| 172.17.30.116:3000 | grafana | 172.17.30.116 | 3000 | linux/x86_64 | Up | - | /tidb-deploy/grafana-3000 |
| 172.17.30.117:2379 | pd | 172.17.30.117 | 2379/2380 | linux/x86_64 | Up\|L | /tidb-data/pd-2379 | /tidb-deploy/pd-2379 |
| 172.17.30.118:2379 | pd | 172.17.30.118 | 2379/2380 | linux/x86_64 | Up\|UI | /tidb-data/pd-2379 | /tidb-deploy/pd-2379 |
| 172.17.30.119:2379 | pd | 172.17.30.119 | 2379/2380 | linux/x86_64 | Up | /tidb-data/pd-2379 | /tidb-deploy/pd-2379 |
| 172.17.30.116:9090 | prometheus | 172.17.30.116 | 9090 | linux/x86_64 | Up | /tidb-data/prometheus-9090 | /tidb-deploy/prometheus-9090 |
| 172.17.30.117:4000 | tidb | 172.17.30.117 | 4000/10080 | linux/x86_64 | Up | - | /tidb-deploy/tidb-4000 |
| 172.17.30.118:4000 | tidb | 172.17.30.118 | 4000/10080 | linux/x86_64 | Up | - | /tidb-deploy/tidb-4000 |
| 172.17.30.119:4000 | tidb | 172.17.30.119 | 4000/10080 | linux/x86_64 | Up | - | /tidb-deploy/tidb-4000 |
| 172.17.30.117:20160 | tikv | 172.17.30.117 | 20160/20180 | linux/x86_64 | Up | /tidb-data/tikv-20160 | /tidb-deploy/tikv-20160 |
| 172.17.30.118:20160 | tikv | 172.17.30.118 | 20160/20180 | linux/x86_64 | Up | /tidb-data/tikv-20160 | /tidb-deploy/tikv-20160 |
| 172.17.30.119:20160 | tikv | 172.17.30.119 | 20160/20180 | linux/x86_64 | Down | /tidb-data/tikv-20160 | /tidb-deploy/tikv-20160 |
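
The status above can be re-checked at any time; a minimal sketch, assuming the standard tiup tooling, with the cluster name and PD address taken from this post:

```bash
# Re-check component status; note the TiKV at 172.17.30.119:20160 is Down.
tiup cluster display tidb-test

# The same store should show as down/disconnected in pd-ctl
# (newer tiup syntax; older versions use `tiup ctl pd`):
tiup ctl:v4.0.9 pd -u http://172.17.30.117:2379 store
```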

【Attachments】

Please provide the version information of each component, such as cdc/tikv, which can be obtained by executing cdc version/tikv-server --version.

The backup command and its log output are as follows:

```bash
[tidb@tidb-30-117 bakcup]$ br backup db --pd "172.17.30.117:2379" --db CRM_Tag --storage "local:///bakcup/full" --ratelimit 120 --log-file backupdb_CRM_Tag.log
Detail BR log in backupdb_CRM_Tag.log
Database backup <…> 0.00%
Error: context deadline exceeded
```

The backup directory exists with the following permissions on both the PD and TiKV nodes; it was created the same way on 172.17.30.117, 118, and 119:

```bash
mkdir /bakcup/full
chmod 777 /bakcup/full
chown tidb:tidb /bakcup/full
```
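
With `local:///` storage, each TiKV node writes its own backup files into that local directory, so it must exist and be writable by the deploy user on every TiKV host. A minimal check, assuming passwordless SSH as the tidb user, with the hosts taken from the topology above:

```bash
# Confirm the backup directory exists and is writable on every TiKV host.
for host in 172.17.30.117 172.17.30.118 172.17.30.119; do
  echo "== $host =="
  ssh tidb@"$host" 'ls -ld /bakcup/full && touch /bakcup/full/.wtest && rm /bakcup/full/.wtest'
done
```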

| username: 孤独的狼 | Original post link

The detailed backup error log is in the attached file. I ran the backup using the BR tool installed on the PD node at 172.17.30.117, but it failed.

| username: 孤独的狼 | Original post link

The log is as follows, Master Tian.

| username: 孤独的狼 | Original post link

(Attachment: tikv.log.2022-07-12-15:19:07.001059417)

Here are the logs, please help me check.

| username: undefined | Original post link

Isn’t this an issue with the TiKV configuration?

| username: 孤独的狼 | Original post link

Log 1 (attachment: tikv.log.2022-07-12-15:19:07.001059417)

There are two logs.

| username: 孤独的狼 | Original post link

Sure, I see it. I’ll upload the configuration file. Please help me check how to modify it.
(Attachment: last_tikv.toml, 14.0 KB)

| username: tidb狂热爱好者 | Original post link

It looks like you have bound the CPU incorrectly.

| username: 孤独的狼 | Original post link

How did you know? I didn’t quite understand.

| username: Kongdom | Original post link

Is it possible that the grpc-memory-pool-quota parameter is set too high?
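
For reference, here is one way to check and adjust that parameter; a sketch, assuming the standard tiup commands (`server.grpc-memory-pool-quota` caps the memory the TiKV gRPC layer may use):

```bash
# Inspect the value TiKV actually started with (data dir from this thread):
grep -n "grpc-memory-pool-quota" /tidb-data/tikv-20160/last_tikv.toml

# Adjust it cluster-wide, then reload only the TiKV role:
tiup cluster edit-config tidb-test
tiup cluster reload tidb-test -R tikv
```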

| username: Kongdom | Original post link

The error message points to a problem with the configuration on line 207. Compare the configuration in last_tikv.toml with the one in .tiup/storage/cluster/clusters/cluster_name/meta.yaml to see whether they match, or check meta.yaml for any abnormal settings.
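
A sketch of where to find the two files, assuming the default paths from this thread (last_tikv.toml sits in the TiKV data directory and records the configuration TiKV last started with; meta.yaml on the tiup control machine holds what tiup deployed):

```bash
# On the TiKV host:
less /tidb-data/tikv-20160/last_tikv.toml

# On the tiup control machine (cluster name from this thread):
less ~/.tiup/storage/cluster/clusters/tidb-test/meta.yaml
```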

| username: 孤独的狼 | Original post link

The logs are as follows.

| username: 孤独的狼 | Original post link

Thank you everyone, the issue has been resolved. The problem was the TiKV node at 119: it was Down and was blocking the physical BR backup. Taking the 119 TiKV node offline resolved the issue; it is now in Pending Offline status.
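
For anyone following along, taking a TiKV node offline is done with scale-in; a sketch, with the node ID taken from the topology above (PD keeps the store in Pending Offline until its regions have been migrated away):

```bash
tiup cluster scale-in tidb-test -N 172.17.30.119:20160

# Watch the store go from Pending Offline to Tombstone:
tiup cluster display tidb-test
```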

| username: cs58_dba | Original post link

In a situation like this, it really pays to have proper monitoring in place so that an alert fires the moment a node goes down.
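
The bundled Prometheus already scrapes every component, so a dead instance can be spotted with a simple query; a sketch, with the address taken from the topology above (real alerting would normally go through the deployed Alertmanager):

```bash
# Any series returned here is a scraped instance that is currently down:
curl -s 'http://172.17.30.116:9090/api/v1/query' --data-urlencode 'query=up == 0'
```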

| username: system | Original post link

This topic will be automatically closed 60 days after the last reply. No new replies are allowed.