Backup and Restore of TiKV RawKV: Issues During Restore

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIKV RawKV的备份和恢复,恢复时出现问题

| username: 中国电信TIKV

Question 1: The documentation does not mention how to back up the full RawKV data set. Is it done by simply omitting the --start and --end parameters?
Question 2: In our current tests, backups to both local disk and S3 succeed, but restore fails differently for each: the remote S3 restore reports no error yet restores 0 ranges, while the local restore reports an error.

1. Test Process - Remote S3 (Ceph)

[root@nma07-304-d19-sev-r740-2u21 tls]# /opt/TDP/tidb-community-server-v6.3.0-linux-amd64/br backup raw --pd "127.0.0.1:2379" --ca ca.crt --cert client.crt --key client.pem --ratelimit 128 --cf default --storage "s3://juicefsmetabk/test1205?endpoint=http://10.37.70.2:8081&access-key=J5PSR9YQL0TJ4BBXFTWD&secret-access-key=xQlY47EWvA2URPwcBt7ZB9d72iKK7jss8Bb5PSS5" --send-credentials-to-tikv=true
Detail BR log in /tmp/br.log.2022-12-05T09.10.54+0800
Raw Backup <--------------------------------------------------> 100.00%
[2022/12/05 09:10:54.908 +08:00] [INFO] [collector.go:69] ["Raw Backup success summary"] [total-ranges=6] [ranges-succeed=6] [ranges-failed=0] [backup-total-regions=6] [total-take=392.84766ms] [total-kv=20437] [total-kv-size=16.72MB] [average-speed=42.55MB/s] [backup-data-size(after-compressed)=761.2kB]
  2. View 2000 data entries and store them for comparison with the restored data
cd /root/.tiup/storage/cluster/clusters/csfl-cluster/tls/
tiup ctl:v6.3.0 tikv --ca-path ca.crt --cert-path client.crt --key-path client.pem --host 10.37.70.31:20160 --data-dir /software/tidb-data/tikv-20160 scan --from 'z' --limit 2000 --show-cf lock,default,write
  3. Clear all data from the cluster and restart the cluster
tiup cluster clean prod-cluster --data
tiup cluster start prod-cluster
  4. Import the backup data
/opt/TDP/tidb-community-server-v6.3.0-linux-amd64/br restore raw --pd "127.0.0.1:2379" --ca ca.crt --cert client.crt --key client.pem --ratelimit 128 --cf default --storage "s3://juicefsmetabk/test1205?endpoint=http://10.37.70.2:8081&access-key=J5PSR9YQL0TJ4BBXFTWD&secret-access-key=xQlY47EWvA2URPwcBt7ZB9d72iKK7jss8Bb5PSS5" --send-credentials-to-tikv=true

Raw Restore <--------------------------------------------------> 100.00%
[2022/12/05 14:02:28.700 +08:00] [INFO] [collector.go:69] ["Raw Restore success summary"] [total-ranges=0] [ranges-succeed=0] [ranges-failed=0] [restore-files=6] [total-take=482.410289ms] [Result="Nothing to restore"] [total-kv=20437] [total-kv-size=16.72MB] [average-speed=34.65MB/s]

The summary shows 0 restored ranges (Result="Nothing to restore").
  5. View 2000 data entries and compare with the data saved earlier

tiup ctl:v6.3.0 tikv --ca-path ca.crt --cert-path client.crt --key-path client.pem --host 10.37.70.31:20160 --data-dir /software/tidb-data/tikv-20160 scan --from 'z' --limit 2000 --show-cf lock,default,write >2001
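
One simple way to compare, assuming the step-2 scan was also redirected to a file (the pre-backup file name here is illustrative):

diff scan_before.txt 2001   # lines present only in scan_before.txt are keys that were not restored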

Found that the data whose keys start with jfs was indeed not restored.

2. Test Process - Local Disk

  1. Perform a local backup and test two scenarios, with and without the --start parameter, for comparison, since the official documentation does not describe how to take a full backup. Also try omitting both the --start and --end parameters.

The string jfs hex-encodes to 6A6673, which is used as the --start value. We are not sure this is exactly the key range we need, but it does contain data.
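
For example, the encoding can be verified with any hex tool, such as xxd if available:

echo -n 'jfs' | xxd -p   # prints 6a6673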

/opt/TDP/tidb-community-server-v6.3.0-linux-amd64/br backup raw --pd "127.0.0.1:2379" --ca ca.crt --cert client.crt --key client.pem --start 6A6673 --ratelimit 128 --cf default --storage "local:///home/tidb/backuprawzjfs6A6673"
[2022/12/06 10:01:08.036 +08:00] [INFO] [collector.go:69] ["Raw Backup success summary"] [total-ranges=2] [ranges-succeed=2] [ranges-failed=0] [backup-total-regions=2] [total-take=84.392519ms] [backup-data-size(after-compressed)=829.5kB] [total-kv=20853] [total-kv-size=19.03MB] [average-speed=225.5MB/s]

Without the --start and --end parameters, the backup also succeeds and reports the same data volume as above:

/opt/TDP/tidb-community-server-v6.3.0-linux-amd64/br backup raw --pd "127.0.0.1:2379" --ca ca.crt --cert client.crt --key client.pem --ratelimit 128 --cf default --storage "local:///home/tidb/backupraw"
[2022/12/06 09:58:11.842 +08:00] [INFO] [collector.go:69] ["Raw Backup success summary"] [total-ranges=2] [ranges-succeed=2] [ranges-failed=0] [backup-total-regions=2] [total-take=84.355335ms] [total-kv=20853] [total-kv-size=19.03MB] [average-speed=225.6MB/s] [backup-data-size(after-compressed)=829.5kB]
  2. Clear all data from the cluster and restart the cluster
tiup cluster clean prod-cluster --data
tiup cluster start prod-cluster
  3. Import the backup data
[tidb@nma07-304-d19-sev-r740-2u21 tls]$ /opt/TDP/tidb-community-server-v6.3.0-linux-amd64/br restore raw --pd "127.0.0.1:2379" --ca ca.crt --cert client.crt --key client.pem --start 6A6673 --ratelimit 128 --cf default --storage "local:///home/tidb/backuprawzjfs6A6673"
[tidb@nma07-304-d19-sev-r740-2u21 tls]$ /opt/TDP/tidb-community-server-v6.3.0-linux-amd64/br restore raw --pd "127.0.0.1:2379" --ca ca.crt --cert client.crt --key client.pem --ratelimit 128 --cf default --storage "local:///home/tidb/backupraw"

| username: 中国电信TIKV | Original post link

For reference, this is the document I followed for RawKV backup and restore.

| username: pingyu | Original post link

Use TiKV BR; refer to TiKV | RawKV BR.

TiDB BR will no longer support backup and restore for RawKV in the future.

| username: 中国电信TIKV | Original post link

Thank you. I also ran the test you recommended, and an issue occurred during restore. The test procedure is as follows:

  1. Write data
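// Assumes the TiKV Java client, with the session and client created roughly as:
// TiSession session = TiSession.create(TiConfiguration.createRawDefault("192.168.72.32:2379"));
// RawKVClient client = session.createRawClient();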
client.put(ByteString.copyFromUtf8("k1"), ByteString.copyFromUtf8("Hello"));
client.put(ByteString.copyFromUtf8("k2"), ByteString.copyFromUtf8(","));
client.put(ByteString.copyFromUtf8("k3"), ByteString.copyFromUtf8("World"));
client.put(ByteString.copyFromUtf8("k4"), ByteString.copyFromUtf8("!"));
client.put(ByteString.copyFromUtf8("k5"), ByteString.copyFromUtf8("Raw KV"));
  2. Backup
./tikv-br backup raw --pd="192.168.72.32:2379" --storage="local:///home/tidb/backupraw/" --log-file="/tmp/backupraw1.log" --gcttl=5m --format="raw"
  3. Clear data
tiup cluster clean tidb-test --data
  4. Restore
./tikv-br restore raw \
--pd "192.168.72.32:2379" \
--storage "local:///home/tidb/backupraw" \
--log-file restoreraw.log

The restore failed with the following error:

Cannot read local:///home/tidb/backupraw/4/6_2_default.sst into /ssd13/tidb-data/tikv-20160/import/.temp/27682296-4918-4ede-bc85-e327b5a6ce2c_8_5_2_default.sst: No such file or directory (os error 2): [BR:KV:ErrKVDownloadFailed]download sst failed;

| username: 中国电信TIKV | Original post link

  3. Clear data
    tiup cluster clean tidb-test --data

I found that clearing the data at this step also empties this folder under the data directory: /ssd13/tidb-data/tikv-20160/import/.temp/

Is it not allowed to clear the data before restoring?

| username: pingyu | Original post link

TiKV cannot read the file local:///home/tidb/backupraw/4/6_2_default.sst.
How is the cluster deployed? If it is a multi-machine deployment, you need to use a distributed file system to store the backup files, because with local:// storage each TiKV node can only see the files on its own disk.
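
For example, a minimal sketch with NFS (the server address and export path are illustrative); the same backup directory must be visible at the same path on every TiKV node and on the machine running BR:

# run on every TiKV node and on the BR host
mount -t nfs 10.0.0.100:/export/backupraw /home/tidb/backupraw
# then restore with the same local:// path as before
./tikv-br restore raw --pd "192.168.72.32:2379" --storage "local:///home/tidb/backupraw" --log-file restoreraw.log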

| username: 中国电信TIKV | Original post link

Hello experts! Based on the earlier replies, I switched to Ceph's S3 as the storage medium and used the BR tool from the TiKV official website, but I still hit an error. Please take a look. Thank you.

[root@nma07-304-d19-sev-r740-2u21 lzs]# export AWS_ACCESS_KEY_ID="M0QELJB5OKQLEB0MXTDE";
[root@nma07-304-d19-sev-r740-2u21 lzs]# export AWS_SECRET_ACCESS_KEY="w4r9N3ZydOZHjoSsBrCCwgr7LDiaWPVaS4mmkiz7";
[root@nma07-304-d19-sev-r740-2u21 lzs]# ./tikv-br backup raw --pd="10.37.70.31:2379" --storage "s3://tikvtest/test0301" --s3.endpoint "http://10.37.69.240:8081" --ratelimit=128 --dst-api-version=v2 --log-file="/tmp/br_backup.log"
Detail BR log in /tmp/br_backup.log
[2023/03/01 10:32:18.776 +08:00] [INFO] [collector.go:67] ["Raw backup failed summary"] [total-ranges=0] [ranges-succeed=0] [ranges-failed=0]
Error: error occurred when checking backupmeta file: Forbidden: Forbidden status code: 403, request id: tx000007e49fa4bdc7907c0-0063feb932-447d5-default

| username: 中国电信TIKV | Original post link

Hello, it is a multi-machine deployment. We have just set up an S3 environment and tested again; please take another look at the reply I posted today.

| username: pingyu | Original post link

Check the read and write permissions of the Ceph path.
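
One quick way to verify access outside of BR, assuming the AWS CLI is installed (bucket, endpoint, and keys taken from the command above):

export AWS_ACCESS_KEY_ID="M0QELJB5OKQLEB0MXTDE"
export AWS_SECRET_ACCESS_KEY="w4r9N3ZydOZHjoSsBrCCwgr7LDiaWPVaS4mmkiz7"
aws s3 ls s3://tikvtest/ --endpoint-url http://10.37.69.240:8081   # a 403 here points to bucket permissions rather than to BR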

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.