Using the BR Tool to Back Up TiKV Data: Backup and Restore Both Report Success, but There Is No Data in TiKV

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 使用br工具备份tikv数据,备份及恢复,都提示sucess,但是tikv中没有数据

| username: Anthony99

【TiDB Usage Environment】Test Environment
【Reproduction Path】Using Tikv as the backend for juicefs and storing data;
【Encountered Problem: Phenomenon and Impact】

  1. Execute the backup. BR reports success; the backup destination is a Ceph RGW bucket (S3).

  2. View the backup files. The backup files are visible in the bucket.

  3. Enter TiKV and delete all data.

  4. Start the restore.

  5. Enter TiKV and check the restored data.

【Resource Configuration】Enter TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
【Attachments: Screenshots/Logs/Monitoring】

Question:
Could you please tell me why there is no data even though the restore reports success? Is it possible that some parameters are not set correctly? Thank you.
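
The actual commands were attached as screenshots, which are not preserved here. A rough reconstruction of the shape of steps 1 and 4, assuming the txn subcommand (mentioned in the replies below) and br's standard S3 flags; the PD address, bucket, and RGW endpoint are placeholders:

# Back up TiKV txn data to a Ceph RGW bucket over the S3 protocol.
br backup txn \
    --pd "127.0.0.1:2379" \
    --storage "s3://backup-bucket/tikv-backup/" \
    --s3.endpoint "http://rgw.example.com:7480"

# Restore from the same bucket into the cluster.
br restore txn \
    --pd "127.0.0.1:2379" \
    --storage "s3://backup-bucket/tikv-backup/" \
    --s3.endpoint "http://rgw.example.com:7480"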

| username: 像风一样的男子 | Original post link

Your BR backup did not specify full or a specific database, resulting in an empty backup.

| username: 像风一样的男子 | Original post link

Backup parameters are missing. Please check the command provided by the official documentation.
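
For example, the documented full-backup form looks roughly like this (the PD address and bucket are placeholders):

# Full backup of the whole cluster, as shown in the BR documentation.
br backup full \
    --pd "127.0.0.1:2379" \
    --storage "s3://backup-bucket/full-backup/"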

| username: Anthony99 | Original post link

Hello, txn and full are subcommands at the same level. I am using the txn subcommand because I am not using TiDB, only TiKV and PD.

| username: 像风一样的男子 | Original post link

Check if there are any SST files in the backup folder. If not, it means no data has been backed up.
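
For example, assuming the aws CLI is available, you can list the bucket against the RGW endpoint (bucket, prefix, and endpoint are placeholders):

# List everything under the backup prefix, pointing the aws CLI at Ceph RGW.
aws s3 ls s3://backup-bucket/tikv-backup/ --recursive --endpoint-url http://rgw.example.com:7480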

| username: Anthony99 | Original post link

Yes, I can see them in the Ceph bucket.

| username: cassblanca | Original post link

If SST files are generated, the backup is normal.

| username: Anthony99 | Original post link

Yes, there are SST files, but after the restore there is no data in TiKV, even though the message says the restoration succeeded.

| username: caiyfc | Original post link

How much data is in the database? According to the BR logs, everything looks normal, and the KV count is also provided. Please verify the data volume.

| username: Anthony99 | Original post link

Hmm, TiKV is now empty. The data volume is very small, only 21 keys. The restore also reported key=21, but when I go into TiKV and scan, there is no data. It's a bit strange.

| username: caiyfc | Original post link

So before the deletion, a scan could find the 21 records; after deleting and restoring, br reports success, but a scan can no longer find them, right?

| username: Anthony99 | Original post link

The backup export was successful, and it indicated total-kv=21; I could also see the SST files in the Ceph bucket. During the restore, it indicated success and showed [total-kv=21]; however, I couldn’t find the data in TiKV.

The whole process is as follows:

The key point is: restoring into a new TiKV cluster succeeded the first time, but it never succeeded again after that.

| username: caiyfc | Original post link

:thinking: I’ve never encountered this kind of scenario and phenomenon, so I don’t have much experience.

| username: 胡杨树旁 | Original post link

Is this txn parameter new in version 7? I couldn't find it in version 6.1.

| username: Anthony99 | Original post link

Yes, version 7.1.

| username: YuJuncen | Original post link

Let me make a guess. First, some background on MVCC and BR's implementation; if the original poster already knows this, feel free to skip it.

When BR backs up txn KV data, it directly backs up the latest version of each underlying raw KV pair. For example, suppose you perform the following operations:

kv_put("key", "value_v1");
kv_delete("key");
kv_put("key", "reborn");
kv_put("other_key", "other_value");

In TiKV's underlying storage, the three writes to "key" generate three MVCC versions, each stored as a distinct KV pair (plus one for "other_key"). If you render TiKV's underlying data as JSON, it looks like:

{
  "key_at_v1": "value_v1",
  "key_at_v2": null, // implies this key has been deleted.
  "key_at_v3": "reborn",
  "other_key_at_v1": "other_value"
}

When BR backs up, it stores only the latest version of each underlying KV. This means that if you run BR's txn KV backup against the TiKV above, you will get:

{
  "key_at_v3": "reborn",
  "other_key_at_v1": "other_value"
}

For performance, during a restore BR writes these keys directly back to the underlying storage without rewriting their versions. This is also why BR requires an empty cluster (or an empty table) during a restore: writing data at past versions would otherwise break the transaction model.

Back to the original poster's question: after the backup completed, the original poster ran a delete on the original cluster (judging by the screenshot, in Txn KV mode). So what does the underlying storage from the initial example look like now?

{
  "key_at_v1": "value_v1",
  "key_at_v2": null, 
  "key_at_v3": "reborn", // <- BR backed up this!
  "key_at_v4": null
  "other_key_at_v1": "other_value", // ...and this!
  "other_key_at_v2": null
}

At this point you can see that writing the BR backup back to the underlying storage actually writes key_at_v3 and other_key_at_v1. Both writes are masked by the newer delete versions (key_at_v4 and other_key_at_v2), so naturally the data cannot be found.

To verify this, the original poster can try waiting a while (at least ten minutes) until the old MVCC versions are GC'd and then retry the restore, or simply restore into a new cluster.
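
If you want to inspect the masking directly, tikv-ctl's documented mvcc subcommand prints every version of a key. A rough sketch; data keys are stored with a "z" prefix in escaped form, so "zkey" and the data directory below are placeholders you must adapt, and local mode requires the TiKV instance to be stopped first:

# Show all MVCC versions of the key across the lock, write, and default CFs.
tikv-ctl --data-dir /path/to/tikv-data mvcc -k "zkey" --show-cf=lock,write,default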

Perhaps BR could rewrite the version numbers of all keys to the current highest during a restore. But then, if there were unfinished transactions while the restore ran, their consistency might be broken.

| username: redgame | Original post link

Check the logs and error messages generated during the backup and restore process to see if there are any anomalies or error prompts. This information may provide more details about the backup or restore failure.

| username: Anthony99 | Original post link

Hello, thank you for your reply. After a night, the restored data still does not show up.

| username: Anthony99 | Original post link

The logs show success.

| username: YuJuncen | Original post link

Did you deploy TiDB? Without TiDB, MVCC GC might never be triggered. In any case, you can try deploying a new cluster to see if that resolves the issue.
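
As a quick check, pd-ctl's documented service-gc-safepoint command shows whether any service is registering and advancing a GC safepoint (the PD address is a placeholder):

# List all service GC safepoints known to PD.
pd-ctl -u http://127.0.0.1:2379 service-gc-safepoint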