BR Incremental Restore Failure

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: BR 增量恢复失败

| username: yuqi1129

【TiDB Usage Environment】Production\Testing Environment\POC
Production
【TiDB Version】
4.0.15
【Encountered Problem】
Incremental restore encountered a unique key conflict

[2022/06/23 19:01:40.184 +08:00] [ERROR] [db.go:81] ["execute ddl query failed"] [query="ALTER TABLE supplier_environment_total ADD UNIQUE uk_key_code(key_code)"] [db=supply_chain_factory] [historySchemaVersion=2360] [error="[kv:1062]Duplicate entry '' for key 'uk_key_code'"] [errorVerbose="[kv:1062]Duplicate entry '' for key 'uk_key_code'
github.com/pingcap/errors.AddStack
\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174
github.com/pingcap/errors.Trace
\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/juju_adaptor.go:15
github.com/pingcap/tidb/ddl.(*ddl).doDDLJob
\tgithub.com/pingcap/tidb@v1.1.0-beta.0.20210714111333-67b641d5036c/ddl/ddl.go:578
github.com/pingcap/tidb/ddl.(*ddl).CreateIndex
\tgithub.com/pingcap/tidb@v1.1.0-beta.0.20210714111333-67b641d5036c/ddl/ddl_api.go:4034
github.com/pingcap/tidb/ddl.(*ddl).AlterTable
\tgithub.com/pingcap/tidb@v1.1.0-beta.0.20210714111333-67b641d5036c/ddl/ddl_api.go:2117
github.com/pingcap/tidb/executor.(*DDLExec).executeAlterTable
\tgithub.com/pingcap/tidb@v1.1.0-beta.0.20210714111333-67b641d5036c/executor/ddl.go:366
github.com/pingcap/tidb/executor.(*DDLExec).Next
\tgithub.com/pingcap/tidb@v1.1.0-beta.0.20210714111333-67b641d5036c/executor/ddl.go:86
github.com/pingcap/tidb/executor.Next
\tgithub.com/pingcap/tidb@v1.1.0-beta.0.20210714111333-67b641d5036c/executor/executor.go:262
github.com/pingcap/tidb/executor.(*ExecStmt).handleNoDelayExecutor
\tgithub.com/pingcap/tidb@v1.1.0-beta.0.20210714111333-67b641d5036c/executor/adapter.go:531
github.com/pingcap/tidb/executor.(*ExecStmt).handleNoDelay
\tgithub.com/pingcap/tidb@v1.1.0-beta.0.20210714111333-67b641d5036c/executor/adapter.go:413
github.com/pingcap/tidb/executor.(*ExecStmt).Exec
\tgithub.com/pingcap/tidb@v1.1.0-beta.0.20210714111333-67b641d5036c/executor/adapter.go:366
github.com/pingcap/tidb/session.runStmt
\tgithub.com/pingcap/tidb@v1.1.0-beta.0.20210714111333-67b641d5036c/session/tidb.go:322
github.com/pingcap/tidb/session.(*session).ExecuteStmt
\tgithub.com/pingcap/tidb@v1.1.0-beta.0.20210714111333-67b641d5036c/session/session.go:1381
github.com/pingcap/tidb/session.(*session).ExecuteInternal
\tgithub.com/pingcap/tidb@v1.1.0-beta.0.20210714111333-67b641d5036c/session/session.go:1132
github.com/pingcap/br/pkg/gluetidb.(*tidbSession).Execute
\tgithub.com/pingcap/br@/pkg/gluetidb/glue.go:109
github.com/pingcap/br/pkg/restore.(*DB).ExecDDL
\tgithub.com/pingcap/br@/pkg/restore/db.go:79
github.com/pingcap/br/pkg/restore.(*Client).ExecDDLs
\tgithub.com/pingcap/br@/pkg/restore/client.go:500
github.com/pingcap/br/pkg/task.RunRestore
\tgithub.com/pingcap/br@/pkg/task/restore.go:292
main.runRestoreCommand
\tgithub.com/pingcap/br@/cmd/br/restore.go:25
main.newFullRestoreCommand.func1
\tgithub.com/pingcap/br@/cmd/br/restore.go:97
github.com/spf13/cobra.(*Command).execute
\tgithub.com/spf13/cobra@v1.0.0/command.go:842
github.com/spf13/cobra.(*Command).ExecuteC
\tgithub.com/spf13/cobra@v1.0.0/command.go:950
github.com/spf13/cobra.(*Command).Execute
\tgithub.com/spf13/cobra@v1.0.0/command.go:887
main.main
\tgithub.com/pingcap/br@/cmd/br/main.go:56
runtime.main
\truntime/proc.go:203
runtime.goexit
\truntime/asm_amd64.s:1357"] [stack="github.com/pingcap/br/pkg/restore.(*DB).ExecDDL
\tgithub.com/pingcap/br@/pkg/restore/db.go:81
github.com/pingcap/br/pkg/restore.(*Client).ExecDDLs
\tgithub.com/pingcap/br@/pkg/restore/client.go:500
github.com/pingcap/br/pkg/task.RunRestore
\tgithub.com/pingcap/br@/pkg/task/restore.go:292
main.runRestoreCommand
\tgithub.com/pingcap/br@/cmd/br/restore.go:25
main.newFullRestoreCommand.func1
\tgithub.com/pingcap/br@/cmd/br/restore.go:97
github.com/spf13/cobra.(*Command).execute
\tgithub.com/spf13/cobra@v1.0.0/command.go:842
github.com/spf13/cobra.(*Command).ExecuteC
\tgithub.com/spf13/cobra@v1.0.0/command.go:950
github.com/spf13/cobra.(*Command).Execute
\tgithub.com/spf13/cobra@v1.0.0/command.go:887
main.main
\tgithub.com/pingcap/br@/cmd/br/main.go:56
runtime.main
\truntime/proc.go:203"]

After investigation: during BR restore, the DDLs are replayed first and the data is restored afterwards. This logic works fine when restoring a full backup, but it causes problems when restoring an incremental backup. For example, for a table:

  • At t1, a DDL adds a column with a default value of 1.
  • At t2, the column values (initially all the default 1) are updated to eliminate duplicates.
  • At t3, a unique index is added on that column.

Time relationship: t1 < t2 < t3

If the DDLs are executed first, i.e., the column is added and then the unique index is added, and the data is only restored afterwards, the result is a unique key conflict.
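
To make the failure mode concrete, here is a minimal sketch of the conflict (the table name, column name, and connection parameters are made up for illustration, not taken from the actual cluster). Running the two DDLs back to back without the intervening value updates, which is effectively what the DDL-first restore order does, fails on the unique index:

# Hypothetical minimal demonstration, run against a throwaway TiDB instance:
mysql -h 127.0.0.1 -P 4000 -u root test <<'SQL'
CREATE TABLE t (id INT PRIMARY KEY);
INSERT INTO t VALUES (1), (2);
-- "t1": add a column; both existing rows take the default value 1
ALTER TABLE t ADD COLUMN c INT DEFAULT 1;
-- "t3": replayed before the "t2" updates are restored, so both rows still
-- hold the duplicate default and the statement fails with error 1062
ALTER TABLE t ADD UNIQUE INDEX uk_c (c);
SQL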

Regarding BR incremental restore, I read an article stating that it is a GA feature in 4.x and 5.x but becomes an experimental feature in 6.x. Is this because a bug was discovered?

| username: xiaohetao | Original post link

Did you specify the last backup ts when you backed up?

| username: yuqi1129 | Original post link

Well, specifying the last backup ts means incremental backup.
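
For reference, a sketch of how such backups would be taken (the PD address and storage paths below are placeholders, not the ones actually used here; --lastbackupts is the flag that makes the second backup incremental):

# Full backup first (placeholder PD address and local storage path):
br backup full --pd "127.0.0.1:2379" --storage "local:///data/backup_full"

# Read the end timestamp of the full backup to use as the incremental start point:
LAST_BACKUP_TS=$(br validate decode --field="end-version" -s "local:///data/backup_full" | tail -n1)

# Passing --lastbackupts turns this into an incremental backup of the changes since then:
br backup full --pd "127.0.0.1:2379" --storage "local:///data/backup_incr" \
  --lastbackupts "${LAST_BACKUP_TS}"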

| username: yuqi1129 | Original post link

Isn’t this document only available for 6.1? 6.1 made incremental backup an experimental feature; the document didn’t exist for 5.x and 4.x.

| username: yuqi1129 | Original post link

[The original reply was an image that is not available in this translation.]

| username: yuqi1129 | Original post link

This description feels too brief. Based on the principle (the DDLs are executed first and then the data is loaded, rather than replaying operations in their actual order), there will generally be problems during the incremental period whenever there are DDL operations such as adding a unique index.

| username: xiaohetao | Original post link

The principle is the same; all the key points have been covered, and the documentation is becoming more and more complete.

| username: xiaohetao | Original post link

After your previous full backup restoration, did you check whether the supplier_environment_total table already had the UNIQUE index uk_key_code(key_code)?

| username: yuqi1129 | Original post link

No, this UNIQUE key was added during the incremental backup period.

| username: yuqi1129 | Original post link

Is there an expert who can help answer this question?

| username: IANTHEREAL | Original post link

Before version 6.0, the documentation on incremental backup functionality was incomplete. We have improved it in version 6.1. We apologize for any misunderstanding caused.

| username: IANTHEREAL | Original post link

If t3 is executed first while the data written between t1 and t2 still contains conflicts, there will definitely be a conflict error during recovery. Would there be any problem with restoring in chronological order instead?

| username: yuqi1129 | Original post link

I can confirm that it is strictly in chronological order. In the example above:

  • At time t0, a full backup was completed.
  • At time t1, a column was added with a default value of 1.
  • At time t2, the column values (initially all the default 1) were updated to eliminate duplicates.
  • At time t3, a unique index was added.
  • At time t4, an incremental backup was made for the data between [t0, t4].

During restoration, first restoring the state at t0 poses no issues. However, when restoring the state at t4, because the DDL operations (adding the column and the unique index) are performed first and the data is imported afterwards, unique key conflicts occur.

| username: IANTHEREAL | Original post link

BR incremental restore will execute the DDL that occurred during the incremental backup, so there is no need to execute it manually. If you need to perform additional DDL, you can choose to do so after the incremental restore. How about trying not to manually execute DDL before the incremental restore?
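
A sketch of the intended restore sequence, continuing the placeholder paths above, with no manual DDL anywhere in between:

# Restore the full backup first:
br restore full --pd "127.0.0.1:2379" --storage "local:///data/backup_full"

# Then restore the incremental backup; BR replays the DDLs recorded during the
# incremental period itself before ingesting the incremental row data:
br restore full --pd "127.0.0.1:2379" --storage "local:///data/backup_incr"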

| username: yuqi1129 | Original post link

I didn’t execute DDL manually; all DDLs were executed through BR. If there are no DDLs during the BR incremental period, there should logically be no issues.

| username: yuqi1129 | Original post link

What I want to express is that during incremental backup recovery, the DDL in the incremental backup executed by BR encounters errors, and it has nothing to do with manual execution.

What I have been trying to express is that if there are DDL operations in the incremental backup data, errors might occur during recovery.

(PS: Is it my expression that’s problematic? :joy:)

| username: yilong | Original post link

  1. How exactly are incremental backups and restores performed? From your explanation, it seems the error is caused by adding a unique index after setting the default value first.
  2. Based on your backup and restore times, I will try to reproduce the issue. (For example, are t1, t2, and t3 all in one incremental backup?)

| username: yuqi1129 | Original post link

I can confirm that it follows a strict timeline. The example above:

t0: A full backup is completed.
t1: A column is added with a default value of 1.
t2: The column values (initially all the default 1) are updated to eliminate duplicates.
t3: A unique index is added.
t4: Incremental backup of data between [t0, t4].

t0 < t1 < t2 < t3 < t4

During recovery, restoring the full backup at t0 is not an issue. However, when restoring the incremental backup at t4, since the DDL operations (adding the column, adding the unique index) are performed before the data is imported, unique key conflicts occur.

| username: yuqi1129 | Original post link

I have already reproduced this locally.

At 14:40, a full backup was taken.
Next, the following statements were executed:

alter table unique_test add age int default 1;
update unique_test set age = 2 where id = 1;
update unique_test set age = 3 where id = 3;
alter table unique_test add unique index uni_idx(age);
Completed at 14:45.
Then, a backup of the incremental changes between 14:40 and 14:46 was taken.
Finally, restore the full backup from 14:40 and the incremental backup from 14:40 to 14:46 in sequence.

| username: yilong | Original post link

  1. I can reproduce it in version v5.4.0 and have recorded an issue: BR Restore incremental data encountered Error: [kv:1062]Duplicate entry · Issue #1471 · pingcap/br · GitHub
  2. Initially, I planned to drop the index and then test an incremental backup. However, GC in the test environment has already advanced past the backup timestamp, so the incremental backup is no longer possible. I assume GC in your production environment has also passed that point, right? (See the sketch after this list for how to check.)
  3. Currently, it seems that a full backup might be needed again.
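
A sketch of how to check this (connection parameters are placeholders; on 4.0/5.x the GC state is exposed through the mysql.tidb table):

# Check how far GC has advanced and the configured GC life time:
mysql -h 127.0.0.1 -P 4000 -u root -e \
  "SELECT VARIABLE_NAME, VARIABLE_VALUE FROM mysql.tidb
   WHERE VARIABLE_NAME IN ('tikv_gc_safe_point', 'tikv_gc_life_time');"

# If another incremental backup is planned, the GC life time can be lengthened
# temporarily so the needed versions are kept (and set back afterwards):
mysql -h 127.0.0.1 -P 4000 -u root -e \
  "UPDATE mysql.tidb SET VARIABLE_VALUE = '72h' WHERE VARIABLE_NAME = 'tikv_gc_life_time';"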