Note:
This topic has been translated from a Chinese forum by GPT and might contain errors. Original topic: When PITR truncates logs up to a future time, it becomes impossible to do a consistent restore to a specified point in time (PITR删除截止到未来一个时间的日志时会导致无法做一致性恢复到指定时间点)

[Problem Scenario Involved in the Requirement]
Under normal circumstances:
- Enable the log backup task for PITR.
- Perform a full backup of the cluster.
- Clean up logs with a cutoff N days in the future (i.e., truncate --until 'now() + N days', a time later than the next backup time).
- Perform a full backup of the cluster.
- Use PITR to perform a consistent point-in-time recovery of the cluster (while the current time has not yet reached now() + N days); the restore errors out and fails.
[Expected Requirement Behavior]
When truncate --until 'now() + N days' is issued, record the current time, or the last archived time, in the v1_stream_truncate_safepoint.txt file instead of the future time.
[Alternative Solution for the Requirement]
Do not pass a future time as the cleanup cutoff.
Reproduction Steps:
# Enable br log archive
[tidb@localhost backup]$ ./br log start -u 192.168.31.201:2379 -s /data/backup/log --task-name=pitr
Detail BR log in /tmp/br.log.2023-11-03T22.59.39+0800
[2023/11/03 22:59:39.704 +08:00] [INFO] [collector.go:77] ["log start"] [streamTaskInfo="{taskName=pitr,startTs=445388837262983175,endTS=999999999999999999,tableFilter=*.*}"] [pausing=false] [rangeCount=2]
[2023/11/03 22:59:43.242 +08:00] [INFO] [collector.go:77] ["log start success summary"] [total-ranges=0] [ranges-succeed=0] [ranges-failed=0] [backup-checksum=10.426828ms] [total-take=3.60572121s]
# Perform the first full backup of the cluster as the base for the br log.
[tidb@localhost backup]$ ./br backup full -u 192.168.31.201:2379 -s /data/backup/data
Detail BR log in /tmp/br.log.2023-11-03T23.00.06+0800
Full Backup <-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
Checksum <----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
[2023/11/03 23:01:23.691 +08:00] [INFO] [collector.go:77] ["Full Backup success summary"] [total-ranges=228] [ranges-succeed=228] [ranges-failed=0] [backup-checksum=5.81226722s] [backup-fast-checksum=4.925917ms] [backup-total-ranges=162] [total-take=1m16.807251374s] [backup-data-size(after-compressed)=5.011GB] [Size=5011473467] [BackupTS=445388844668551189] [total-kv=105338332] [total-kv-size=18.8GB] [average-speed=244.8MB/s]
[tidb@localhost backup]$ ./br log status --task-name=pitr -u 192.168.31.201:2379
Detail BR log in /tmp/br.log.2023-11-03T23.01.44+0800
● Total 1 Tasks.
> #1 <
name: pitr
status: ● NORMAL
start: 2023-11-03 22:59:39.647 +0800
end: 2090-11-18 22:07:45.624 +0800
storage: local:///data/backup/log
speed(est.): 0.00 ops/s
checkpoint[global]: 2023-11-03 22:59:39.647 +0800; gap=2m6s
# In an emergency scenario, I plan to clear all logs (for convenience, I simply passed a future time); the intent is to take a fresh full backup next and let the logs keep archiving automatically.
[tidb@localhost backup]$ ./br log truncate --until "`date -d "+24 hours" "+%Y-%m-%d %H:%M:%S%z"`" -s "local:///data/backup/log"
Detail BR log in /tmp/br.log.2023-11-03T23.06.03+0800
Reading Metadata... DONE { take = 0s }
We are going to remove 3 files, until 2023-11-04 23:06:03.0000.
Sure? (y/N) y
Clearing Data Files and Metadata :: DONE { take = 1ms, kv-count = 303791, kv-size = 13222545(13.22MB) }
[tidb@localhost backup]$ rm -rf /data/backup/data/*
# Perform another full backup of the cluster
[tidb@localhost backup]$ ./br backup full -u 192.168.31.201:2379 -s /data/backup/data
Detail BR log in /tmp/br.log.2023-11-03T23.06.32+0800
Full Backup <-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
Checksum <----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
[2023/11/03 23:07:48.966 +08:00] [INFO] [collector.go:77] ["Full Backup success summary"] [total-ranges=230] [ranges-succeed=230] [ranges-failed=0] [backup-checksum=5.858122395s] [backup-fast-checksum=4.970793ms] [backup-total-ranges=164] [total-take=1m16.051382355s] [backup-data-size(after-compressed)=5.026GB] [Size=5025814751] [BackupTS=445388945868980233] [total-kv=105642026] [total-kv-size=18.84GB] [average-speed=247.7MB/s]
[tidb@localhost backup]$ ./br log status --task-name=pitr -u 192.168.31.201:2379
Detail BR log in /tmp/br.log.2023-11-03T23.08.35+0800
● Total 1 Tasks.
> #1 <
name: pitr
status: ● NORMAL
start: 2023-11-03 22:59:39.647 +0800
end: 2090-11-18 22:07:45.624 +0800
storage: local:///data/backup/log
speed(est.): 119925.14 ops/s
checkpoint[global]: 2023-11-03 23:06:55.546 +0800; gap=1m41s
# Now suppose the cluster runs into a problem and an emergency recovery to the failure point is needed.
[tidb@localhost backup]$ ./br log stop -u 192.168.31.201:2379 -s /data/backup/log --task-name=pitr
Detail BR log in /tmp/br.log.2023-11-03T23.10.32+0800
[2023/11/03 23:10:32.676 +08:00] [INFO] [collector.go:77] ["log stop"] [streamTaskInfo="{taskName=pitr,startTs=445388837262983175,endTS=999999999999999999,tableFilter=*.*}"]
[2023/11/03 23:10:36.245 +08:00] [INFO] [collector.go:77] ["log stop success summary"] [total-ranges=0] [ranges-succeed=0] [ranges-failed=0] [total-take=3.610060354s]
# It turns out br restore point cannot restore to a consistent point in time, because v1_stream_truncate_safepoint.txt records the future cleanup time taken from the manually passed --until parameter.
[tidb@localhost backup]$ ./br restore point -u 192.168.31.201:2379 --storage='/data/backup/log' --full-backup-storage='/data/backup/data' --restored-ts='2023-11-03 23:06:55.546 +0800'
Detail BR log in /tmp/br.log.2023-11-03T23.13.42+0800
Error: it has gap between full backup ts:445388945868980233(2023-11-03 23:06:33.946 +0800 CST) and log backup ts:445411586998272000(2023-11-04 23:06:03 +0800 CST). : [BR:Common:ErrInvalidArgument]invalid argument
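To confirm that the safepoint really points into the future, the file can be decoded by hand. This is a sketch rather than part of the original session; it assumes v1_stream_truncate_safepoint.txt holds a single decimal TSO and that the upper bits of a TiDB TSO (TSO >> 18) are a millisecond timestamp:
# Sketch (not from the original session): decode the recorded safepoint.
[tidb@localhost backup]$ tso=$(cat log/v1_stream_truncate_safepoint.txt)
[tidb@localhost backup]$ date -d "@$(( (tso >> 18) / 1000 ))" '+%Y-%m-%d %H:%M:%S %z'
# This should print a time roughly 24 hours in the future, matching the error above.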
# This time point needs to be manually changed back to a past time. But generating the TSO for 2023-11-03 22:46:55.546 +0800 with SELECT conv( concat( bin( unix_timestamp('2023-11-03 22:46:55.546 +0800') * 1000 ), '000000000000000001' ), 2, 10 ); and writing it into log/v1_stream_truncate_safepoint.txt does not work:
mysql> SELECT conv( concat( bin( unix_timestamp('2023-11-03 22:46:55.546 +0800') * 1000 ), '000000000000000001' ), 2, 10 ) as tso;
+--------------------+
| tso                |
+--------------------+
| 445388636958490625 |
+--------------------+
1 row in set, 1 warning (0.00 sec)
mysql> select TIDB_PARSE_TSO(445388636958490625);
+------------------------------------+
| TIDB_PARSE_TSO(445388636958490625) |
+------------------------------------+
| 2023-11-03 22:46:55.546000 |
+------------------------------------+
1 row in set (0.00 sec)
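For reference, the same conversion can be done from the shell. This is a sketch, assuming GNU date and the TSO layout implied by the SQL above (epoch milliseconds shifted left 18 bits, plus an 18-bit logical counter):
# Sketch: compose the TSO for a given wall-clock time in the shell (assumes GNU date).
[tidb@localhost backup]$ ts_ms=$(date -d '2023-11-03 22:46:55.546 +0800' +%s%3N)
[tidb@localhost backup]$ echo $(( (ts_ms << 18) | 1 ))
445388636958490625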
[tidb@localhost backup]$ echo "445388636958490625" >log/v1_stream_truncate_safepoint.txt
[tidb@localhost backup]$ ./br restore point -u 192.168.31.201:2379 --storage='/data/backup/log' --full-backup-storage='/data/backup/data' --restored-ts='2023-11-03 23:06:55.546 +0800'
Detail BR log in /tmp/br.log.2023-11-03T23.41.00+0800
Error: [BR:Common:ErrInvalidMetaFile]invalid metafile: failed to parse the truncate safepoint
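One thing worth checking before giving up on editing the file (an assumption on my part, not verified against BR's source): whether the bytes written by echo differ from what BR originally wrote, for example a trailing newline:
# Sketch: inspect the exact bytes, then rewrite the file without a trailing newline.
[tidb@localhost backup]$ od -c log/v1_stream_truncate_safepoint.txt
[tidb@localhost backup]$ printf '%s' "445388636958490625" > log/v1_stream_truncate_safepoint.txt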
Suggestion: ./br log truncate --until "future time" should only clear logs up to the current time, and that "current time" should be taken from the last archived log time (the checkpoint) rather than the wall clock. If the wall-clock time were used, logs that are still being archived (br log archiving can lag) would remain in storage, yet could not be used for recovery because safepoint.txt already points past them. Similarly, for truncate --until "specific time", the safepoint should be derived from the logs actually archived: if archiving is extremely slow, logs covering times before the specified point may only land in storage after the truncate has run, so br logs exist on disk yet can never be used for restore. A workaround wrapper is sketched below.
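Until the behavior changes, the requested time can be clamped to the task's global checkpoint before truncating. A rough sketch (not from the original session; it parses the human-readable br log status output shown earlier and assumes GNU date):
# Sketch: never pass an --until later than the task's global checkpoint.
requested="$(date -d '+24 hours' '+%Y-%m-%d %H:%M:%S%z')"
checkpoint="$(./br log status --task-name=pitr -u 192.168.31.201:2379 \
  | sed -n 's/.*checkpoint\[global\]: \([^;]*\);.*/\1/p')"
if [ "$(date -d "$requested" +%s)" -gt "$(date -d "$checkpoint" +%s)" ]; then
  until_ts="$checkpoint"
else
  until_ts="$requested"
fi
./br log truncate --until "$until_ts" -s "local:///data/backup/log"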
Additionally, for the current situation: the truncate safepoint has already been mistakenly written as a future time, so if I want to write a past time instead, how should it be converted into the TSO this file expects? (I cannot fix the file by re-executing log truncate --until "past time"; it always keeps the maximum time ever passed.)