Questions about PiTR Consistency

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: PiTR一致性的疑问

| username: TiDBer_VDhlK8wi

After reading the PiTR documentation for TiDB v6 (PITR Feature Introduction | PingCAP Archived Documentation Site), I have some questions:

Suppose a distributed transaction DT1 executes across three nodes, finishing at these local times:
Node A: 1s800ms
Node B: 1s950ms
Node C: 2s100ms
Each node's binlog would record the change at its own local time. If I specify a recovery point of 2s000ms, nodes A and B would be restored successfully, but node C, whose write completes at 2s100ms, would not be replayed. The distributed transaction DT1 would then be missing its sub-transaction on node C.

In TiDB's PiTR design, will node C's recovery point be pushed slightly later so that this 2s100ms sub-transaction is also rolled forward?
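
To make the concern concrete, here is a minimal sketch (the logRecord type and the per-node log layout are invented for illustration, not TiDB's actual format) of what would go wrong if each node simply cut its own log at the requested wall-clock time:

```go
package main

import "fmt"

// logRecord is a hypothetical per-node log entry: the transaction it belongs
// to and the node-local wall-clock time (in ms) at which the write finished.
type logRecord struct {
	txn       string
	localDone int
}

func main() {
	// DT1's sub-transactions finish at different local times on A, B, and C.
	logs := map[string][]logRecord{
		"A": {{"DT1", 1800}},
		"B": {{"DT1", 1950}},
		"C": {{"DT1", 2100}},
	}
	const restorePoint = 2000 // recover to 2s000ms

	// A naive restore that cuts each node's log at its own local time would
	// replay DT1 on A and B but drop it on C, breaking atomicity.
	for node, recs := range logs {
		for _, r := range recs {
			if r.localDone <= restorePoint {
				fmt.Printf("node %s: replay %s (finished at %dms)\n", node, r.txn, r.localDone)
			} else {
				fmt.Printf("node %s: skip %s (finished at %dms) -> DT1 becomes partial\n", node, r.txn, r.localDone)
			}
		}
	}
}
```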

| username: yilong | Original post link

The commit ts in the two-phase commit is the same on every node.
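
To spell that out, here is a toy Percolator-style sketch; the tsoNext and nodeOf helpers are made up for illustration and are not TiDB APIs. Every key of the transaction is stamped with the same commit ts obtained once from the TSO, even though each node finishes its physical write at a different wall-clock moment:

```go
package main

import "fmt"

var tso uint64 = 100

// tsoNext is a stand-in for PD's TSO service: a single, globally ordered
// timestamp source.
func tsoNext() uint64 { tso++; return tso }

// nodeOf is pretend sharding: the first character of the key names the node.
func nodeOf(key string) string { return string(key[0]) }

func main() {
	keys := []string{"A_row1", "B_row7", "C_row3"}

	startTS := tsoNext()
	// ... prewrite all keys with startTS, locking them ...
	commitTS := tsoNext() // one commit ts for the whole transaction

	for _, k := range keys {
		fmt.Printf("commit key %q on node %s at commit ts %d (start ts %d)\n",
			k, nodeOf(k), commitTS, startTS)
	}
	// Every node records the same commit ts, even though the physical write
	// on each node finishes at a different wall-clock instant.
}
```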

| username: TiDBer_VDhlK8wi | Original post link

Thank you for the reply. I agree that the commit ts of the two-phase commit stays consistent, because the ts obtained from the TSO is handed to every node participating in the distributed transaction. However, the nodes cannot be guaranteed to finish executing at the same moment; it is impossible for all of them to communicate and complete together (hardware delays, network delays, and so on), so the completion times recorded in the binlogs of the different sub-transactions will certainly differ. My question is exactly here: if nodes A, B, and C finish at different times and I specify a time to recover to, and A, B, and C simply apply my time without handling the consistency of this distributed transaction during recovery, sub-transactions will be lost. And if we want to handle this distributed transaction correctly, we would have to record all the nodes involved in it.

This issue does not occur during production but arises during PiTR (Point-in-Time Recovery). I need to understand more details about the implementation of PiTR, specifically regarding the consistency of distributed transactions.

| username: IANTHEREAL | Original post link

When performing the backup required for PITR, nodes A, B, and C will periodically record their own local checkpoint ts, indicating that data with txn commit ts earlier than the checkpoint ts on that node has already been backed up. Then, the minimum checkpoint ts of all nodes is calculated to be the global checkpoint of the backup.

When performing a point-in-time recovery, PITR can only restore to a point before the global checkpoint, ensuring that no data is lost.

This is PITR's guarantee, and it is unrelated to when the data was physically written. Taking the example above: if DT1's commit ts < 2s000ms and the user asks to recover to 2s000ms, but DT1's data on node C has not been backed up yet, then PITR will report an error and refuse to perform the recovery.
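
As a minimal sketch of that rule (illustrative only; globalCheckpoint and checkRestorePoint are invented names, not BR's real code): the global checkpoint is simply the minimum of the nodes' local checkpoint ts, and a restore request beyond it is rejected.

```go
package main

import "fmt"

// globalCheckpoint returns the minimum of all nodes' local checkpoint ts:
// every change with commit ts <= this value has been backed up on every node.
func globalCheckpoint(local map[string]uint64) uint64 {
	first := true
	var min uint64
	for _, ts := range local {
		if first || ts < min {
			min, first = ts, false
		}
	}
	return min
}

// checkRestorePoint rejects a requested restore ts that lies beyond the
// global checkpoint, so a partially backed-up transaction is never restored.
func checkRestorePoint(restoreTS uint64, local map[string]uint64) error {
	if gc := globalCheckpoint(local); restoreTS > gc {
		return fmt.Errorf("restore ts %d is beyond global checkpoint %d: not all data has been backed up yet", restoreTS, gc)
	}
	return nil
}

func main() {
	// DT1 committed at ts 1900, but node C has not backed it up yet, so C's
	// local checkpoint still lags behind A and B.
	local := map[string]uint64{"A": 2500, "B": 2400, "C": 1850}

	fmt.Println(checkRestorePoint(1800, local)) // <nil>: allowed
	fmt.Println(checkRestorePoint(2000, local)) // error: beyond global checkpoint 1850
}
```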

You can refer to the design overview here: docs-cn/br/br-log-architecture.md at 57c080287da40c3b4f44ff8684c045fc818a140e · pingcap/docs-cn · GitHub

| username: maxshuang | Original post link

I understand that the original poster is concerned that even if distributed transactions obtain the same commitTs, the primary key will consider the transaction committed once it persists the commitTs, while other keys are processed asynchronously, potentially causing PiTR to miss some data.

This won't happen. Before the other keys persist the commitTs, they carry a lock record stamped with the transaction's startTs. When PiTR encounters such a lock record, it resolves the lock according to the transaction's status, either rolling it forward (committing it) or deleting it (rolling it back), and only after the lock has been handled will it advance the local checkpointTs.

In other words, even if the other nodes' local checkpointTs have already advanced past a certain commitTs, say 10, as long as there is still one node on which the transaction with commitTs 10 has not been fully committed, that node's local checkpointTs stays at 9. At that point the global checkpointTs is also still 9, because complete recovery of the transaction with commitTs 10 cannot yet be guaranteed.
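
Here is a minimal sketch of that holding-back behavior, a deliberate simplification of TiKV's resolved-ts/checkpoint calculation (the lock type and localCheckpoint function are invented): a node's reportable checkpoint cannot pass an unresolved lock, so it stays just below the pending transaction's ts until the lock is committed or rolled back.

```go
package main

import "fmt"

// lock is a pending Percolator-style lock: a key whose commit record has not
// been persisted yet, identified by the owning transaction's start ts.
type lock struct {
	key     string
	startTS uint64
}

// localCheckpoint is a simplified checkpoint calculation: the node may report
// everything it has backed up, but never at or beyond an unresolved lock,
// since that data may still belong to an in-flight transaction.
func localCheckpoint(backedUpTS uint64, locks []lock) uint64 {
	cp := backedUpTS
	for _, l := range locks {
		if l.startTS-1 < cp {
			cp = l.startTS - 1
		}
	}
	return cp
}

func main() {
	// One node still holds the lock of a transaction whose keys have not all
	// persisted their commit ts (numbers simplified so the pending ts is 10).
	locks := []lock{{key: "row42", startTS: 10}}

	fmt.Println(localCheckpoint(15, locks)) // 9: held back by the unresolved lock
	// After the lock is resolved (committed or rolled back according to the
	// transaction's status), the checkpoint can advance again.
	fmt.Println(localCheckpoint(15, nil)) // 15
}
```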

| username: TiDBer_VDhlK8wi | Original post link

Thank you both for your replies. I have learned a lot, but I still have some questions. I have drawn a diagram.

Assume there are three nodes A, B, and C, and a full backup was done at T1.

Then at T2, there are three distributed transactions TD1, TD2, and TD3, with their respective start ts and commit ts shown in the diagram.

The first question: when TiDB performs PiTR, does each node restore its logs by specifying a commit ts? Or is it like MySQL's binlog, where you can only specify a position or a local timestamp to restore to?

If, as in the first case, restore is driven by commit ts, then each node's logs would have to be merged by commit ts into complete transactions such as TD1, TD2, and TD3 before being applied. Is this how it is implemented?

If instead it is like MySQL, where you restore to a local position/ts, then it must be guaranteed that the transaction which obtains its ts from PD first also commits first, and later transactions can only commit after it has committed successfully. Otherwise, TD1, TD2, and TD3 will not be applied in order but out of order.

In that case I can easily construct a transaction TD1 that obtains commit ts = ts2 and completes normally on node B, where the local time is also ts2 (NTP keeps the nodes' clocks roughly in sync, but an error of more than 200 ms is still possible). On node A, however, for some reason TD1 does not finish locally until ts8 or even later.

If we then use PiTR to restore to the point in time ts2, node B will load transaction TD1 as expected, but node A will inevitably miss it, because TD1 only completed locally at ts8. After the PiTR, TD1 would exist on node B but not on node A.
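
To make the scenario concrete, here is a hypothetical sketch (the change record layout is invented) contrasting a restore that filters by the globally assigned commit ts with one that filters by node-local completion time; only the former keeps TD1 atomic across A and B.

```go
package main

import "fmt"

// change is a hypothetical backed-up record: the transaction it belongs to,
// the commit ts assigned by the TSO, and the node-local wall-clock time at
// which the write physically finished.
type change struct {
	node      string
	txn       string
	commitTS  uint64
	localDone uint64
}

func main() {
	changes := []change{
		{node: "B", txn: "TD1", commitTS: 2, localDone: 2}, // finished on B at local ts2
		{node: "A", txn: "TD1", commitTS: 2, localDone: 8}, // same commit ts, A finished only at local ts8
	}
	const restoreTS = 2

	// Filtering by the globally assigned commit ts keeps TD1 atomic:
	// either both sub-transactions are restored or neither is.
	fmt.Println("filter by commit ts:")
	for _, c := range changes {
		if c.commitTS <= restoreTS {
			fmt.Printf("  restore %s on node %s\n", c.txn, c.node)
		}
	}

	// Filtering by node-local completion time (the MySQL-binlog-style view in
	// the question) restores TD1 on B but silently drops it on A.
	fmt.Println("filter by local time:")
	for _, c := range changes {
		if c.localDone <= restoreTS {
			fmt.Printf("  restore %s on node %s\n", c.txn, c.node)
		}
	}
}
```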

From @maxshuang's reply, I understand this is handled at backup time, but is it handled the same way during PiTR itself? After all, PiTR only replays logs.

From @IANTHEREAL's reply, I understand that a global checkpoint ts is obtained, but if it is derived only from the smallest local commit ts of each node, without considering which nodes are involved in the distributed transactions at that moment, some distributed transactions could still be restored inconsistently.

| username: TiDBer_VDhlK8wi | Original post link

Could the experts please continue to explain this issue?