Details on TiKV Rollback Records protect_rollback: Why Can It Be Unprotected?

translator_bot · June 21, 2024, 1:16am

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIVK 回滚记录 protect_rollback 细节问题：为何可以不被 protect？

| username: ylldty

In the TiKV code, the Rollback interface normally writes a rollback record to the write CF. That way, if a stale prewrite arrives after the rollback has completed (for example, because of a network delay or retry), prewrite detects the rollback record and aborts.
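The check described above can be pictured with a minimal sketch. This is a toy model, not TiKV's implementation: `ToyWriteCf`, its `rollback` and `prewrite` methods, and the error text are all hypothetical names introduced only for illustration, and the real write CF stores far more than rollback markers.

```rust
use std::collections::BTreeSet;

/// Toy stand-in for the write CF (hypothetical, for illustration only).
/// It remembers which (key, start_ts) pairs carry a rollback record.
#[derive(Default)]
struct ToyWriteCf {
    rollbacks: BTreeSet<(String, u64)>,
}

impl ToyWriteCf {
    /// Rolling back a transaction leaves a marker at its start_ts.
    fn rollback(&mut self, key: &str, start_ts: u64) {
        self.rollbacks.insert((key.to_string(), start_ts));
    }

    /// A prewrite for the same transaction is refused once the marker exists.
    fn prewrite(&self, key: &str, start_ts: u64) -> Result<(), String> {
        if self.rollbacks.contains(&(key.to_string(), start_ts)) {
            return Err(format!(
                "txn {} was already rolled back on key {}",
                start_ts, key
            ));
        }
        Ok(())
    }
}
```

In this model, calling `rollback("k1", 10)` and then `prewrite("k1", 10)` returns an error, while a prewrite from a transaction with a different start_ts is unaffected. The question in this thread is what happens when that marker is never written.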

However, there is a special scenario in the Rollback code where it does not write rollback records to the write CF. The code is as follows:

So in this scenario nothing is written at all, not even an overlap flag. How does prewrite recognize this case and refuse the stale request? Or is there some mechanism that guarantees a prewrite cannot arrive again after the rollback because of network problems?

I found an old comment:

https://github.com/tikv/tikv/blob/c3f9fba14b04811e77614bcd50007bad17e251c3/src/storage/mvcc/txn.rs#L692


      
        }

        // Insert a Rollback to Write CF in case that a stale prewrite command
        // is received after a cleanup command.
        // Pessimistic transactions prewrite successfully only if all its
        // pessimistic locks exist. So collapsing the rollback of a pessimistic
        // lock is safe. After a pessimistic transaction acquires all its locks,
        // it is impossible that neither a lock nor a write record is found.
        // Therefore, we don't need to protect the rollback here.
        let write = Write::new_rollback(ts, false);
        self.put_write(primary_key, ts, write.as_ref().to_bytes());
        MVCC_CHECK_TXN_STATUS_COUNTER_VEC.rollback.inc();

        Ok(TxnStatus::LockNotExist)
    } else {
        Err(ErrorInner::TxnNotFound {
            start_ts: self.start_ts,
            key: primary_key.into_raw()?,
        }
        .into())
    }

The reasoning in this comment was later shown to be wrong by the following issue:

github.com/tikv/tikv

Txn: collapsing the rollback record of pessimistic lock could cause inconsistency

opened 11:35AM - 05 Apr 20 UTC

closed 02:46AM - 08 Apr 20 UTC

andylokandy

type/bug sig/transaction severity/critical

## Bug Report

### What version of TiKV are you using?

master

### What did happened?

A false assertion is found in: https://github.com/tikv/tikv/blob/c3f9fba14b04811e77614bcd50007bad17e251c3/src/storage/mvcc/txn.rs#L684-L691 where pessimistic transactions may commit successfully even when some of its keys it prewrited are rollbacked by cleanup-resolve procedural invoked by a duplicated cleanup command, which is sent before for cleaning up the pessimistic lock.

### Steps to reproduce

Assume that we have three clients {c1, c2, c3} and two keys {k1, k2}:

1. Pessimistic client c1 acquires a pessimistic lock on k1(primary), k2. But the command for k1 is lost at this point.
2. Optimistic client c2 requires to clean up the lock on k2
3. k1 is rollbacked and a write record `("rollback", c1_start_ts, not_protected)` is written into k1 (not_protected because the lock on k1 is missing), and a `cleanup(primary=k1, ts=c1_start_ts)`(*1) is sent but lost at this point.
4. Client c3 prewrites k1
5. Client c2 requires to clean up the lock on k1
6. k1 is rollbacked and the rollback write record is **collapsed** to `("rollback", c3_start_ts, protected/not_protected)`
7. Client c1 retries to lock on k1
8. k1 is locked by c1
9. Client c1 prewrites k1, k2
10. k1, k2 are prewrited by c1, and c1 received the prewrite succeed response
11. The lost cleanup command (*1) in step 3 is received by k2, therefore k2 is rollbacked
12. Client c1 commit k1
13. k1 is committed, while k2 is rollbacked

Then atomic guarantee is broken.

So after this fix, the rollback of a pessimistic transaction's primary key was made protected.
But don't optimistic transactions have the same problem? An optimistic prewrite does not need to check for an existing lock, so why can its rollback also be left unprotected?

| username: neilshen

Writing a non-protected rollback for transaction T means it is already known that T will definitely be rolled back and that no concurrent process can be trying to commit T. This can happen when the transaction rolls itself back (note that this does not mean a ROLLBACK statement, but the rollback performed after a commit fails midway).
Writing a protected rollback for transaction T means it is uncertain whether T might be committed concurrently, so the rollback must be made to stick: the rollback record that is written must never be discarded under any circumstances, and any concurrent commit that sees it must fail. This can happen when another transaction resolves T and decides to roll it back.

(Note: The above reply is from my colleague)
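To make the difference concrete, here is a small sketch of how a non-protected record can disappear while a protected one cannot. It assumes, following the issue quoted earlier in this thread, that writing a newer rollback on a key may collapse (drop) the previous record if that record is a non-protected rollback. `KeyRollbacks`, `write_rollback`, and `is_rolled_back` are hypothetical names for a toy model and are not TiKV's API.

```rust
use std::collections::BTreeMap;

/// Toy rollback records for a single key: start_ts -> protected flag.
/// Hypothetical model for illustration; not TiKV's real data layout.
#[derive(Default)]
struct KeyRollbacks {
    records: BTreeMap<u64, bool>,
}

impl KeyRollbacks {
    /// Write a rollback record at `start_ts`. Before writing, drop the newest
    /// older record if it is a non-protected rollback (the "collapse" step).
    fn write_rollback(&mut self, start_ts: u64, protected: bool) {
        let prev = self
            .records
            .range(..start_ts)
            .next_back()
            .map(|(ts, is_protected)| (*ts, *is_protected));
        if let Some((prev_ts, false)) = prev {
            // Only non-protected records may be discarded.
            self.records.remove(&prev_ts);
        }
        self.records.insert(start_ts, protected);
    }

    /// A late prewrite or commit for `start_ts` can only be rejected on this
    /// key if the rollback record for that start_ts still exists.
    fn is_rolled_back(&self, start_ts: u64) -> bool {
        self.records.contains_key(&start_ts)
    }
}
```

In this model, a non-protected rollback at ts 10 no longer exists after a rollback at ts 20 is written, so a stale request from the transaction with start_ts 10 would not be detected on this key. A protected rollback at ts 10 survives the same sequence.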

| username: ylldty

Suppose optimistic transaction t1 starts its two-phase commit:

1. TiDB sends a prewrite. TiKV receives and processes it, but the response never reaches the TiDB client because of network issues.
2. The TiDB client retries the prewrite, but the retry is stuck in the network and TiKV has not received it yet.
3. The optimistic lock times out.
4. A concurrent transaction t2 calls CheckTxnStatus, finds that t1 has timed out, starts ResolveLock, and performs a non-protected rollback that leaves no rollback record.
5. Because of the network issues, TiKV only now receives the delayed prewrite from step 2.

The scenario I described can never happen, right?

| username: TiDBer_jYQINSnf

Borrowing this thread to ask, what does panic mean in this context? I modified the code myself and encountered a panic here, but I don’t understand how it was generated.

| username: neilshen

First, CheckTxnStatus always writes a protected rollback. If your concern is that the ResolveLock path in the screenshot does not write a protected rollback, so that a late prewrite might succeed, then yes, that is theoretically possible, and it can happen in both optimistic and pessimistic transactions. It does not affect transaction correctness, however, because CheckTxnStatus must already have written a protected rollback before that point. (More rigorously: for regular 2PC transactions, the primary must have received a protected rollback in CheckTxnStatus; for async commit transactions, it is also possible that a secondary received a protected rollback during CheckSecondaryLocks.) The transaction can therefore never reach the committed state, and the lock written by the late prewrite will eventually be cleaned up by ResolveLock.

(Note: The above reply is from my colleague)
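The argument in this reply can be traced in a toy model. The sketch below takes the reply's premise as given (CheckTxnStatus has left a protected rollback on the primary) and simplifies everything else: one in-memory store, no timestamps beyond start_ts, and hypothetical names (`Store`, `check_txn_status_expired`, `resolve_lock`, `prewrite`, `commit_primary`) that are not TiKV's API.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Toy single-node store. All names are hypothetical, for illustration only.
#[derive(Default)]
struct Store {
    /// key -> start_ts of the transaction holding the lock
    locks: BTreeMap<String, u64>,
    /// (key, start_ts) pairs that carry a protected rollback record
    protected_rollbacks: BTreeSet<(String, u64)>,
    /// (key, start_ts) pairs that were committed
    commits: BTreeSet<(String, u64)>,
}

impl Store {
    /// CheckTxnStatus on an expired primary: drop the lock if present and
    /// leave a protected rollback (the premise stated in the reply).
    fn check_txn_status_expired(&mut self, primary: &str, start_ts: u64) {
        if self.locks.get(primary) == Some(&start_ts) {
            self.locks.remove(primary);
        }
        self.protected_rollbacks.insert((primary.to_string(), start_ts));
    }

    /// ResolveLock on a key of a rolled-back txn: drop the lock, write nothing.
    fn resolve_lock(&mut self, key: &str, start_ts: u64) {
        if self.locks.get(key) == Some(&start_ts) {
            self.locks.remove(key);
        }
    }

    /// Prewrite fails if this txn has a rollback record on the key, or if
    /// another txn holds the lock. Otherwise it takes the lock.
    fn prewrite(&mut self, key: &str, start_ts: u64) -> bool {
        if self.protected_rollbacks.contains(&(key.to_string(), start_ts)) {
            return false;
        }
        let held_by_other = self.locks.get(key).map_or(false, |&owner| owner != start_ts);
        if held_by_other {
            return false;
        }
        self.locks.insert(key.to_string(), start_ts);
        true
    }

    /// Committing the primary requires the txn's own lock to be on it.
    fn commit_primary(&mut self, primary: &str, start_ts: u64) -> bool {
        if self.locks.get(primary) != Some(&start_ts) {
            return false;
        }
        self.locks.remove(primary);
        self.commits.insert((primary.to_string(), start_ts));
        true
    }
}
```

In this model, after `check_txn_status_expired` on the primary and `resolve_lock` on a secondary, a late `prewrite` on the secondary succeeds and leaves a lock. A `prewrite` or `commit_primary` on the primary still fails, so nothing is ever committed, and a later `resolve_lock` removes the leftover lock on the secondary.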

| username: ylldty

Thank you for clarifying.
This process involves the interaction of multiple interfaces, which indeed makes it quite difficult to understand. I might add sufficient comments to this code later on, so that future contributors can understand it more easily.

| username: ylldty

From the CheckTxnStatus code, it seems that if the primary lock to be checked times out, the rollback record written is not of the protected type? This seems to differ from the logic you mentioned.

| username: redgame

Optimistic locking does not check the status of the lock, so it may not need to be protected like pessimistic transactions.

| username: TiDBer_aaO4sU46

Both optimistic and pessimistic scenarios are possible, but they do not affect transaction correctness.