What is the general client-side process and principle of pessimistic lock Fair Locking?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 悲观锁 Fair Locking 客户端的流程原理大概是什么呢

| username: ylldty

Below is the information about Fair Locking and its principles that I have roughly gathered from the documentation:

Pessimistic lock fair-locking optimization (Fair Locking)

Enhanced pessimistic lock queueing RFC: https://github.com/tikv/rfcs/pull/100/files?short_path=b1bee83#diff-b1bee83cbc9b96a0f2b6ddcd382ac9d1b97f41d2c8aa840bf9043520af3d86bb

If a workload has frequent single-point pessimistic lock conflicts, the original wake-up mechanism cannot bound how long a transaction waits to acquire a lock, resulting in high tail latency and even lock-acquisition timeouts. Imagine this scenario:

  • Suppose many pessimistic transactions (t1, t2, …) in TiKV are trying to lock the same row of data, all queued up waiting for the current transaction t to commit.

  • At this time, t commits, and TiKV wakes up one of the waiting transactions, t1, to continue. t1 finds that the data it wants to lock has been updated and returns WriteConflict to the client (TiDB), which retries the locking process.

  • Coincidentally, at this moment t2's wait exceeds the wake-up-delay-duration, so it wakes up and also attempts to lock.

  • By the time t1's retried request reaches TiKV, t2 has already acquired the lock, so t1's entire round trip was effectively a no-op (see the sketch below).
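To make that old flow concrete, here is a minimal sketch of the pre-optimization client loop. This is my own illustration, not actual TiDB code; sendPessimisticLock and getForUpdateTS are hypothetical stand-ins for the real RPC and TSO machinery.

```go
package sketch

import "errors"

// Hypothetical stand-ins for the real client machinery.
var errWriteConflict = errors.New("WriteConflict")

func sendPessimisticLock(key []byte, forUpdateTS uint64) error { return errWriteConflict }
func getForUpdateTS() uint64                                   { return 0 }

// lockWithRetry models how the client reacted to WriteConflict before the
// optimization: drop the attempt, refresh for_update_ts, re-run the statement.
func lockWithRetry(key []byte, forUpdateTS uint64) error {
	for {
		err := sendPessimisticLock(key, forUpdateTS)
		if err == nil {
			return nil // lock acquired
		}
		if !errors.Is(err, errWriteConflict) {
			return err
		}
		// The row changed while we were queued. During this extra round
		// trip another woken-up waiter (t2) may take the lock first,
		// making this whole retry a wasted no-op.
		forUpdateTS = getForUpdateTS()
	}
}
```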

To solve this problem, TiKV made a series of optimizations for pessimistic transactions. Let's replay the scenario above with the optimization:

  • Again, many pessimistic transactions (t1, t2, …) in TiKV are trying to lock the same row of data, all queued up waiting for the current transaction t to commit.

  • When t commits, TiKV wakes up the earliest-arrived transaction t1 to continue. Unlike before, t1's locking succeeds: even if it finds that the data has been updated, TiKV does not return a WriteConflict error. Instead, the successful response carries the commit_ts of the latest write record, notifying the client that although the lock was acquired, there was in fact a data conflict.

  • The wake-up-delay-duration no longer takes effect, so other transactions such as t2 will not suddenly wake up and compete for the lock.

  • After the result of t1's successful locking reaches the client, since the committed data has changed, the client may still need to retry the statement with the latest ts (see the sketch below).
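A rough sketch of the client-side decision under the new mode; this is my own illustration, and the field names are not the real kvrpcpb definitions.

```go
package sketch

// Illustrative response shape for a ForceLock-mode pessimistic lock request.
type lockResponse struct {
	LockedWithConflict bool   // lock acquired despite a newer committed write
	LatestCommitTS     uint64 // commit_ts of that newer write
}

// handleLockResult keeps the acquired lock and decides whether the statement
// must be re-executed at a newer snapshot.
func handleLockResult(resp lockResponse, forUpdateTS uint64) (newForUpdateTS uint64, retryStatement bool) {
	if resp.LockedWithConflict && resp.LatestCommitTS > forUpdateTS {
		// The lock is already ours; we only need to re-read at a snapshot
		// no older than the conflicting commit, not re-acquire the lock.
		return resp.LatestCommitTS, true
	}
	return forUpdateTS, false
}
```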

However, I am not very familiar with the related processes on the client side, for example:

How does the client proceed after receiving a fair locking success? Also, since the client may still retry even after a fair locking success, is there a concrete example scenario that illustrates this?

| username: xfworld | Original post link

It is recommended to refer to the official documentation and then combine it with the code to better understand…

Distributed locks are inherently complex…

You can refer to the following:

| username: ylldty | Original post link

Hello, I have read those documents, but they really contain nothing about fair locking. The RFC is quite detailed but lacks examples, which makes it a bit hard to follow. By comparison, the code logic on the TiKV side is relatively easy to understand, but there is little coverage of retry and lock-retry behavior on the client side, and the code comments in TiDB are not easy to follow.

| username: xfworld | Original post link

Well, from the perspective of databases, there are only two types of locks: optimistic locks and pessimistic locks.

If you look at this issue from a system perspective, then there are many more types of locks:

  • Fair locks and unfair locks
  • Spin locks
  • Biased locks
  • Lightweight locks
  • Heavyweight locks

The perspective differs, so: are you looking for information on TiDB's distributed locking mechanism, or on system-level locking mechanisms?

| username: ylldty | Original post link

I am referring to this optimization of the pessimistic lock.

| username: xfworld | Original post link

The title says it clearly: it uses a queueing method, which has nothing to do with fair locks…

The basic idea of the optimization is: when a pessimistic lock request is woken up after being blocked by another lock, try to grant it the lock immediately, instead of returning to the client and letting the client retry. In this way, we avoid a transaction being woken up and then failing to get the lock because the lock was preempted by another transaction.

When a lock is released and a queueing pessimistic lock request is woken up, allow the latter request to resume and continue acquiring the lock (it’s very likely to succeed since we allow locking with conflict).

It’s just to reduce lock conflicts, but there are still potential issues, and it seems there is no solution yet…
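To visualize the queueing idea in the quote above, here is a conceptual Go sketch of "grant on wake-up". TiKV itself is written in Rust; this is only my illustration of the handoff, not TiKV's implementation.

```go
package sketch

import "sync"

// fairLock hands the lock directly to the earliest waiter on release,
// instead of waking the waiter and letting all waiters race again.
type fairLock struct {
	mu      sync.Mutex
	held    bool
	waiters []chan struct{} // FIFO queue of blocked lock requests
}

func (l *fairLock) Lock() {
	l.mu.Lock()
	if !l.held {
		l.held = true
		l.mu.Unlock()
		return
	}
	ch := make(chan struct{})
	l.waiters = append(l.waiters, ch)
	l.mu.Unlock()
	<-ch // woken up with the lock already granted to us
}

func (l *fairLock) Unlock() {
	l.mu.Lock()
	defer l.mu.Unlock()
	if len(l.waiters) == 0 {
		l.held = false
		return
	}
	// Hand the lock to the head of the queue; it never observes "unlocked",
	// so no competitor can preempt it between wake-up and acquisition.
	head := l.waiters[0]
	l.waiters = l.waiters[1:]
	close(head)
}
```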

| username: 哈喽沃德 | Original post link

TiDB Pessimistic Lock Fair Locking is a locking strategy used by TiDB when implementing pessimistic locks. Fair Locking ensures fair competition for resources in concurrent situations. The general process on the client side is as follows:

  1. Client requests a pessimistic lock: When a client needs to acquire a pessimistic lock on a resource, it sends a request to TiDB.
  2. TiDB receives the request: After receiving the client’s request, TiDB processes it according to the Fair Locking strategy.
  3. Fair Locking strategy: Fair Locking ensures fair competition for resources among different transactions. When multiple transactions request a pessimistic lock on the same resource, TiDB decides which transaction can acquire the lock based on certain rules, avoiding situations where one transaction holds the lock for too long, preventing other transactions from executing.
  4. Acquiring the lock: Ultimately, according to the Fair Locking strategy, TiDB allocates the lock to a transaction and notifies the client that the lock has been acquired.
  5. Executing operations: After acquiring the pessimistic lock, the client can perform operations on the resource.
  6. Releasing the lock: Once the client completes the operations on the resource, it releases the pessimistic lock, allowing other transactions to compete for the lock on the resource.

In summary, TiDB’s Fair Locking mechanism ensures fair competition for resources in concurrent situations, avoiding performance issues and unfair competition caused by pessimistic locks.
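To make step 1 of that flow concrete at the API level, here is a minimal sketch using client-go's public pessimistic-transaction API. This is based on my reading of client-go v2; details such as the kv.NewLockCtx constructor and the CurrentTimestamp scope string may differ across versions.

```go
package sketch

import (
	"context"
	"time"

	"github.com/tikv/client-go/v2/kv"
	"github.com/tikv/client-go/v2/txnkv"
)

// lockRowPessimistically: begin a pessimistic transaction, lock a key,
// write it, and commit (commit releases the pessimistic lock).
func lockRowPessimistically(client *txnkv.Client, key, value []byte) error {
	txn, err := client.Begin()
	if err != nil {
		return err
	}
	txn.SetPessimistic(true)

	// for_update_ts is the snapshot at which the lock request is judged.
	forUpdateTS, err := client.CurrentTimestamp("global")
	if err != nil {
		return err
	}
	lockCtx := kv.NewLockCtx(forUpdateTS, 1000 /* lock wait ms */, time.Now())
	if err := txn.LockKeys(context.Background(), lockCtx, key); err != nil {
		return err // without fair locking this may be a WriteConflict retry
	}

	if err := txn.Set(key, value); err != nil {
		return err
	}
	return txn.Commit(context.Background())
}
```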

| username: ylldty | Original post link

Are there any examples involving client CRUD operations? For example:
Does the client always need to retry after a fair locking success? When exactly is a retry necessary?
Also, under what circumstances does a retry after fair locking find that a lock is no longer needed, so that it has to be cleared instead?
The code in this area is fairly hard to read, and there is still little material on it.

| username: ylldty | Original post link

Well, this is the fair lock I meant; the official team later renamed it Fair Locking. The goal is to grant locks in first-come, first-served order to prevent a flood of lock conflicts. The storage layer is relatively easy to understand, but I still lack a concrete picture of how the client coordinates the overall process when combined with select/update/insert/delete statements.

| username: xfworld | Original post link

The document I shared with you describes the mechanism of distributed locks in great detail…

For resource management, the following document will be clearer:

| username: ylldty | Original post link

Thank you for the reply and for sharing the distributed-lock documentation. However, I still haven't seen detailed documentation on Fair Locking, especially its design document and the description of this logic in the image.

| username: 有猫万事足 | Original post link

I didn't find the term “fair” anywhere in this link, so I suspect that even if there are optimizations, they are probably not called fair locking there. The issue link corresponding to this RFC is:

It is still open, and it seems to be a big project that hasn’t been completed yet.

From this, you can find some client-go related issue links.

From this modification record, you can see the following segment:

//
// Aggressive locking refers to the behavior that when a DML in a pessimistic transaction encounters write conflict,
// do not pessimistic-rollback them immediately; instead, keep the already-acquired locks and retry the statement.
// In this way, during retry, if it needs to acquire the same locks that was acquired in the previous execution, the
// lock RPC can be skipped. After finishing the execution, if some of the locks that were acquired in the previous
// execution but not needed in the current retried execution, they will be released.
//
// In aggressive locking state, keys locked by LockKeys will be recorded to a separated buffer. For LockKeys
// invocations that involves only one key, the pessimistic lock request will be performed in ForceLock mode
// (kvrpcpb.PessimisticLockWakeUpMode_WakeUpModeForceLock).
func (txn *KVTxn) StartAggressiveLocking() {
	if txn.aggressiveLockingContext != nil {
		panic("Trying to start aggressive locking while it's already started")
	}
	txn.aggressiveLockingContext = &aggressiveLockingContext{
		lastRetryUnnecessaryLocks: nil,
		currentLockedKeys:         make(map[string]tempLockBufferEntry),
		startTime:                 time.Now(),
	}
}
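Alongside StartAggressiveLocking, client-go also exposes RetryAggressiveLocking, CancelAggressiveLocking, and DoneAggressiveLocking. Below is my own reconstruction of how a DML statement might drive them, not actual TiDB code; execOnce is a placeholder I invented for running the statement once at a given for_update_ts.

```go
package sketch

import (
	"context"

	"github.com/tikv/client-go/v2/txnkv/transaction"
)

// runPessimisticDML drives one DML statement under aggressive locking.
func runPessimisticDML(
	ctx context.Context,
	txn *transaction.KVTxn,
	forUpdateTS uint64,
	execOnce func(forUpdateTS uint64) (conflict bool, newTS uint64, err error),
) error {
	txn.StartAggressiveLocking()
	for {
		conflict, newTS, err := execOnce(forUpdateTS)
		if err != nil {
			txn.CancelAggressiveLocking(ctx) // roll back buffered locks
			return err
		}
		if !conflict {
			// Keep the locks this execution needs; locks acquired by an
			// earlier attempt but unused now are released here.
			txn.DoneAggressiveLocking(ctx)
			return nil
		}
		// Locked with conflict: keep the acquired locks and re-run the
		// statement at the newer snapshot; repeated locks skip the RPC.
		forUpdateTS = newTS
		txn.RetryAggressiveLocking(ctx)
	}
}
```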

So I feel that this RFC in TiDB might not be called Fair Locking but rather Aggressive Locking.
As for the logic in your picture, I indeed couldn’t find it. Could you please tell me where the picture is from?

| username: ylldty | Original post link

The title of the issue linked here is Fair locking, which is actually this one:

This is the renamed issue.

This is the configuration of the system variable, which has been renamed to Fair Locking.

That picture is the design that comes with TiDB:

| username: ylldty | Original post link

I read through the design of TiDB again:

I feel like I understand most of it.

However, I’m still not quite clear about this process:

When a key is locked with conflict, the current statement becomes executing at a different snapshot. However, the statement may have already read data in the expired snapshot (as specified by the for_update_ts of the statement). In this case, we have no choice but retry the current statement with a new for_update_ts.

It says that since the SQL statement may already have read data from an old snapshot, we have to retry the statement so that it reads at the latest for_update_ts.

Maybe my understanding of TiDB's pessimistic lock scenarios is limited. One scenario I do understand, for example:
select * from t where id=1 for update; (the table has no unique index)
In a pessimistic transaction, this select is likely to block on other transactions. With fair locking, TiKV will successfully acquire the row lock for id=1 once those transactions commit, and return the latest data of id=1 along with its commit_ts to TiDB.
In this flow, there seems to be no need to retry the select ... for update statement.

Insert statements probably don't need a retry either, since in most scenarios an insert does not have to read first.

So in a pessimistic transaction, what kind of SQL statements still need to be retried even when fair locking succeeds?

For delete and update statements, such as delete from t where k=2; or update t set a=3 where k=2;
it is possible that while we were blocked, other transactions committed new data for k=2, so TiDB needs to retry: re-query all rowids matching k=2 under the new snapshot and then lock them one by one (see the sketch below).
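As a toy illustration of that retry (my own sketch, not TiDB code): re-scan under the new snapshot, lock newly matching rows, and release locks on rows that no longer match, which is exactly the "release locks not needed in the retried execution" behavior described in the client-go comment quoted earlier.

```go
package sketch

// retryAfterConflict: after `update t set a=3 where k=2` is locked with
// conflict, re-scan the rowids matching k=2 under the new snapshot; lock
// rows that are newly matched and release locks on rows that dropped out.
func retryAfterConflict(
	prevLocked map[int64]bool,
	scanRowIDsAt func(ts uint64) []int64,
	newForUpdateTS uint64,
) (toLock, toRelease []int64) {
	current := scanRowIDsAt(newForUpdateTS) // re-query under new snapshot
	seen := make(map[int64]bool, len(current))
	for _, id := range current {
		seen[id] = true
		if !prevLocked[id] {
			toLock = append(toLock, id) // row committed while we waited
		}
		// rows in both sets: lock already held, the lock RPC is skipped
	}
	for id := range prevLocked {
		if !seen[id] {
			toRelease = append(toRelease, id) // stale lock to clear
		}
	}
	return toLock, toRelease
}
```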

For tables with a unique index:
select * from t where indexKey=1 for update; or
select * from t where id=1 for update;
I'm not sure whether TiDB seeks the row data first and then locks the index and rowid, or locks the index and rowid first and then seeks the row data.
If it's the former, then the seek indeed needs to be retried.
If it's the latter, a retry seems unnecessary.

I wonder if my guessed scenarios are correct.

| username: xfworld | Original post link

If you can understand the detailed flowchart about locking in the official documentation, you will get it. Stop guessing.

The difficulty of distributed locks is not in a single node but in maintaining the associated state across multiple nodes:

  • the difference between writing a single row and writing a batch of rows;
  • the difference between updating a single row and updating a batch (with more sub-cases here, such as A and B executing sequentially vs. A and B executing simultaneously);
  • deleting data while querying it, possibly while new data is inserted at the same time… and possibly with multiple people operating concurrently…

| username: ylldty | Original post link

Could you please advise whether there is any official documentation that explains the locking process for statements like select ... for update, delete, update, and insert when they touch a single row or multiple rows? It would be best if it included examples.
Thanks a lot.

| username: xfworld | Original post link

Practice leads to true knowledge. Set up an environment and simulate it to find out.

| username: 有猫万事足 | Original post link

Now the mystery is solved: in TiDB this is called fair locking, but in client-go, the client module responsible for communicating with TiKV, fair locking is called aggressive locking.

| username: ylldty | Original post link

Yes, presumably it will be renamed to Fair Locking there later as well, but it hasn't been changed yet.

| username: xfworld | Original post link

Not surprising, previous versions also had discrepancies in the same configuration names. I've gotten used to it… :see_no_evil: