How does TiDB ensure that the delete range scenario does not mistakenly delete ranges that might still be written to?

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB是如何保证delete range场景下不误删还有可能写入的range的?

| username: Yriuns

I thought of a corner case and looked at the TiDB code but didn’t find any special handling for it, which makes me very curious.

Let’s assume there is a write transaction T and a GC leader.

We know the execution sequence of an optimistic transaction is:

  1. Get start ts
  2. Perform read and write operations in TiDB memory
  3. Concurrently prewrite primary key and secondary keys
  4. Get commit ts
  5. Commit primary key
  6. Commit secondary keys
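
The six steps above can be sketched as a toy model (illustrative Python only; `Tso`, `OptimisticTxn`, and the in-memory `store` are made-up names, not TiDB's actual API — real prewrite/commit are RPCs to TiKV):

```python
import itertools

class Tso:
    """Toy timestamp oracle: hands out monotonically increasing integers."""
    def __init__(self):
        self._counter = itertools.count(1)

    def next_ts(self):
        return next(self._counter)

class OptimisticTxn:
    def __init__(self, tso, store):
        self.tso = tso
        self.store = store                 # shared dict: key -> (value, commit_ts)
        self.start_ts = tso.next_ts()      # step 1: get start ts
        self.buffer = {}                   # step 2: reads/writes stay in memory

    def put(self, key, value):
        self.buffer[key] = value

    def commit(self):
        keys = list(self.buffer)
        primary, secondaries = keys[0], keys[1:]
        locks = {k: primary for k in keys}     # step 3: prewrite locks every key,
                                               # each lock pointing at the primary
        commit_ts = self.tso.next_ts()         # step 4: get commit ts
        for k in [primary] + secondaries:      # step 5 then 6: primary commits first
            self.store[k] = (self.buffer[k], commit_ts)
            del locks[k]
        return commit_ts

store = {}
txn = OptimisticTxn(Tso(), store)
txn.put("a", 1)
txn.put("b", 2)
ts = txn.commit()
```

The point of the ordering is that once the primary key is committed, the whole transaction is logically committed; secondary locks can always be resolved by looking at the primary — which is exactly what the corner case below attacks.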

And the execution sequence of the GC leader is:

  1. Calculate safe point: min(current time - gc_life_time, the minimum start_ts of active transactions on each TiDB node)
  2. Execute scan locks request
  3. Execute resolve locks request
  4. Execute delete range
  5. Clean up MVCC
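
The safe-point formula in step 1 can be sketched as follows (illustrative Python; timestamps are plain integers here, whereas real TiDB uses TSO timestamps, and the active `start_ts` values are collected from every TiDB node):

```python
def calc_safe_point(now, gc_life_time, active_start_tss):
    """Safe point = min(now - gc_life_time, smallest start_ts among
    transactions still registered as active on any TiDB node)."""
    candidate = now - gc_life_time
    if active_start_tss:
        candidate = min(candidate, min(active_start_tss))
    return candidate

# An active transaction holds the safe point back:
held_back = calc_safe_point(now=1000, gc_life_time=100, active_start_tss=[950, 880])
# A killed session's start_ts is no longer counted -- the gap the question hinges on:
not_held = calc_safe_point(now=1000, gc_life_time=100, active_start_tss=[])
```

Note how, once the session is killed, nothing holds the safe point back any more even though the transaction's prewrite requests may still be in flight.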

Assume the user drops a certain index idx, and the online DDL has progressed to the none state, but T’s schema is still in the delete only state.

Is there a possible sequence of events:
  0. T chooses a key in the range corresponding to idx as the primary key
  1. T’s prewrite requests for the primary and secondary keys get stuck in the network and time out, and the session is killed, so T.start_ts is no longer counted toward the safe point
  2. GC leader calculates the safe point as current time - gc_life_time
  3. GC leader executes scan locks, which obviously cannot see the still-in-flight locks from step 1
  4. GC leader executes resolve locks, which obviously does not handle the locks from step 1
  5. T’s prewrite request reaches TiKV and locks successfully
  6. GC leader deletes the range corresponding to idx, i.e., the primary key is deleted
  7. When resolving the secondary key, it finds that the primary key cannot be found and does not know whether to commit or roll back
| username: xfworld | Original post link

MVCC is enough.

Deleting produces one version and writing produces another, so there is no conflict…

| username: Yriuns | Original post link

Thank you for the reply, but the scenario I mentioned is delete range, where all versions will be deleted.

| username: Fly-bird | Original post link

Transaction consistency?

| username: xfworld | Original post link

Whether it is deletion or writing, if you leave transactions and locks out of the picture, how can the scenario you describe actually arise?

MySQL can do it, so why can’t TiDB… :upside_down_face:

| username: Yriuns | Original post link

The scenario I described is this: after the user initiates a drop index, TiDB executes delete range in the background, while a concurrent transaction T runs with a timing that also conforms to TiDB’s current implementation. If you think it cannot happen, please point out which step is wrong, rather than being sarcastic.

| username: Yriuns | Original post link

Could you please be more specific?

| username: xfworld | Original post link

drop index → DDL?
delete range data → DML?

Sorry, could you be more specific? PS: Where was the sarcastic tone?

| username: Yriuns | Original post link

DROP INDEX is a DDL operation, and delete range is part of the underlying implementation of DDL, not DML. See the documentation: “The ‘Delete Ranges’ phase quickly deletes obsolete data in whole ranges generated by operations such as DROP TABLE/DROP INDEX.”

In the example I provided, the DML is another regular write transaction T, whose schema might be one version older than the current schema.

Sorry, :upside_down_face: this emoji looks a bit uncomfortable.

| username: xfworld | Original post link

Now I understand what you want to know: the implementation mechanism from the F1 online schema-change paper.
This is quite involved, so I recommend some documents for you to read through slowly:

Additionally, TiDB has courses that describe these contents in more detail, which you can directly study:

Teacher Bao explains it in great detail.

Whether it is DDL or DML, everything ultimately operates on Regions; from the TiKV layer’s perspective it is all key-value operations.
So the data-processing path is the same, but the version management and intervention mechanisms differ… I can only say it is very complicated…

| username: Yriuns | Original post link

I understand the principles and implementation of F1, but I still think the timing of the corner case in my example is possible.

| username: xfworld | Original post link

This needs to be proven with examples, which is quite difficult.

All scenarios are based on assumptions, making it even harder to discuss…

| username: Yriuns | Original post link

Alright, thank you anyway~

| username: ljluestc | Original post link

Based on the described situation, such a sequence could indeed cause problems, especially when online DDL and optimistic transactions run concurrently: the DDL may delete an index range while an in-flight optimistic transaction still holds locks within it, leaving the GC leader unable to clean up the transaction correctly.

To resolve this issue, special case handling could be added to the GC Leader’s processing logic. Here is a possible solution approach:

When the GC Leader calculates the safe point, in addition to computing min(current time - gc_life_time, the minimum start_ts of active transactions on each TiDB node), it should also consider ongoing online DDL operations.

When the GC Leader performs scan locks, it needs to identify locks related to online DDL operations and ensure these locks are scanned and processed before the safe point. This might require maintaining metadata for online DDL operations and their related lock information.

When the GC Leader performs resolve locks, it needs to handle locks related to online DDL operations, even if their primary keys have been deleted. This might require special logic to determine how to handle these locks, such as committing or rolling back.

Ensure that the GC Leader does not ignore locks related to online DDL operations during the cleanup process.
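
The first suggestion above can be sketched like this (this is the commenter’s proposal, not TiDB’s actual implementation; `ddl_range_tss` is a hypothetical list of timestamps attached to in-flight DDL/delete-range work):

```python
def calc_safe_point_with_ddl(now, gc_life_time, active_start_tss, ddl_range_tss):
    """Commenter's proposal: besides active transactions, also hold the safe
    point back for timestamps of in-flight DDL/delete-range operations."""
    candidates = [now - gc_life_time] + list(active_start_tss) + list(ddl_range_tss)
    return min(candidates)

# An in-flight delete range holds the safe point back even with no active txns:
with_ddl = calc_safe_point_with_ddl(1000, 100, [], [870])
# With no DDL pending, the formula degenerates to the current behavior:
without_ddl = calc_safe_point_with_ddl(1000, 100, [950], [])
```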

| username: 有猫万事足 | Original post link

The description of Resolve Locks is as follows, and I think it can fully answer your question.

In essence, it checks whether the primary key has a commit_ts. If not, the transaction is considered incomplete, and its locks are cleaned up.

Put another way, I can rephrase your question as: if a large optimistic transaction cannot complete within the GC interval, will it never complete?
I think the answer is yes. For example, with the default GC life time of 10 minutes, a large optimistic transaction that cannot complete within 10 minutes will never complete unless the GC interval is changed. This also explains why, during large data imports with Lightning or long OLAP calculations with TiSpark, these TiDB components stop the advancement of the safepoint.
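
The Resolve Locks decision described above can be sketched as follows (illustrative only; `primary_commit_ts` stands for the result of reading the primary key’s write record in TiKV — in the question’s corner case that range has been deleted wholesale, so the lookup finds nothing at all):

```python
def resolve_lock(primary_commit_ts):
    """Decision for a secondary lock: if the primary key has a commit_ts, the
    transaction committed, so commit the secondary with the same commit_ts;
    otherwise the transaction is incomplete and the lock is rolled back."""
    if primary_commit_ts is not None:
        return ("commit", primary_commit_ts)
    return ("rollback", None)

committed = resolve_lock(42)      # primary committed at ts 42
cleaned_up = resolve_lock(None)   # no commit record found -> roll back
```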

| username: xfworld | Original post link

He wants to prove his conjecture and hypothesis, using TiDB to demonstrate this process. :blush:

| username: 有猫万事足 | Original post link

I think he wants to prove that there might be a simpler way.

With the code available, you can debug it and insert sleeps at chosen points to reproduce any interleaving.
Discussions do sometimes go astray because of gaps in description or understanding.

| username: ti-tiger | Original post link

Async commit.

Async commit is a feature that optimizes TiDB transaction performance. It allows the transaction to return a success result during the prewrite phase without waiting for the commit phase to complete. This reduces transaction latency and increases throughput.

The principle of async commit is that during the prewrite phase, a min_commit_ts is computed for each key and written into the key’s lock. This min_commit_ts is greater than the transaction’s start_ts and no smaller than any timestamp the TiKV node has already observed (its max_ts), which guarantees commit ts > start ts.

After the prewrite phase is completed, TiDB takes the largest min_commit_ts from the locks of all keys as the commit ts for the current transaction and returns it to the client. Once the client receives the commit ts, it can consider the transaction successfully committed.

In the background, TiDB continues to execute the commit phase, marking the locks of all keys as committed and writing their commit ts into the write CF. If another transaction encounters a key’s lock, it can determine from the information in the lock whether the key has been committed and read the key’s value at that commit ts.

With async commit, TiDB can still guarantee snapshot isolation (SI) and avoid some anomalies caused by network latency or GC. In the corner case you mentioned, even though T’s primary key has been deleted by GC and the secondary key cannot find the primary during resolution, with async commit the secondary key’s own lock carries the needed information (the min_commit_ts), so it can still correctly commit or roll back.
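
The commit-ts calculation can be sketched as follows (simplified Python; real TiKV also bumps its max_ts on reads and encodes timestamps differently — the `+ 1` and plain integers are simplifying assumptions):

```python
def prewrite_min_commit_ts(start_ts, node_max_ts):
    """During async-commit prewrite, each key's lock records a min_commit_ts
    exceeding both the transaction's start_ts and every timestamp the TiKV
    node has already observed (its max_ts)."""
    return max(start_ts, node_max_ts) + 1

def async_commit_ts(min_commit_tss):
    """The transaction's commit_ts is the maximum min_commit_ts across all
    keys, so it satisfies every key's lower bound."""
    return max(min_commit_tss)

ts_a = prewrite_min_commit_ts(start_ts=10, node_max_ts=15)  # node has seen ts 15
ts_b = prewrite_min_commit_ts(start_ts=10, node_max_ts=9)   # node has seen less
commit_ts = async_commit_ts([ts_a, ts_b])
```

Taking the maximum is what lets every key's lock, on its own, place a valid lower bound on the final commit ts — which is why a secondary lock is resolvable without the primary.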