Questions about the Commit Phase in TiDB 2PC

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 关于TiDB 2PC中commit 阶段疑问

| username: yuqi1129

During the commit phase, TiDB clears the lock and writes the version data. Can atomicity be guaranteed across these two operations?

The TiDB encoding is roughly as follows:

Data column:
${key}${start_ts} → ${value}
Version column:
${key}${commit_ts} → ${start_ts}
Lock column:
${key} → ${start_ts, primary_key, …etc}

In the commit phase, the primary will clear the lock information in the lock column and write commit_ts into the version column. These are two different rows, and RocksDB by itself cannot guarantee the atomicity of the two operations (the two rows may fall into different Regions; atomicity can be guaranteed within a Region, but not across Regions). Can any expert help clarify this?
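
For concreteness, here is a toy in-memory model of the three columns (a minimal sketch; all names are illustrative, not TiKV's actual code). The question is what makes the two mutations inside commit_primary atomic:

```rust
use std::collections::BTreeMap;

/// Toy model of the three Percolator-style columns.
struct Store {
    data: BTreeMap<(Vec<u8>, u64), Vec<u8>>, // (key, start_ts) -> value
    lock: BTreeMap<Vec<u8>, (u64, Vec<u8>)>, // key -> (start_ts, primary_key)
    write: BTreeMap<(Vec<u8>, u64), u64>,    // (key, commit_ts) -> start_ts
}

impl Store {
    /// The two commit-phase mutations whose atomicity is in question:
    /// clear the lock column and write the version (write) column.
    fn commit_primary(&mut self, key: &[u8], start_ts: u64, commit_ts: u64) {
        self.lock.remove(key);
        self.write.insert((key.to_vec(), commit_ts), start_ts);
    }
}

fn main() {
    let mut s = Store { data: BTreeMap::new(), lock: BTreeMap::new(), write: BTreeMap::new() };
    s.data.insert((b"k".to_vec(), 100), b"v".to_vec());
    s.lock.insert(b"k".to_vec(), (100, b"k".to_vec()));
    s.commit_primary(b"k", 100, 110);
    assert!(s.lock.is_empty() && s.write.contains_key(&(b"k".to_vec(), 110)));
}
```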

| username: WalterWj | Original post link

Yes. One row is arbitrarily selected to carry the primary lock, and the locks on the other rows are pointer locks, each of which is effectively a reference to the primary. After the primary lock is cleared, the other pointer locks are cleared asynchronously. When a query encounters a pointer lock, it checks whether the primary lock it points to still exists, so there is no problem across Regions and nodes.
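
A rough sketch of that read-side rule (a minimal sketch; the enum and return strings are illustrative, not TiKV's actual code):

```rust
/// The fate of a secondary (pointer) lock is decided entirely by
/// the state of the primary key it points to.
enum PrimaryState {
    Locked,         // primary lock still present: transaction undecided
    Committed(u64), // primary lock gone, commit record found at this commit_ts
    RolledBack,     // primary lock gone, no commit record
}

fn resolve_secondary(primary: PrimaryState) -> &'static str {
    match primary {
        PrimaryState::Locked => "wait for, or try to resolve, the primary lock",
        PrimaryState::Committed(_) => "roll the secondary forward with the primary's commit_ts",
        PrimaryState::RolledBack => "clean up the secondary lock",
    }
}

fn main() {
    println!("{}", resolve_secondary(PrimaryState::Committed(110)));
}
```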

| username: yuqi1129 | Original post link

The primary will clear the lock information of the lock column and write the commit_ts into the version column.

Since clearing the lock and writing the version column are not on the same row, they may not fall into the same region, so atomicity cannot be guaranteed, right?
Did I miss some context?

| username: WalterWj | Original post link

A pointer lock can be added on the other rows.

| username: yuqi1129 | Original post link

Sir, I was referring to the commit-phase modification of the primary key's lock and version columns.

I understand the pointer lock. As long as the primary's modification succeeds, the transaction is successful. The secondaries can be completed asynchronously, and even if they fail it doesn't matter, because there is a mark pointing to the primary.

In Percolator, the lock column and version column share the same row_key, so modifications to different column families of the same row are indeed atomic. In TiDB, I see that the lock column and version column use different row_keys. How can atomicity be ensured?

| username: TiDBer_小阿飞 | Original post link

Two-Phase Commit (2PC)

RocksDB Column Family Abbreviations:
D Column: rocksdb.defaultcf
L Column: rocksdb.lockcf
W Column: rocksdb.writecf

Before a transaction is committed, TiDB caches all the data. There have been instances where concurrent writes of hundreds of MBs of data caused TiDB to OOM.

  1. TiDB selects a Key from the current rows to be written as the Primary Key for the current transaction, and the remaining Keys are secondary.
  2. TiDB obtains the write routing information for all data from PD and categorizes all Keys according to the routing.
  3. TiDB initiates a prewrite request, writing the Primary Key and data to TiKV and locking it [before locking, it checks for write conflicts].

After successfully locking, the following operations are performed:
(1) Lock information is written to the L column, example: <1,(W,pk,1,100 … )>
(2) Row data is written to the D column, example: put<1_100,‘A Happy Family’>

  4. Then, the Secondary Keys concurrently initiate prewrite requests to all involved TiKV nodes, similar to the Primary Key process.
    The difference is that the lock information points to the Primary Key:
    (1) Lock information is written to the L column, example: <2,(W,@1,2,100 … )>

  5. TiDB receives confirmation that all prewrites are successful.

  6. TiDB obtains commit_ts from PD.

  7. TiDB initiates the second phase commit for the Primary Key on TiKV.
    (1) Writes to the W column, example: put<1_110,100>
    (2) Deletes from the L column, example: <1,(D,pk,1,100 … )>
    (3) Finally, cleans up the lock information.

  8. After the Primary Commit is successful, the Secondary can be committed asynchronously.

  9. TiDB receives confirmation that the two-phase commit is successful.

[Before locking, it checks for write conflicts] in step 3 refers to the following:

  • Checks the L column to see if another client has already locked it (Locking).
  • Checks the W column to see if there are any updates [startTs, +Inf) committed after the start time of this transaction (Conflict).

If a conflict occurs during prewrite, the current transaction rolls back.
If a conflict occurs during Primary Commit, the entire transaction rolls back.
https://zhuanlan.zhihu.com/p/596187185
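
As a toy illustration of the two prewrite conflict checks listed above (ordered maps stand in for the L and W columns; all names are illustrative, not TiKV's actual code):

```rust
use std::collections::BTreeMap;

struct Cf {
    lock: BTreeMap<Vec<u8>, u64>,         // L column: key -> lock owner's start_ts
    write: BTreeMap<(Vec<u8>, u64), u64>, // W column: (key, commit_ts) -> start_ts
}

fn prewrite_check(cf: &Cf, key: &[u8], start_ts: u64) -> Result<(), String> {
    // Check 1: has another client already locked this key?
    if let Some(owner) = cf.lock.get(key) {
        if *owner != start_ts {
            return Err(format!("locked by transaction {}", owner));
        }
    }
    // Check 2: is there a commit in [start_ts, +inf) for this key,
    // i.e. committed after this transaction's start time?
    if cf.write
        .range((key.to_vec(), start_ts)..=(key.to_vec(), u64::MAX))
        .next()
        .is_some()
    {
        return Err("write conflict: a newer committed version exists".into());
    }
    Ok(())
}

fn main() {
    let mut cf = Cf { lock: BTreeMap::new(), write: BTreeMap::new() };
    cf.write.insert((b"1".to_vec(), 110), 100); // committed at ts 110
    // A transaction that started at ts 105 conflicts with that commit:
    println!("{:?}", prewrite_check(&cf, b"1", 105));
}
```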

| username: yuqi1129 | Original post link

  7. TiDB initiates the second phase commit to the TiKV where the Primary Key is located
    (1) Write to column W, example: put<1_110,100>
    (2) Delete column L, example: <1,(D,pk,1,100 … )>
    (3) Finally, clean up the lock information

Is the entire step 7 atomic? What happens if (1) succeeds but (2) fails? Or are (1), (2), and (3) together one atomic operation? How is that atomicity implemented? The row_keys of columns W and L are different, so they can't be guaranteed to fall into the same Region, right?

| username: TiDBer_小阿飞 | Original post link

  • When data is written from memory to the TiKV nodes, TiKV uses three column families (CFs) to store the modified data (Default), lock information (Lock), and commit information (Write) separately.
  • During the write, the Default column family stores not only the table column value but also the timestamp. For example, for a column value of 3 written at start_ts 100, TiKV stores it as 3_100.
  • Only the first row of the transaction is given a primary lock (pk) written into the Lock column family; the locks on the other rows point to this lock.
    Commit phase:
  • After obtaining the commit timestamp from PD, the commit information is first written into the Write CF, e.g. put<3_110,100>, where 3 is the key, 110 is the commit timestamp, and 100 is the start timestamp.
  • The lock cleanup is written into the Lock CF, i.e., <3,(D,pk,3,100…)>, where D indicates Delete.
    Across these three column families there is essentially only one lock, and the entire process is completed in a single write.
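
In RocksDB, mutations to different column families can be grouped into a single WriteBatch and applied atomically, which is one way to realize "completed in a single write". A minimal sketch using the rust-rocksdb crate (the path, CF names, and keys are illustrative; Raft replication inside a Region is glossed over):

```rust
use rocksdb::{Options, WriteBatch, DB};

fn main() -> Result<(), rocksdb::Error> {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    opts.create_missing_column_families(true);
    let db = DB::open_cf(&opts, "/tmp/cf-demo", ["default", "lock", "write"])?;

    let cf_lock = db.cf_handle("lock").unwrap();
    let cf_write = db.cf_handle("write").unwrap();

    // Both commit-phase mutations go into one batch: they are
    // applied together or not at all.
    let mut batch = WriteBatch::default();
    batch.put_cf(cf_write, b"3_110", b"100"); // commit info -> Write CF
    batch.delete_cf(cf_lock, b"3");           // lock cleanup -> Lock CF
    db.write(batch)?; // atomic across column families

    Ok(())
}
```
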
| username: heiwandou | Original post link

There are still some differences in two-phase commit.

| username: yuqi1129 | Original post link

I understand, it’s still using the mechanism of atomic updates with different column families in the same row.

| username: yuqi1129 | Original post link

Let me ask one more question. RocksDB currently does not support storing multiple versions of a value in one CF the way HBase does. How does the write column you mentioned store multiple versions? Is all the version data written together and read back all at once? If so, how do you quickly find the readable version for a given read snapshot?

| username: TiDBer_小阿飞 | Original post link

The underlying layer of RocksDB is essentially a bunch of key-value pairs. In TiDB, the multi-version data storage in TiKV is implemented through MVCC (Multi-Version Concurrency Control) by adding version numbers to the keys. For example:

Before MVCC:
Key1 → Value
Key2 → Value

After MVCC, the arrangement of keys in TiKV is as follows:
Key1_Version3 → Value
Key1_Version2 → Value
Key1_Version1 → Value
……
Key2_Version3 → Value
Key2_Version2 → Value
Key2_Version1 → Value
……

For multiple versions of the same key, larger version numbers are placed in front and smaller ones behind. To read a value with a given Key + Version, the user constructs the MVCC key Key_Version and then uses RocksDB's SeekPrefix(Key_Version) API to locate the first position greater than or equal to this Key_Version.
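
A minimal sketch of that lookup, with an ordered map standing in for RocksDB's sorted keyspace (keys here are (key, !version) tuples rather than the real encoded bytes; illustrative only):

```rust
use std::collections::BTreeMap;

fn main() {
    // Versions are keyed as (key, !version) so that, as described
    // above, larger versions sort in front of smaller ones.
    let k1: &[u8] = b"Key1";
    let mut store = BTreeMap::new();
    store.insert((k1, !3u64), "value@3");
    store.insert((k1, !2u64), "value@2");
    store.insert((k1, !1u64), "value@1");

    // Read Key1 at snapshot version 2: seek to the first entry
    // >= (Key1, !2). Thanks to the inverted ordering, that is the
    // newest version with version <= 2.
    let snapshot = 2u64;
    let hit = store
        .range((k1, !snapshot)..)
        .next()
        .filter(|((k, _), _)| *k == k1);
    assert_eq!(hit.map(|(_, v)| *v), Some("value@2"));
}
```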

When users know they need a longer read duration, for example when using Mydumper for full backups (Mydumper backs up consistent snapshots), they can adjust the tikv_gc_life_time value in the mysql.tidb table in TiDB to extend the MVCC version retention time. Note that tikv_gc_life_time takes effect globally and immediately: increasing it extends the lifespan of all current snapshots, while decreasing it immediately shortens the lifespan of all snapshots. Too many MVCC versions will slow down TiKV's processing.

| username: TiDBer_小阿飞 | Original post link

How is version information represented when storing multiple versions of data for each key?
In TiKV, we simply combine the key and timestamp into an internal key to store in RocksDB. Below is the content of each CF:
CF_DEFAULT: (key, start_ts) → value
CF_LOCK: key → lock_info
CF_WRITE: (key, commit_ts) → write_info
The key and timestamp are combined as follows:
  1. Encode the user key in a memcomparable format;
  2. Invert the bits of the timestamp, then encode it in big-endian format;
  3. Append the encoded timestamp to the encoded key.
For example, the key “key1” with timestamp 3 is encoded as “key1\x00\x00\x00\x00\xfb\xff\xff\xff\xff\xff\xff\xff\xfc” (the trailing 8 bytes are the big-endian bit-inversion of 3). This way, different versions of the same key are adjacent in RocksDB, with larger (newer) versions sorted in front of older ones.
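
Following that recipe, a simplified re-implementation of the encoding for illustration (not TiKV's actual code):

```rust
/// Memcomparable encoding of the user key: 8-byte groups, each
/// zero-padded and followed by a marker byte 0xFF - <pad count>.
fn encode_bytes(key: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    for chunk in key.chunks(8) {
        let pad = 8 - chunk.len();
        out.extend_from_slice(chunk);
        out.extend(std::iter::repeat(0u8).take(pad));
        out.push(0xFF - pad as u8);
    }
    if key.len() % 8 == 0 {
        // A full group of padding, so that e.g. "abcdefgh" still
        // sorts before "abcdefgh\x00".
        out.extend_from_slice(&[0u8; 8]);
        out.push(0xFF - 8);
    }
    out
}

/// Append the bit-inverted timestamp in big-endian order, so that
/// larger (newer) timestamps sort before smaller (older) ones.
fn encode_key_ts(key: &[u8], ts: u64) -> Vec<u8> {
    let mut out = encode_bytes(key);
    out.extend_from_slice(&(!ts).to_be_bytes());
    out
}

fn main() {
    // "key1" + timestamp 3:
    // 6b 65 79 31 00 00 00 00 fb | ff ff ff ff ff ff ff fc
    println!("{:02x?}", encode_key_ts(b"key1", 3));
}
```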

| username: andone | Original post link

Sure.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.