Why doesn't TiDB follow Oracle's approach of using redo log writes as a marker for transaction commit success?

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIDB为什么不学习oracle以redo log写入为事务commit成功的标志

| username: vincentLi

Oracle considers a transaction successfully committed once the redo log is written to disk, while TiDB considers it successful only after the apply phase completes. Would it be better to treat the transaction as committed at the end of the Raft log committed phase, before the apply to RocksDB?

| username: 哈喽沃德 | Original post link

TiDB is multi-node and needs to ensure consistency.

| username: dba远航 | Original post link

This is a distributed environment; it’s different.

| username: 随缘天空 | Original post link

The principles are different. TiDB controls distributed transactions using MVCC.

| username: zhanggame1 | Original post link

Oracle commits the transaction, and the latest data is stored in memory while waiting to be written to disk. A distributed database needs to synchronize the data to the majority of nodes.

| username: linnana | Original post link

TiDB’s distributed transactions ensure strong consistency. By the end of the Raft log committed phase, the data is already durable: a majority of replicas have persisted the Raft entry and acknowledged the leader. However, the application-layer transaction commit is only considered complete once the Raft log is applied to the RocksDB key-value store.
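The two stages described above can be sketched in a few lines. This is a minimal illustration, not TiKV's actual code; all names here are made up. A Raft entry is "committed" once a majority has persisted it, but the client only sees "commit succeeded" after the apply step writes it into the key-value store:

```python
# Illustrative sketch of Raft commit vs. apply (not real TiKV APIs).

class Replica:
    def __init__(self):
        self.log = []          # persisted Raft log (stand-in for the raft engine)
        self.kv = {}           # applied state (stand-in for RocksDB)

    def append(self, entry):   # an fsync of the Raft log would happen here
        self.log.append(entry)
        return True            # ack back to the leader

def propose(leader, followers, key, value):
    entry = (key, value)
    acks = 1 if leader.append(entry) else 0          # leader persists locally
    acks += sum(f.append(entry) for f in followers)  # replicate to followers
    quorum = (len(followers) + 1) // 2 + 1           # majority of the group
    if acks < quorum:
        return False   # not committed: the entry could be lost on failover
    # Stage 1 done: the entry is *committed* (durable on a majority).
    # Stage 2: apply it to the state machine; only now can reads see it.
    leader.kv[key] = value
    return True        # the client is told "commit succeeded" here

leader, followers = Replica(), [Replica(), Replica()]
assert propose(leader, followers, "k", "v")
assert leader.kv["k"] == "v"
```

The point of the separation: durability is settled at commit time (stage 1), but visibility to other sessions requires apply (stage 2), which is why TiDB waits for it.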

| username: Kongdom | Original post link

The difference between distributed and centralized systems becomes apparent immediately.

| username: 小龙虾爱大龙虾 | Original post link

When Oracle commits, to ensure linearizability, it needs to modify the undo segment header transaction table (in memory, not immediately persisted) so the transaction becomes visible to other sessions, and write the redo (which must reach disk) to ensure durability, so the instance can recover even after a crash.
When TiDB commits, to ensure linearizability, it needs to wait for the apply to complete so that other sessions can see the change. Before that, it must ensure the Raft layer has committed the write to a majority of nodes, so the data is not lost even if a single node crashes.

| username: TIDB-Learner | Original post link

RocksDB with Raft.

| username: wangccsy | Original post link

The architecture is the same.

| username: 江湖故人 | Original post link

My understanding is that once committed, data must not be lost, i.e., durability. Therefore, in a single-node Oracle setup, it is sufficient to write the local redo log by default. However, in a distributed TiDB setup, it is necessary to ensure that the majority of nodes have completed the write. This is because if the node that successfully wrote locally goes down, the other two nodes still have the old data, which cannot guarantee the durability of the commit.
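The durability argument above can be checked mechanically. This is my own sketch, not TiDB code: in a 3-replica group, any two majorities intersect, so a write acknowledged by a majority survives the writer's crash, whereas a write persisted only locally does not:

```python
# Why a majority write survives a single-node crash: quorum intersection.
from itertools import combinations

nodes = {"n1", "n2", "n3"}
# All majorities of a 3-node group (any 2 or all 3 nodes).
majorities = [set(c) for r in (2, 3) for c in combinations(sorted(nodes), r)]

# Every pair of majorities shares at least one node, so a new leader
# elected by a majority always contacts a replica holding the write.
assert all(a & b for a in majorities for b in majorities)

# By contrast, if only the local node n1 persisted the write,
# losing n1 leaves the other two nodes with only the old data.
write_set = {"n1"}
survivors = nodes - {"n1"}
assert not (write_set & survivors)
```

This is exactly why a local redo-style write is enough for single-node Oracle but not for a 3-replica TiDB group.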

| username: 江湖故人 | Original post link

Oracle by default waits for the redo log to be written to disk before returning a successful commit. In a few non-interactive scenarios, you can use “commit write nowait” to achieve the asynchronous commit you mentioned.

| username: TiDBer_lBAxWjWQ | Original post link

In a distributed system, you have to wait for synchronization with other nodes.

| username: kelvin | Original post link

This is a distributed environment.

| username: 胡杨树旁 | Original post link

Doesn’t Oracle also support transactions? If it supports transactions, doesn’t it have to follow MVCC? How does Oracle control transactions?

| username: vincentLi | Original post link

In fact, the redo log being written to disk by the LGWR process is what makes transaction recovery possible. That corresponds to the commit of the Raft log. The apply step corresponds to Oracle writing dirty blocks back to disk. So it seems like it should be doable.

| username: 路在何chu | Original post link

The design concepts are different, there’s no need to learn from it.

| username: zhanggame1 | Original post link

In TiDB, that should correspond to writing the RocksDB WAL, which is where the data is persisted.

| username: TiDBer_lmKxZw6J | Original post link

So how should we understand the parameter enable-async-apply-prewrite? If it is set to true, does it mean we don’t have to wait for apply? Will it affect read consistency?
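For reference, this is a TiKV configuration item rather than a TiDB one. As a sketch only (the section placement, default, and exact semantics should be verified against the configuration reference for your TiKV version), it would be set along these lines:

```toml
# Placement sketch; check your TiKV version's config reference before use.
[storage]
# The name suggests the prewrite can be acknowledged before its Raft apply
# finishes. Raft-level majority persistence is still required either way,
# so committed data remains durable.
enable-async-apply-prewrite = true
```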
