Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: Why doesn't TiDB follow Oracle and treat the redo log write as the mark of a successful transaction commit?
Oracle considers a transaction successfully committed once the redo log has been written, while TiDB considers it successful only after the apply phase completes. Would it be better for TiDB to treat the transaction as committed as soon as the Raft log reaches the committed state, before it is applied to RocksDB?
TiDB is multi-node and needs to ensure consistency.
This is a distributed environment, so it's different.
The principles are different. TiDB controls distributed transactions using MVCC.
Oracle commits the transaction, and the latest data is stored in memory while waiting to be written to disk. A distributed database needs to synchronize the data to the majority of nodes.
TiDB’s distributed transactions ensure strong consistency. In the Raft log committed phase, the data is already persisted, meaning multiple replicas have received the Raft data and returned a message to the leader. However, the application layer transaction commit is only considered complete when the Raft log is applied to the RocksDB key-value store.
The difference between distributed and centralized systems becomes apparent immediately.
When Oracle commits, two things happen. To make the transaction visible to other sessions (linearizability), it updates the transaction table in the undo segment header, which lives in memory and is not persisted immediately. To make the transaction durable, it writes the redo log, which must reach disk, so that the instance can recover even after a crash.
When TiDB commits, it must wait for the apply to complete so that other sessions can see the change (linearizability). Before that, the Raft layer must commit the entry and write it to multiple nodes, so that no data is lost even if a single node crashes.
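To make the gap between "Raft committed" and "applied" concrete, here is a minimal sketch in Python. It is a toy model, not TiKV code: the `Replica` class, `mark_committed`, and `apply_committed` are hypothetical names made up for illustration. It only shows that an entry can already be durable on a majority while a read against the key-value state still returns the old value.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Toy Raft leader: a log, two indexes, and a key-value state machine."""
    log: list = field(default_factory=list)   # replicated (key, value) entries
    commit_index: int = 0                     # entries known to be on a majority
    applied_index: int = 0                    # entries applied to the KV store
    kv: dict = field(default_factory=dict)    # what readers actually see

    def propose(self, key, value):
        self.log.append((key, value))
        return len(self.log)                  # 1-based index of the new entry

    def mark_committed(self, index):
        # Hypothetical hook: called once a majority has persisted the entry.
        self.commit_index = max(self.commit_index, index)

    def apply_committed(self):
        # Apply every committed-but-unapplied entry to the state machine.
        while self.applied_index < self.commit_index:
            key, value = self.log[self.applied_index]
            self.kv[key] = value
            self.applied_index += 1

    def read(self, key):
        return self.kv.get(key)


leader = Replica()
leader.kv["balance"] = 100
idx = leader.propose("balance", 50)
leader.mark_committed(idx)
before_apply = leader.read("balance")   # durable, but not yet visible
leader.apply_committed()
after_apply = leader.read("balance")    # visible only after apply
print(before_apply, after_apply)
```

If the client were told "success" right after `mark_committed`, a following read could still see the old balance, which is the visibility problem described above.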
The architecture is the same.
My understanding is that once committed, data must not be lost, i.e., durability. Therefore, in a single-node Oracle setup, it is sufficient to write the local redo log by default. However, in a distributed TiDB setup, it is necessary to ensure that the majority of nodes have completed the write. This is because if the node that successfully wrote locally goes down, the other two nodes still have the old data, which cannot guarantee the durability of the commit.
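The majority argument can be sketched roughly like this. It is a simplified illustration in Python under the assumption of three replicas; the function names are hypothetical, and real Raft also depends on leader election rules that are left out here.

```python
def majority(n_replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return n_replicas // 2 + 1


def is_committed(acks: int, n_replicas: int) -> bool:
    """A write counts as committed once a majority has persisted it."""
    return acks >= majority(n_replicas)


def survives_single_failure(nodes_with_write: set, all_nodes: set) -> bool:
    """True if some node still holds the write after any one node crashes."""
    return all(nodes_with_write - {failed} for failed in all_nodes)


nodes = {"a", "b", "c"}

# Written only on the local node: losing that node loses the write.
local_only = survives_single_failure({"a"}, nodes)

# Written on a majority (2 of 3): any single crash leaves a copy.
on_majority = survives_single_failure({"a", "b"}, nodes)

print(majority(3), is_committed(1, 3), is_committed(2, 3))
print(local_only, on_majority)
```

With three replicas, one local write is not enough to call the commit durable, while two copies are.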
Oracle by default waits for the redo log to be written to disk before returning a successful commit. In a few non-interactive scenarios, you can use “commit write nowait” to achieve the asynchronous commit you mentioned.
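The difference between waiting and not waiting for the log write can be illustrated with a small Python sketch. This is not how Oracle implements LGWR; `ToyRedoLog` and its `wait` flag are hypothetical names that only mimic the idea of returning from commit before or after the record is forced to disk.

```python
import os
import tempfile


class ToyRedoLog:
    """Append-only log; commit(wait=True) forces the record to disk first."""

    def __init__(self, path):
        self._file = open(path, "ab")
        self.unsynced = 0   # records written but not yet fsynced

    def commit(self, record: bytes, wait: bool = True) -> str:
        self._file.write(record + b"\n")
        self.unsynced += 1
        if wait:
            self.sync()
            return "durable"
        # Returned early: a crash now could lose this record.
        return "acknowledged, not yet durable"

    def sync(self):
        self._file.flush()
        os.fsync(self._file.fileno())
        self.unsynced = 0

    def close(self):
        self.sync()
        self._file.close()


with tempfile.TemporaryDirectory() as tmp:
    log = ToyRedoLog(os.path.join(tmp, "redo.log"))
    first = log.commit(b"txn-1", wait=True)    # like the default commit
    second = log.commit(b"txn-2", wait=False)  # like a nowait commit
    pending = log.unsynced                     # records still at risk
    log.close()

print(first, "|", second, "|", pending)
```

The nowait path answers faster, but until the next sync there is a window in which an acknowledged transaction could be lost.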
In a distributed system, you have to wait for synchronization with other nodes.
This is a distributed environment.
Doesn’t Oracle also support transactions? If it supports transactions, doesn’t it have to follow MVCC? How does Oracle control transactions?
In fact, the redo log being written to the file by the LGWR process provides the conditions for transaction recovery. This corresponds to the commit of the RocksDB raft log. The apply process corresponds to Oracle block data being written back to the disk. So it seems like it should be doable.
The design concepts are different, so there's no need to learn from it.
In TiDB, the equivalent step should be the write to the WAL of the RocksDB instance that stores the data.
So how should we understand the parameter enable-async-apply-prewrite? If it is set to true, does it mean we don’t have to wait for apply? Will it affect read consistency?