Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: 为什么raft log都committed了。用户transaction还不能返回提交成功。写的操作不应该是在内存中有数据么?此时虽然没有apply到kv里面。但是应该在内存中应该可以读到了。那么就应该可以给客户端返回提交成功了吧?
Why can’t the user transaction return a commit success even though the raft log is committed? Shouldn’t the write operation have data in memory? Although it hasn’t been applied to the KV yet, it should be readable in memory at this point. So it should be able to return a commit success to the client, right?
In a distributed system based on the Raft consensus algorithm, even if log entries have been committed, it does not mean that the user’s transaction can immediately return a successful commit to the client. This is mainly because the design goal of the Raft algorithm is not only to ensure data safety (through consensus) but also to ensure data consistency and durability.
At this point, if the machine suddenly fails, the data in the memory will be lost, but returning a successful submission is incorrect, right? The data hasn’t actually been written to the disk.
This is to ensure data consistency and durability. If there is a crash at this time, the data in memory will be lost. It needs to be applied to RocksDB KV to be considered permanently successful.
After a centralized database commits the redo log, it can be queried because the corresponding data is also stored in memory, allowing queries to be returned directly. Generally, redo and undo logs are only used for recovery when the database crashes.
In a distributed database like TiDB, there are no redo and undo logs; instead, it uses Raft logs. Queries are based on the returns from the KVDB. If a write is committed to Raft, it is considered successful. However, if a leader node switch occurs or a query is made from a follower node (which does not have detailed memory information and cannot directly return based on Raft log information), and the data corresponding to the Raft log has not yet been written to the KVDB, the data committed to the Raft log cannot be queried.
For reference only.
The Raft consensus algorithm requires a majority, meaning the majority must persist; otherwise, data might be lost if a failure occurs.
Most of the raft logs should be committed.
That’s not right. The raft log is written to RocksDB raft, so it should be persisted, just not applied to RocksDB KV yet.
The Raft log written to RocksDB Raft should be persisted, but it has not been applied to RocksDB KV yet. Moreover, it has already been replicated to the RocksDB Raft on other nodes.
The situation you mentioned should not occur, and it’s not that the content about the Raft algorithm mentioned above is incorrect.
Instead, TiDB has made this optimization. It allows for asynchronous commit. Moreover, this asynchronous commit parameter is set to “on” by default unless it is upgraded from a version below 5.0.
There are also some columns introducing related content for reference.
Additionally, there is another parameter:
which allows enabling the one-phase commit feature for transactions involving only one Region.
These are the two main parameters related to transaction commit.
Learn about the “D” in ACID.
I understand… It seems that writing logs in this database does not necessarily mean they must be applied to the KV store.
The transaction should be considered complete only after it is persisted to RocksDB.
It’s not just a single TiDB node. If the client returns success, other nodes should also be able to read it normally. However, if it hasn’t been applied, other nodes can’t read it, and it’s not in memory either. This doesn’t seem right.
A transaction is considered committed only after the majority of nodes have applied it.
In TiKV, other nodes that want to read this region will definitely connect to the TiKV where the primary region is located. So this primary node should be able to read it from memory.
In fact, as long as the main region’s TiKV apply is completed, it will commit and return.
Read requests do not parse the raft log. Read requests directly read from rocksdb-kv. The cache mentioned in the two-phase commit of transactions is cached on the TiDB node before commit. Therefore, if another TiDB node reads an unapplied change, it will not be able to read it.