Why can't a user's transaction return as successfully committed even though the Raft log is committed? Shouldn't the write operation have data in memory and be readable, allowing a successful commit to be returned to the client?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 为什么raft log都committed了。用户transaction还不能返回提交成功。写的操作不应该是在内存中有数据么?此时虽然没有apply到kv里面。但是应该在内存中应该可以读到了。那么就应该可以给客户端返回提交成功了吧?

| username: TiDBer_ZxWlj6A1

Why can’t the user transaction return a commit success even though the raft log is committed? Shouldn’t the write operation have data in memory? Although it hasn’t been applied to the KV yet, it should be readable in memory at this point. So it should be able to return a commit success to the client, right?

| username: zhaokede | Original post link

In a distributed system based on the Raft consensus algorithm, even if log entries have been committed, it does not mean that the user’s transaction can immediately return a successful commit to the client. This is mainly because the design goal of the Raft algorithm is not only to ensure data safety (through consensus) but also to ensure data consistency and durability.

| username: tidb菜鸟一只 | Original post link

If the machine suddenly fails at this point, the data in memory is lost, so returning a successful commit would be incorrect, right? The data hasn’t actually been written to disk.

| username: 鱼跃龙门 | Original post link

This is to ensure data consistency and durability. If there is a crash at this time, the data in memory will be lost. It needs to be applied to RocksDB KV to be considered permanently successful.

| username: chenhanneu | Original post link

In a centralized database, once the redo log is committed the data can be queried, because the corresponding data is also held in memory and queries can be answered from there directly. Redo and undo logs are generally only used for recovery after a crash.

In a distributed database like TiDB, there are no redo and undo logs; it uses Raft logs instead, and queries are served from the KVDB. A write is considered successful once it is committed in Raft. However, if the data for a committed Raft log entry has not yet been written to the KVDB, it cannot be queried when the leader switches or when the query goes to a follower, since a follower does not hold the leader's in-memory state and cannot answer directly from the Raft log.
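To make the gap between "committed" and "applied" concrete, here is a minimal toy sketch. `ToyReplica` and all of its fields are made-up names for illustration, not TiKV's real data structures; the only assumption taken from this thread is that reads are served from the KV store and never from the Raft log.

```python
from dataclasses import dataclass, field


@dataclass
class ToyReplica:
    """Hypothetical model of one Region replica (illustration only)."""
    raft_log: list = field(default_factory=list)  # persisted entries (stands in for RocksDB raft)
    commit_index: int = 0                          # highest index known to be on a majority
    applied_index: int = 0                         # highest index applied to the KV store
    kv: dict = field(default_factory=dict)         # stands in for RocksDB KV

    def append(self, key, value):
        """Persist an entry to the log and return its 1-based index."""
        self.raft_log.append((key, value))
        return len(self.raft_log)

    def mark_committed(self, index):
        """Advance the commit index, never past the end of the log."""
        self.commit_index = max(self.commit_index, min(index, len(self.raft_log)))

    def apply_committed(self):
        """Apply every committed-but-unapplied entry to the KV store, in order."""
        while self.applied_index < self.commit_index:
            key, value = self.raft_log[self.applied_index]
            self.kv[key] = value
            self.applied_index += 1

    def read(self, key):
        """Reads consult only the KV store, never the raft log."""
        return self.kv.get(key)


replica = ToyReplica()
idx = replica.append("k1", "v1")
replica.mark_committed(idx)
before_apply = replica.read("k1")  # committed but not applied yet
replica.apply_committed()
after_apply = replica.read("k1")
```

In this sketch `before_apply` is `None` and `after_apply` is `"v1"`: the entry is durable in the log as soon as it is committed, but it only becomes readable after apply.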

For reference only.

| username: DBAER | Original post link

The Raft consensus algorithm requires a majority, meaning the majority must persist; otherwise, data might be lost if a failure occurs.
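As a rough sketch of what "the majority must persist" means: the leader can treat a log index as committed once a majority of replicas have persisted it. The function name `majority_commit_index` is made up for this example, and real Raft additionally requires the entry to belong to the leader's current term, which is left out here.

```python
def majority_commit_index(match_indexes):
    """Return the highest log index persisted on a majority of replicas.

    match_indexes holds, per replica (leader included), the highest log
    index that replica has persisted.
    """
    if not match_indexes:
        raise ValueError("need at least one replica")
    quorum = len(match_indexes) // 2 + 1
    # After sorting in descending order, the value at position quorum - 1
    # has been reached by at least `quorum` replicas.
    return sorted(match_indexes, reverse=True)[quorum - 1]


three_replicas = majority_commit_index([7, 7, 4])       # two of three replicas hold index 7
five_replicas = majority_commit_index([9, 8, 5, 5, 3])  # only index 5 is on three of five
```

With three replicas at indexes 7, 7 and 4 the result is 7, so one slow or failed replica does not block the commit, while any single failure still leaves a persisted copy.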

| username: zhanggame1 | Original post link

The Raft log has to be committed on a majority of the replicas.

| username: TiDBer_ZxWlj6A1 | Original post link

That’s not right. The raft log is written to RocksDB raft, so it should be persisted, just not applied to RocksDB KV yet.

| username: TiDBer_ZxWlj6A1 | Original post link

The Raft log written to RocksDB Raft should be persisted, but it has not been applied to RocksDB KV yet. Moreover, it has already been replicated to the RocksDB Raft on other nodes.

| username: 有猫万事足 | Original post link

What you describe should not be a problem, and the replies above about the Raft algorithm are not wrong either.

Rather, TiDB has made exactly this optimization: it supports asynchronous commit. The asynchronous commit parameter is set to “on” by default, unless the cluster was upgraded from a version below 5.0.

There are also some columns introducing related content for reference.

Additionally, there is another parameter:

which allows enabling the one-phase commit feature for transactions involving only one Region.

These are the two main parameters related to transaction commit.

| username: TIDB-Learner | Original post link

Learn about the “D” in ACID.

| username: TiDBer_ZxWlj6A1 | Original post link

I understand… It seems that writing logs in this database does not necessarily mean they must be applied to the KV store.

| username: ziptoam | Original post link

The transaction should be considered complete only after it is persisted to RocksDB.

| username: shixiaotuo | Original post link

Makes sense.

| username: TiDBer_jYQINSnf | Original post link

There isn’t just a single TiDB node. Once success is returned to the client, other nodes should also be able to read the data normally. But if it hasn’t been applied, other nodes can’t read it, and it isn’t in their memory either. That doesn’t seem right.

| username: WinterLiu | Original post link

A transaction is considered committed only after the majority of nodes have applied it.

| username: TiDBer_ZxWlj6A1 | Original post link

In TiKV, other nodes that want to read this Region will definitely connect to the TiKV where the Region leader is located. So the leader node should be able to serve the read from memory.

| username: TiDBer_ZxWlj6A1 | Original post link

In fact, as soon as the apply on the Region leader’s TiKV is completed, the commit returns.
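A toy sketch of that ordering, assuming the leader replies to the client once it has applied the entry itself, without waiting for followers to apply. `ToyLeader` and its methods are hypothetical names, not TiKV's API.

```python
class ToyLeader:
    """Hypothetical leader for one Region (illustration only)."""

    def __init__(self, replica_count):
        self.replica_count = replica_count
        self.log = []          # entries persisted on the leader
        self.acks = {}         # log index -> number of replicas that persisted it
        self.applied_index = 0
        self.kv = {}           # stands in for the leader's RocksDB KV
        self.pending = {}      # log index -> client callback

    def propose(self, key, value, on_done):
        """Persist the entry locally and remember whom to notify."""
        self.log.append((key, value))
        index = len(self.log)
        self.acks[index] = 1   # the leader's own persisted copy
        self.pending[index] = on_done
        self._apply_ready_entries()
        return index

    def follower_persisted(self, index):
        """A follower reports that it has persisted the entry at `index`."""
        self.acks[index] += 1
        self._apply_ready_entries()

    def _is_committed(self, index):
        return self.acks.get(index, 0) > self.replica_count // 2

    def _apply_ready_entries(self):
        # Apply in log order; stop at the first entry without a majority.
        while (self.applied_index < len(self.log)
               and self._is_committed(self.applied_index + 1)):
            index = self.applied_index + 1
            key, value = self.log[index - 1]
            self.kv[key] = value
            self.applied_index = index
            self.pending.pop(index)(index)  # reply to the client only now


replies = []
leader = ToyLeader(replica_count=3)
idx = leader.propose("k1", "v1", replies.append)
replied_before_quorum = list(replies)  # nothing yet: only the leader has the entry
leader.follower_persisted(idx)         # majority reached -> apply -> reply
```

Here the client callback fires only after the leader has both a majority of persisted copies and its own apply done, so a read on the leader right after the reply finds the value in `kv`.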

| username: WinterLiu | Original post link

Oh, right, that’s it.

| username: TiDBer_jYQINSnf | Original post link

Read requests do not parse the raft log; they read directly from rocksdb-kv. The cache involved in the transaction’s two-phase commit lives on the TiDB node and holds the writes before commit. Therefore, another TiDB node cannot read a change that has not been applied yet.
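The point about the TiDB-side cache can be sketched as follows. `ToyTidbTxn` is a made-up class, the shared dict stands in for data already applied to rocksdb-kv, and timestamps and snapshot isolation are ignored entirely.

```python
class ToyTidbTxn:
    """Hypothetical transaction on one TiDB node (not TiDB's real API)."""

    def __init__(self, shared_store):
        self.shared_store = shared_store  # stands in for applied data in rocksdb-kv
        self.local_buffer = {}            # uncommitted writes, private to this node

    def put(self, key, value):
        """Buffer the write locally; nothing leaves this node yet."""
        self.local_buffer[key] = value

    def get(self, key):
        """A transaction sees its own buffered writes, then the shared store."""
        if key in self.local_buffer:
            return self.local_buffer[key]
        return self.shared_store.get(key)

    def commit(self):
        # In the real system this step is the two-phase commit plus Raft
        # replication and apply; here it is collapsed into one dict update.
        self.shared_store.update(self.local_buffer)
        self.local_buffer.clear()


store = {}
writer = ToyTidbTxn(store)  # transaction on one TiDB node
reader = ToyTidbTxn(store)  # transaction on another TiDB node
writer.put("k1", "v1")
own_read = writer.get("k1")           # served from the writer's local buffer
other_read_before = reader.get("k1")  # the other node sees nothing yet
writer.commit()
other_read_after = reader.get("k1")
```

Only the writing node can see the buffered value before commit; the other node sees it only after the change has reached the shared store.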