[TiDB Usage Environment] Production Environment
[TiDB Version] v6.1.7
[Encountered Problem: Phenomenon and Impact] Traditional MySQL master-slave replication relies on the binlog, so when the write load on the master increases, replaying the binlog on the slave becomes a bottleneck. This can lead to a situation where, after a write completes successfully on the master, an immediate read on the slave does not find the data. In TiDB's architecture, however, the upper-layer TiDB server is stateless, and a successful write means the data has been written to TiKV's RocksDB WAL, persisted to disk, and committed through the Raft mechanism. So does this really mean that within a TiDB cluster there will never be a situation where a successful write is immediately followed by a read that cannot find the data?
A write is handled by the Region leader: the data is written to TiKV's RocksDB, and the log is written and persisted, all on the leader. Only when the write succeeds is a success result returned to the client. When you query immediately after a successful write, you will definitely read the latest version of the data from the leader's MVCC. Moreover, TiDB's architecture requires intra-cluster network latency within 2ms, so after the write completes the data reaches the replicas within about 2ms. Normally, my company does not set up a master-slave structure for TiDB. The TiDB architecture itself can tolerate the failure of any single server, so there is no need to set up an additional replica server.
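As a minimal sketch of this read-your-own-write behavior (the table and column names here are illustrative, not from the original post):

```sql
-- Once the INSERT returns success, it has been committed through Raft
-- on the Region leader, so the very next read in the same session sees it.
INSERT INTO t (id, c) VALUES (1, 0);
SELECT c FROM t WHERE id = 1;  -- reads the row just written from the leader's MVCC
```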
TiDB is based on the Raft protocol. Under normal circumstances, PD's scheduling and the Raft replication mechanism ensure data safety and consistency. Writes and consistent reads are handled by the Leader replica. One point worth clarifying: a write is only acknowledged to the client after it has been committed on a majority of replicas, and Raft only elects a new Leader from replicas whose logs are up to date, so a committed write cannot disappear after a Leader failure or transfer. A write that had not yet been committed when the Leader failed may be lost, but in that case the client never received a success response in the first place, so there is no need to worry.
TiDB’s read and write operations are consistent. According to your example, it is [Write: Primary], [Read: Primary].
It also supports [Read: Replica], but this must be enabled manually. Finally, please note: in TiDB, primary and replica refer to Raft replica roles within a Region, not to whole nodes. This is completely different from MySQL.
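For example, replica reads can be enabled per session through the `tidb_replica_read` system variable (`'leader'` is the default):

```sql
-- Route this session's read requests to follower replicas.
SET SESSION tidb_replica_read = 'follower';
SELECT * FROM t WHERE id = 1;  -- served by a follower, still consistent

-- Restore the default behavior of reading from the leader.
SET SESSION tidb_replica_read = 'leader';
```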
“Reading from a replica won't cause inconsistency, okay… A replica read still confirms consistency with the leader first (TiDB's Follower Read performs a ReadIndex check against the leader) rather than simply returning whatever data the follower holds locally.”
You can refer to the official documentation for consistency guarantees.
Region and Raft Protocol
Regions maintain data consistency across their replicas through the Raft protocol. A write request can only be processed on the Leader, and it must be persisted on a majority of replicas (with the default configuration of 3 replicas, at least 2 must acknowledge) before a successful write is returned to the client.
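For reference, the configured replica count can be inspected from SQL with `SHOW CONFIG`:

```sql
-- PD's replication.max-replicas controls the number of replicas per Region
-- (default 3, so a write must reach at least 2 replicas to commit).
SHOW CONFIG WHERE type = 'pd' AND name = 'replication.max-replicas';
```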
Consistent reads and writes are beyond doubt. If the Raft consensus algorithm could not guarantee consistency, its authors would have had to retract the paper long ago, and all the infrastructure built on it would collapse.
The only thing to note is that TiDB’s default transaction isolation level is RR, not RC.
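To check or change the level, the standard statements apply (TiDB reports the default as `REPEATABLE-READ`):

```sql
-- Inspect the session's current isolation level.
SELECT @@transaction_isolation;

-- Switch to read committed if RC semantics are preferred.
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
```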
Under repeatable read, a transaction can only see data that other transactions had already committed at the moment the transaction started. Data that is uncommitted, or committed by other transactions after the transaction started, is not visible. Within the transaction itself, later statements can see the modifications made by its own earlier statements.
```sql
-- Session 1 (transaction t1); table t with columns id, c is illustrative
BEGIN;
UPDATE t SET c = c + 1 WHERE id = 1;
-- Session 2 (transaction t2) starts before t1 commits
BEGIN;
-- Session 1
COMMIT;
-- Session 2: under RR the UPDATE is not visible, because t1 had not committed when t2 began
SELECT * FROM t WHERE id = 1;
COMMIT;
-- Session 2: a new (autocommit) read now sees the UPDATE
SELECT * FROM t WHERE id = 1;
```