Availability Issues in TiKV Region Leader Scheduling

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIKV的Region Leader调度过程中可用性问题

| username: TiDBer_7N2ShMS5

Excuse me, everyone. For load balancing, TiKV schedules and migrates Regions, moving them from high-load nodes to low-load nodes. For a Region whose leader is being migrated, how is data consistency guaranteed before and after the migration? Since the original node will still receive write requests during the scheduling process, will the corresponding data writes be prohibited while the scheduling is in progress?

In other words, when a region with write capability is migrated from a high-load node to a low-load node, how is data consistency ensured before and after the migration without stopping writes?

| username: Billmay表妹 | Original post link

In TiKV, when a Region leader is scheduled, data consistency is ensured before and after the scheduling. TiKV relies on the Raft consensus protocol to guarantee this.

In the Raft protocol, each Region has a Leader replica responsible for handling read and write requests. When a Region leader is scheduled to another node, the target replica is first brought fully up to date through Raft log replication, and only then is leadership handed over to it; a replica can only become Leader if its log is at least as up to date as a majority of the group.

Specifically, when scheduling a region leader, TiKV follows these steps to ensure data consistency:

  1. Log catch-up: the current Leader keeps replicating the Raft log to the target replica until that replica has all of the latest entries.

  2. Leader transfer: following the rules of the Raft protocol, the current Leader then asks the caught-up replica to start an election, and that replica takes over as the new Leader of the Region.

During this process, TiKV does not prohibit the corresponding data writes: write operations can still be performed while the Region leader is being scheduled, because the Raft protocol guarantees data consistency and availability across the Leader switch.
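Roughly speaking, the safety check behind a Raft leader transfer looks like the sketch below (a simplified illustration modeled on how etcd raft and raft-rs behave, not TiKV's exact code): the current leader only hands leadership to the target replica once that replica's log has caught up, and it keeps serving writes until then.

```go
package main

import "fmt"

// Simplified sketch, not TiKV's actual code: the leader tracks how far each
// follower's log has been replicated, and only completes a leader transfer
// once the target has caught up to the leader's last log index.

type progress struct {
	matchIndex uint64 // highest log index known to be replicated on this follower
}

func readyToTransfer(target progress, leaderLastIndex uint64) bool {
	return target.matchIndex >= leaderLastIndex
}

func main() {
	leaderLastIndex := uint64(120)

	lagging := progress{matchIndex: 95}
	caughtUp := progress{matchIndex: 120}

	fmt.Println(readyToTransfer(lagging, leaderLastIndex))  // false: keep replicating first
	fmt.Println(readyToTransfer(caughtUp, leaderLastIndex)) // true: tell the target to campaign now
}
```

Because the handoff only happens after this catch-up check passes, no acknowledged write can be lost by the switch.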

| username: TiDBer_7N2ShMS5 | Original post link

Leader scheduling and leader election are two different things; the leader scheduling I am referring to is the scheduling done for load balancing.

| username: Billmay表妹 | Original post link

Then you might need to describe the problem more clearly~

| username: TiDBer_7N2ShMS5 | Original post link

Okay, I’ll re-edit it.

| username: TiDBer_7N2ShMS5 | Original post link

Re-edited.

| username: zhanggame1 | Original post link

TiDB defaults to 3 replicas. Normally all 3 replicas are already in sync, so switching the leader to another replica is quick and has minimal impact.
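To see why the switch is cheap, here is a minimal sketch (illustrative only, not TiKV code) of how a Raft group decides its committed index: an entry counts as committed only once a majority of the 3 replicas have it, so at least one follower already holds every acknowledged write.

```go
package main

import (
	"fmt"
	"sort"
)

// committedIndex returns the highest log index acknowledged by a majority of
// replicas. Entries up to this index are durable on at least two of the three
// copies, so a caught-up follower can take over leadership without losing any
// acknowledged write.
func committedIndex(matchIndex []uint64) uint64 {
	sorted := append([]uint64(nil), matchIndex...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	// With n replicas, the value at position n/2 (0-based, descending order)
	// is replicated on a majority.
	return sorted[len(sorted)/2]
}

func main() {
	// Hypothetical match indexes for a 3-replica Region:
	// the leader at 105, one follower at 105, one lagging follower at 98.
	fmt.Println(committedIndex([]uint64{105, 105, 98})) // 105
}
```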

| username: TiDBer_7N2ShMS5 | Original post link

Thank you for your reply. What I want to discuss is not the leader election switch, but how data consistency is ensured, without stopping writes, when a Region with write capability is migrated from a high-load node to a low-load node.

| username: Kongdom | Original post link

Different data versions and transactions should also be controlled by TSO timestamps.

| username: TiDBer_7N2ShMS5 | Original post link

Could you please elaborate a bit more?

| username: TiDBer_7N2ShMS5 | Original post link

Does anyone know? Is the post going to sink? :joy:

| username: Soysauce520 | Original post link

Raft log

| username: TiDBer_7N2ShMS5 | Original post link

Could you please elaborate?

| username: Soysauce520 | Original post link

Writing to a Region also involves writing the Raft log first. You can check this document on high concurrency best practices; high-concurrency writes are inseparable from scheduling and elections. There are also parameters in pd-ctl for controlling scheduling; you can take a look at those.
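On the scheduling-control side, the point is that PD only runs a bounded amount of balancing work at a time (the kind of limits pd-ctl exposes, for example leader-schedule-limit and region-schedule-limit), so foreground writes are never stopped; they just coexist with a throttled amount of background movement. A rough sketch of that idea (conceptual only, not PD's actual code):

```go
package main

import "fmt"

// Conceptual sketch of throttled scheduling: only a bounded number of balance
// operators run at the same time; anything over the limit waits for the next
// scheduling round instead of blocking client traffic.

type operator struct {
	kind     string // e.g. "transfer-leader" or "move-region"
	regionID uint64
}

func schedule(pending []operator, running int, limit int) []operator {
	var started []operator
	for _, op := range pending {
		if running >= limit {
			break // over the limit: leave the rest for the next round
		}
		started = append(started, op)
		running++
	}
	return started
}

func main() {
	pending := []operator{
		{kind: "transfer-leader", regionID: 1},
		{kind: "move-region", regionID: 2},
		{kind: "move-region", regionID: 3},
	}
	// With a limit of 2 and nothing running yet, only two operators start now.
	for _, op := range schedule(pending, 0, 2) {
		fmt.Printf("start %s for region %d\n", op.kind, op.regionID)
	}
}
```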

| username: TiDBer_7N2ShMS5 | Original post link

Thank you for the reply, but this document still doesn’t answer my question. The article is rather vague; it only mentions splitting Regions via split and doesn’t go into detail on how scheduling is carried out while maintaining data consistency.

| username: Kongdom | Original post link

Check this out:

| username: Soysauce520 | Original post link

As for “during the scheduling process, the original node will still receive write requests — will the corresponding data writes be prohibited?”: no, writes are not prohibited, because everything goes through the log. The Raft log is written first, and the Region applies the log afterwards. Region scheduling and log application should be viewed separately; data is not written to the Region directly.
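To make “the Raft log is written first, and the Region applies the log afterwards” concrete, here is a minimal sketch (not TiKV's real implementation). Membership changes driven by scheduling are themselves just entries in the same Raft log, applied in order alongside ordinary writes, which is why writes never have to stop:

```go
package main

import "fmt"

// Minimal sketch of an apply loop: normal writes and scheduling-driven
// membership changes travel through the same Raft log and are applied
// strictly in log order.

type entryType int

const (
	entryNormal     entryType = iota // an ordinary key-value write
	entryConfChange                  // add/remove a peer, driven by scheduling
)

type entry struct {
	index uint64
	typ   entryType
	data  string
}

func apply(committed []entry) {
	for _, e := range committed {
		switch e.typ {
		case entryNormal:
			fmt.Printf("index %d: apply write %q\n", e.index, e.data)
		case entryConfChange:
			fmt.Printf("index %d: apply membership change %q\n", e.index, e.data)
		}
	}
}

func main() {
	// Writes and a scheduling-driven membership change interleave in the log.
	apply([]entry{
		{1, entryNormal, "put k1=v1"},
		{2, entryConfChange, "add peer on store 4"},
		{3, entryNormal, "put k2=v2"},
	})
}
```

In other words, “applying the log” just means executing committed entries against the local key-value data, and scheduling only changes which stores hold the peers that do this.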

| username: TiDBer_7N2ShMS5 | Original post link

I don’t quite understand what “the Region applies the log afterwards” means. Is the Region after scheduling a follower of the Region before scheduling?

| username: TiDBer_7N2ShMS5 | Original post link

Sir, I’m not very familiar with the TiKV code. After reading it several times, my impression is that each time a leader Region is scheduled, a peer in follower state is first added on the target store, the data is then synchronized to it, that peer is then made the new leader, and finally the old leader peer is removed. Is that correct? Please correct me if I’m wrong!

| username: 大飞哥online | Original post link

The general flow is: first transfer the leader to a follower on a low-load store to take over; then pick another low-load server, create a replica there, and once the Raft synchronization is complete, delete the replica on the originally high-load store. That’s roughly the process.
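Either ordering works; the key point is that a fully caught-up replica always exists before anything is removed, and the leader is always an up-to-date replica, so writes are never stopped. A rough sketch of one such sequence (hypothetical step names, not PD's actual operator code):

```go
package main

import "fmt"

// Conceptual sketch of the balancing flow described above: add a new peer on
// the target store, bring it up to date, move leadership off the hot store,
// and only then remove the old peer. Writes keep flowing through the Raft
// leader the whole time.

type step struct {
	name string
	desc string
}

func moveRegionSteps(regionID, fromStore, toStore uint64) []step {
	return []step{
		{"add-peer", fmt.Sprintf("create a new peer of region %d on store %d", regionID, toStore)},
		{"catch-up", "replicate the Raft log (snapshot + entries) until the new peer is up to date"},
		{"transfer-leader", fmt.Sprintf("hand leadership to an up-to-date peer off store %d", fromStore)},
		{"remove-peer", fmt.Sprintf("remove the old peer of region %d from store %d", regionID, fromStore)},
	}
}

func main() {
	for i, s := range moveRegionSteps(42, 1, 4) {
		fmt.Printf("%d. %-16s %s\n", i+1, s.name, s.desc)
	}
}
```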