[Experience & Questions] Experiences and Questions about TestOneSplit

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: [经验贴&疑问贴]关于TestOneSplit的经验和疑问

| username: 赖沐曦_bullfrog

When working on lab3b, I ran into two bugs while testing TestOneSplit. The first bug occasionally caused a timeout: the partition happens right after leader election and produces a split-brain situation, so the client has trouble finding the real leader, message synchronization is slow, and the test eventually times out.

Today, while searching for a solution to the second bug (the logs were not helpful /(ㄒoㄒ)/~~), I found the post [经验分享] 关于 oneSplitTest 卡死的解决方法 ("[Experience Sharing] A fix for oneSplitTest getting stuck") on the TiDB Q&A community, where someone had already written up a solution to the first bug. Here I share my similar but easier-to-implement approach:

First, add two fields to the Raft struct for the leader lease, analogous to the existing heartbeat timer fields:

// leader lease, ticked the same way as the heartbeat timer
leaseElapsed int // ticks since the current lease window started
leaseTimeout int // length of the lease window in ticks

Then, add a field in Progress to determine if a heartbeat was received:

type Progress struct {
	Match, Next uint64
	// isHeartbeat records whether this peer responded (heartbeat or append
	// response) since the last lease check, i.e. a recent successful interaction.
	isHeartbeat bool
}

Each time the leader receives a heartbeat response or append response, set the corresponding peer's isHeartbeat to true. Then tick the lease just like the heartbeat: when the lease expires, check how many peers have responded during the window, and the rest of the implementation follows from there (see the sketch below).
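For reference, here is a minimal sketch of what the lease tick could look like, assuming the TinyKV raft skeleton (r.Prs, becomeFollower, None); the function name tickLease and the exact quorum check are my own choices and may differ from the linked post:

// On the leader, when handling MsgHeartbeatResponse or MsgAppendResponse:
//     r.Prs[m.From].isHeartbeat = true
//
// tickLease is driven from tick() while the node is the leader (sketch only).
func (r *Raft) tickLease() {
	r.leaseElapsed++
	if r.leaseElapsed < r.leaseTimeout {
		return
	}
	r.leaseElapsed = 0
	// Count peers that responded during this lease window; the leader counts itself.
	active := 1
	for id, pr := range r.Prs {
		if id == r.id {
			continue
		}
		if pr.isHeartbeat {
			active++
		}
		pr.isHeartbeat = false // reset for the next window
	}
	// Without a quorum the leader is probably on the minority side of the
	// partition, so it gives up leadership instead of keeping clients waiting.
	if 2*active <= len(r.Prs) {
		r.becomeFollower(r.Term, None)
	}
}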

However, this only solves the first bug. The second bug occurs in the following test (with about a 1/30 chance):

func TestOneSplit3B(t *testing.T) {
	... // omitted
	req := NewRequest(left.GetId(), left.GetRegionEpoch(), []*raft_cmdpb.Request{NewGetCfCmd(engine_util.CfDefault, []byte("k2"))})
	resp, _ := cluster.CallCommandOnLeader(&req, time.Second)
	assert.NotNil(t, resp.GetHeader().GetError()) // fails
	assert.NotNil(t, resp.GetHeader().GetError().GetKeyNotInRegion()) // fails

	MustGetEqual(cluster.engines[5], []byte("k100"), []byte("v100"))
}

I have already checked the keyInRegion, regionEpoch, and regionId handling. Why does this still happen? Has anyone who hit something similar got any thoughts or ideas?


Update:
The final solution was to prevent the leader from stepping down when it receives a heartbeat response or append response, even if the message carries a higher term. Modifying the related raft handling resolved the issue (roughly as in the sketch below).
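A simplified sketch of that change in Step (pb is the eraftpb alias used in raft.go; the real handler has more branches, and vote messages keep their usual treatment):

func (r *Raft) Step(m pb.Message) error {
	if m.Term > r.Term {
		switch m.MsgType {
		case pb.MessageType_MsgHeartbeatResponse, pb.MessageType_MsgAppendResponse:
			// Do not step down just because a response carries a higher term;
			// otherwise the leader keeps flapping during the partition.
		default:
			r.becomeFollower(m.Term, None)
		}
	}
	// ... dispatch on r.State and m.MsgType as usual
	return nil
}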

| username: T0V1P_萝卜头 | Original post link

Regarding the first bug, I haven’t encountered this issue before. My understanding is:

  1. In the case of a split-brain scenario, the PD side will alternately update the original region’s information (sometimes the leader is 5, sometimes the leader is 1-4).
  2. But once the split is completed, the RegionHeartBeat from 5 will be consistently ignored (comparing RegionEpoch will show that 5 is stale).
  3. Even before the split completes, CallCommandOnLeader keeps retrying: if the current leader times out, it randomly picks another peer to call, and if that peer returns a NotLeader error, the error says who the correct leader is. The whole CallCommandOnLeader keeps being retried as long as there is no result, up to 5 seconds or 10 attempts, so it seems unlikely that the leader could never be found?

Regarding the second bug: after the split has completed, my understanding is that this request will still normally reach the leader and go through Raft entry replication. I'm not sure whether there is an Epoch/Key-in-region check during the apply phase. An entry can look valid at propose time, but if the RegionEpoch or the StartKey/EndKey has changed by the time it is applied, the apply should return an error. If this reply comes back without an error, does it return a value? If it does, there is probably still a logic problem somewhere, because querying a key that is no longer within the range of the split region clearly should not return a value. (A sketch of the apply-time check follows.)
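For illustration, a minimal sketch of such an apply-time re-check, assuming the util helpers from the TinyKV skeleton (util.CheckRegionEpoch and util.CheckKeyInRegion) and a peerMsgHandler method name of my own choosing:

// Re-validate a Get/Put/Delete entry at apply time; the region may have been
// split between propose and apply, so propose-time checks alone are not enough.
func (d *peerMsgHandler) checkAtApply(req *raft_cmdpb.RaftCmdRequest, key []byte) error {
	region := d.Region()
	// Stale epoch: the region was split or changed; respond with ErrEpochNotMatch.
	if err := util.CheckRegionEpoch(req, region, true); err != nil {
		return err
	}
	// Key left this region after the split; respond with ErrKeyNotInRegion.
	if err := util.CheckKeyInRegion(key, region); err != nil {
		return err
	}
	return nil
}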

| username: 赖沐曦_bullfrog | Original post link

First of all, thank you for your reply, which helped me confirm the direction. Regarding the first bug, that was my miscommunication; your answer is the correct explanation. For the second issue, I had already checked all the scenarios listed in the guidebook, but the error still occurred. In the end it seems the problem was that the election among the peers after the split was too slow, causing a timeout: the request returns nil because the timeout in the test is only 1 second.

| username: system | Original post link

This topic was automatically closed 1 minute after the last reply. No new replies are allowed.