Note:
This topic has been translated from a Chinese forum by GPT and might contain errors. Original topic: [经验贴&疑问贴]关于TestOneSplit的经验和疑问 (Experience & question post: experiences and questions about TestOneSplit)
While working on lab3b, I hit two bugs when running TestOneSplit. The first occasionally caused a timeout: a network partition after leader election could leave the cluster in a split-brain state. The client then had trouble finding the real leader, so message synchronization slowed down until the test timed out.
Today, while searching for a fix for the second bug (the logs were no help /(ㄒoㄒ)/~~), I found the post [经验分享] 关于 oneSplitTest 卡死的解决方法 (Experience sharing: a fix for oneSplitTest hanging) on the TiDB Q&A community, where someone had already described a solution to the first bug. Here is my solution, which is similar but easier to implement:
First, add two fields for the leader lease, analogous to the heartbeat counters:
// leader lease
leaseElapsed int
leaseTimeout int
Then, add a field to Progress that records whether that peer has responded recently:
type Progress struct {
	Match, Next uint64
	// Used to determine if there was a recent successful interaction with this peer
	isHeartbeat bool
}
Each time the leader receives a heartbeat response or an append response, set the corresponding Progress's isHeartbeat to true. Then tick the lease the same way as the heartbeat; the remaining implementation details follow naturally from there.
However, this only fixes the first bug. The second bug shows up in the following test, in roughly 1 out of 30 runs:
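The post leaves the rest of the lease logic to the reader. One possible reading, as a minimal sketch: when the lease timer expires, the leader counts the peers that responded during the window and steps down if they do not form a quorum. The trimmed-down Raft struct, the isLeader flag, and the helpers markActive, tickLease, and newLeader are assumptions made for illustration; they are not taken from the original post or from the lab code.

```go
package main

import "fmt"

// Progress mirrors the struct in the post; isHeartbeat marks a recent response.
type Progress struct {
	Match, Next uint64
	isHeartbeat bool
}

// Raft holds only the fields this sketch needs (hypothetical, trimmed down).
type Raft struct {
	id           uint64
	isLeader     bool
	Prs          map[uint64]*Progress
	leaseElapsed int
	leaseTimeout int
}

// newLeader builds a leader with the given peers (hypothetical helper).
func newLeader(id uint64, peers []uint64, leaseTimeout int) *Raft {
	prs := make(map[uint64]*Progress, len(peers))
	for _, p := range peers {
		prs[p] = &Progress{}
	}
	return &Raft{id: id, isLeader: true, Prs: prs, leaseTimeout: leaseTimeout}
}

// markActive is meant to be called when a heartbeat response or an
// append response arrives from peer `from`.
func (r *Raft) markActive(from uint64) {
	if pr, ok := r.Prs[from]; ok {
		pr.isHeartbeat = true
	}
}

// tickLease advances the lease timer. When it expires, the leader counts
// itself plus every peer that responded during the window, clears the
// flags, and steps down if the count is not a majority.
func (r *Raft) tickLease() {
	if !r.isLeader {
		return
	}
	r.leaseElapsed++
	if r.leaseElapsed < r.leaseTimeout {
		return
	}
	r.leaseElapsed = 0
	active := 0
	for id, pr := range r.Prs {
		if id == r.id || pr.isHeartbeat {
			active++
		}
		pr.isHeartbeat = false
	}
	if active <= len(r.Prs)/2 {
		r.isLeader = false // lost contact with the majority: step down
	}
}

func main() {
	r := newLeader(1, []uint64{1, 2, 3, 4, 5}, 3)
	r.markActive(2)
	r.markActive(3)
	for i := 0; i < 3; i++ {
		r.tickLease()
	}
	fmt.Println("leader after first lease window:", r.isLeader)
	for i := 0; i < 3; i++ {
		r.tickLease()
	}
	fmt.Println("leader after silent lease window:", r.isLeader)
}
```

In a real implementation, stepping down would go through whatever becomeFollower-style transition the Raft module already has, rather than flipping a flag.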
func TestOneSplit3B(t *testing.T) {
	... // omitted
	req := NewRequest(left.GetId(), left.GetRegionEpoch(), []*raft_cmdpb.Request{NewGetCfCmd(engine_util.CfDefault, []byte("k2"))})
	resp, _ := cluster.CallCommandOnLeader(&req, time.Second)
	assert.NotNil(t, resp.GetHeader().GetError()) // fails
	assert.NotNil(t, resp.GetHeader().GetError().GetKeyNotInRegion()) // fails
	MustGetEqual(cluster.engines[5], []byte("k100"), []byte("v100"))
}
I have already verified the key-in-region, regionEpoch, and regionId checks. Why does this still happen? Has anyone run into something similar, or does anyone have ideas?
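For context, the two failing assertions expect the request for k2, sent to the left region, to be rejected as key-not-in-region. The range check behind that can be sketched roughly as below. The Region struct, the keyInRegion function, and the split key "k150" are hypothetical stand-ins chosen for illustration; the sketch assumes the usual convention that a region covers [StartKey, EndKey) and that an empty EndKey means unbounded.

```go
package main

import (
	"bytes"
	"fmt"
)

// Region is a hypothetical stand-in holding only the key range.
type Region struct {
	StartKey, EndKey []byte
}

// keyInRegion reports whether key falls in [StartKey, EndKey).
// An empty EndKey is treated as "no upper bound".
func keyInRegion(key []byte, r Region) bool {
	if bytes.Compare(key, r.StartKey) < 0 {
		return false
	}
	return len(r.EndKey) == 0 || bytes.Compare(key, r.EndKey) < 0
}

func main() {
	// "k150" is an assumed split key, used only for this example.
	left := Region{StartKey: nil, EndKey: []byte("k150")}
	right := Region{StartKey: []byte("k150"), EndKey: nil}

	fmt.Println(keyInRegion([]byte("k1"), left))  // k1 sorts before k150
	fmt.Println(keyInRegion([]byte("k2"), left))  // k2 sorts after k150
	fmt.Println(keyInRegion([]byte("k2"), right)) // k2 belongs to the right region
}
```

If a check like this passes on a peer that has not yet applied the split, k2 still looks like it is in range there, which is one way the assertions could fail intermittently.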
Update:
The final solution was to stop the leader from stepping down when it receives a heartbeat response or an append response, even if that response carries a higher term. After modifying the Raft implementation accordingly, the issue was resolved.
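A minimal sketch of that change, assuming a simplified Step function that only performs the term check. The MsgType constants, the Message and Raft structs, and the isResponse helper are hypothetical and stripped down for illustration. Note that standard Raft has a server revert to follower on any message with a higher term, so this is a deliberate deviation that worked for this lab rather than a general recommendation.

```go
package main

import "fmt"

// MsgType is a hypothetical, reduced set of Raft message types.
type MsgType int

const (
	MsgAppend MsgType = iota
	MsgAppendResponse
	MsgHeartbeat
	MsgHeartbeatResponse
	MsgRequestVote
)

// Message carries only the fields this sketch needs.
type Message struct {
	Type MsgType
	From uint64
	Term uint64
}

// Raft holds only the term and role (hypothetical, trimmed down).
type Raft struct {
	Term     uint64
	isLeader bool
}

// isResponse reports whether m is a heartbeat or append response.
func isResponse(t MsgType) bool {
	return t == MsgAppendResponse || t == MsgHeartbeatResponse
}

// Step applies the higher-term rule, except that a leader ignores
// higher-term heartbeat and append responses instead of stepping down.
func (r *Raft) Step(m Message) {
	if m.Term <= r.Term {
		return // normal handling would continue here
	}
	if r.isLeader && isResponse(m.Type) {
		return // the fix from the post: stay leader, drop the response
	}
	r.Term = m.Term
	r.isLeader = false
}

func main() {
	r := &Raft{Term: 5, isLeader: true}

	r.Step(Message{Type: MsgHeartbeatResponse, From: 2, Term: 6})
	fmt.Println("after higher-term heartbeat response:", r.isLeader, r.Term)

	r.Step(Message{Type: MsgRequestVote, From: 2, Term: 6})
	fmt.Println("after higher-term vote request:", r.isLeader, r.Term)
}
```

Other higher-term messages, such as vote requests or appends from a new leader, still cause a step-down in this sketch, so a partitioned leader can still be replaced through the normal election path.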