Help needed for tinykv's project3b

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 求助tinyky的project3b

| username: TiDBer_wJT8329l

While working on this project, I found that a new node created by the AddNode command needs to receive a snapshot to catch up on the data. While implementing node creation, I ran into the following issue:

The snapshot always seems to be generated from the latest Region configuration, but that configuration and the meta.Index recorded in the snapshot do not correspond to the same point in time. The scenario is as follows.

At index=14 there is only one node, node 1. At index=15, node 2 is added. After node 2 is created, it needs to request a snapshot. Node 1 first sent a snapshot with meta.Index=5 and peers=[1,2,3,4,5] (this test case starts with five nodes, which are removed down to one before the new node is added). That snapshot was discarded because its configuration was outdated. Node 1 then generated a new snapshot and sent node 2 a snapshot with peers=[1,2] but meta.Index=14. I don't understand why such a snapshot was sent.

I believe meta.Index should be the index of the last log entry included in the snapshot, and that the configuration carried with it should therefore be peers=[1].

Is there a bug in my previous experiment?
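
To make what I expect concrete, here is a minimal sketch of the invariant I have in mind, using simplified stand-in types rather than the real eraftpb messages (the helper and the concrete values are only illustrative):

```go
package main

import "fmt"

// Simplified stand-ins for snapshot metadata and membership; the real
// project uses eraftpb.SnapshotMetadata and eraftpb.ConfState.
type ConfState struct{ Nodes []uint64 }

type SnapshotMetadata struct {
	Index     uint64
	Term      uint64
	ConfState ConfState
}

// buildSnapshotMeta is a hypothetical helper stating the invariant: the
// ConfState stored in the snapshot should be the membership as of the
// applied index the snapshot covers, not the latest in-memory config.
func buildSnapshotMeta(appliedIndex, appliedTerm uint64, peersAtApplied []uint64) SnapshotMetadata {
	return SnapshotMetadata{
		Index:     appliedIndex, // last log entry covered by the snapshot
		Term:      appliedTerm,  // term of that entry
		ConfState: ConfState{Nodes: peersAtApplied},
	}
}

func main() {
	// In the scenario above: index 14 was applied while membership was [1],
	// so a snapshot with meta.Index=14 should carry peers=[1].
	meta := buildSnapshotMeta(14, 0 /* term not relevant here */, []uint64{1})
	fmt.Printf("index=%d peers=%v\n", meta.Index, meta.ConfState.Nodes)
}
```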

| username: xfworld | Original post link

I don’t understand what you’re asking…

| username: Jellybean | Original post link

After reading, I’m completely confused. What exactly are you trying to ask?

| username: TiDBer_wJT8329l | Original post link

I think the problem is with the snapshot that was sent, where meta.Index=14 but peers=[1,2]. At index=14 the membership is still peers=[1]; peers=[1,2] is the result of applying the AddNode entry at index=15.
So it should be either meta.Index=15 with peers=[1,2], or meta.Index=14 with peers=[1].
A snapshot with meta.Index=14 but peers=[1,2] looks wrong to me.

| username: TiDBer_jYQINSnf | Original post link

Please take a screenshot of the corresponding code location and send it over, so I can take a look at the logic around it.

| username: TiDBer_wJT8329l | Original post link

Here, the validate function requires that the snapshot carry the latest configuration.
Through debugging, I found that the meta.Index in the snapshot is not the same value as compactIndex, which I find very strange.
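
Roughly, the staleness check I mean compares configuration versions, something like this sketch (stand-in types, not the framework's actual signatures):

```go
package main

import "fmt"

// Stand-in for the region epoch carried with snapshot data; ConfVer is
// bumped on every membership change, Version on every split/merge.
type RegionEpoch struct{ ConfVer, Version uint64 }

// snapshotIsStale sketches the validate-style rule: a snapshot whose conf
// version is older than the peer's current conf version no longer reflects
// the latest configuration and gets discarded.
func snapshotIsStale(snapEpoch, currentEpoch RegionEpoch) bool {
	return snapEpoch.ConfVer < currentEpoch.ConfVer
}

func main() {
	// e.g. a snapshot generated before several conf changes were applied
	fmt.Println(snapshotIsStale(RegionEpoch{ConfVer: 5}, RegionEpoch{ConfVer: 9})) // true
}
```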

| username: TiDBer_wJT8329l | Original post link

My bad, there was a mistake in my code :innocent:; the debugging issue above was caused by my own code.
However, I'm still curious about the relationship between the snapshot's meta.Index and compactLog. I noticed that although the leader sent a snapshot, it neither applied that snapshot itself nor truncated the log entries covered by it.
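
My current understanding is that snapshot generation and log compaction advance different indexes, roughly like this (illustrative state only, not the real apply-state fields):

```go
package main

import "fmt"

// Illustrative per-peer state; the real project persists something similar
// (an applied index plus a truncated state) for each peer.
type peerState struct {
	appliedIndex   uint64 // advanced as committed entries are applied
	truncatedIndex uint64 // advanced when a CompactLog admin command is applied
}

// A snapshot is built from the applied state at generation time, so its
// meta.Index is the applied index then, which need not equal the compact
// index. Sending a snapshot to a lagging follower does not, by itself,
// apply that snapshot on the leader or truncate the leader's log.
func snapshotIndex(s peerState) uint64 { return s.appliedIndex }

func main() {
	s := peerState{appliedIndex: 14, truncatedIndex: 5}
	fmt.Println(snapshotIndex(s), s.truncatedIndex) // 14 5
}
```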

| username: TiDBer_jYQINSnf | Original post link

If the log entries at that index have already been compacted, the snapshot will need to be resent.
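
Roughly, the leader-side decision looks like this (illustrative names, not TinyKV's exact API):

```go
package main

import "fmt"

// sendAppendOrSnapshot sketches the fallback: if the follower's next index
// falls below the first index still kept in the leader's log (everything
// earlier was compacted), a normal append is impossible, so a snapshot is
// sent; if the follower still cannot catch up later, it gets sent again.
func sendAppendOrSnapshot(nextIdx, firstKeptIdx uint64) string {
	if nextIdx < firstKeptIdx {
		return "MsgSnapshot" // needed entries were compacted away
	}
	return "MsgAppend" // entries are still available in the raft log
}

func main() {
	fmt.Println(sendAppendOrSnapshot(6, 15))  // follower far behind -> MsgSnapshot
	fmt.Println(sendAppendOrSnapshot(16, 15)) // follower nearly caught up -> MsgAppend
}
```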

| username: TiDBer_wJT8329l | Original post link

Why is 3b so hard to get right? The test case is TestConfChangeRemoveLeader3B, and I've run into another baffling issue.

Here’s a rough description of the problem I encountered:
In this test case, node IDs and store IDs are the same.

The cluster is initialized with peers [1, 2, 3, 4, 5].
After several node additions and removals, the membership ends up as [5].
Then another node is added and the configuration becomes [5, 1].

Then the test case gets stuck. I found that leader 5 keeps sending snapshots and heartbeats to follower 1, but peer 1 was never created on store 1, and store 1 isn't receiving the messages either.


That’s roughly the problem. Please give me some debugging ideas.

| username: TiDBer_jYQINSnf | Original post link

In TiKV, a peer is created on a store the first time that store receives a raft message (such as a heartbeat) for it. If the target peer does not yet exist in the batch system, a control message is sent to the store; on receiving it, the store creates the peer and then forwards the original message to it.
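
A much-simplified sketch of that flow, with illustrative types rather than TiKV's or TinyKV's actual structures:

```go
package main

import "fmt"

type RaftMessage struct {
	RegionID uint64
	ToPeerID uint64
}

type Peer struct{ regionID uint64 }

func (p *Peer) step(msg RaftMessage) {
	fmt.Printf("region %d: peer handling message for peer %d\n", p.regionID, msg.ToPeerID)
}

type Store struct {
	peers map[uint64]*Peer // region id -> peer fsm
}

// onRaftMessage models the store-level handling: when a message arrives for
// a peer that does not exist on this store yet, the store creates the peer
// first (TiKV does this via a control message to the store fsm) and then
// forwards the message that triggered the creation.
func (s *Store) onRaftMessage(msg RaftMessage) {
	p, ok := s.peers[msg.RegionID]
	if !ok {
		p = &Peer{regionID: msg.RegionID}
		s.peers[msg.RegionID] = p
	}
	p.step(msg)
}

func main() {
	s := &Store{peers: map[uint64]*Peer{}}
	s.onRaftMessage(RaftMessage{RegionID: 1, ToPeerID: 1}) // peer is created on demand
}
```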

| username: TiDBer_wJT8329l | Original post link

Yes, thank you. I think the problem lies in the creation of the node.
When there is only one node, an AddNode command reaches consensus and is applied immediately; but if the newly added node never comes online, the Raft group becomes unavailable: a two-node configuration needs both nodes to form a majority, so without the other node nothing more can be committed (see the quorum sketch below).
Chapter 4 of the Raft author's PhD thesis suggests that a new server should first join as a non-voting learner, so that it is caught up and known to be available before it is promoted to a voting member.
So the question is, does tinykv’s 3b need to add a learner mechanism to pass this test case?
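
For concreteness, the quorum arithmetic behind my concern:

```go
package main

import "fmt"

// quorum returns the number of voters required to commit an entry.
func quorum(clusterSize int) int { return clusterSize/2 + 1 }

func main() {
	fmt.Println(quorum(1)) // 1: a single node commits on its own, so AddNode applies immediately
	fmt.Println(quorum(2)) // 2: both peers must acknowledge, so a peer that never comes up blocks everything
}
```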

| username: TiDBer_wJT8329l | Original post link

It seems the peer still can't be created. I tried sending a heartbeat before proposing, but that heartbeat was rejected. This is my logic.
Is there any way to actively create the peer here?


| username: TiDBer_jYQINSnf | Original post link

The destination store rejected the message. Let me show you the relevant code location in TiKV, maybe_create_peer:
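
The gist of the check in that function is that only an "initial" message, i.e. a vote request or a heartbeat that carries no commit index yet, may cause the store to create a peer that does not exist. A simplified sketch of that gate (stand-in types, not the real TiKV code; if I remember right, TinyKV's store worker has analogous logic around maybeCreatePeer):

```go
package main

import "fmt"

const RaftInvalidIndex uint64 = 0

type MsgType int

const (
	MsgRequestVote MsgType = iota
	MsgHeartbeat
	MsgAppend
)

type RaftMsg struct {
	Type   MsgType
	Commit uint64
}

// isInitialMsg models the gate: only a vote request, or a heartbeat with no
// commit index, may cause the store to create a missing peer. Any other
// message addressed to an unknown peer is dropped.
func isInitialMsg(m RaftMsg) bool {
	return m.Type == MsgRequestVote ||
		(m.Type == MsgHeartbeat && m.Commit == RaftInvalidIndex)
}

func main() {
	fmt.Println(isInitialMsg(RaftMsg{Type: MsgHeartbeat, Commit: RaftInvalidIndex})) // true: may create the peer
	fmt.Println(isInitialMsg(RaftMsg{Type: MsgHeartbeat, Commit: 14}))               // false: dropped
}
```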

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.