Note:
This topic has been translated from a Chinese forum by GPT and might contain errors. Original topic: 3副本tikv如果挂掉一个节点,剩下的2个会是强同步吗?

If one of the three TiKV replicas goes down, will the remaining two be strongly synchronized?
At this point, an additional node needs to be added…
The remaining 2 nodes are strongly synchronized because the Raft protocol itself is strongly consistent.
In other words, with 3 TiKV nodes, if one TiKV node goes down, the remaining two are still strongly consistent. After a while, if another TiKV node also goes down, the cluster will be unable to serve.
Then, if we perform unsafe recovery with the single remaining TiKV, is it highly likely that no data will be lost?
Yes. By the majority principle, while only 2 nodes remained, having just 1 node synchronized would not satisfy the majority.
The replicas are not lost, so why use unsafe recover? Adding a node will replenish the replicas. If all replicas are lost, then unsafe recover is needed.
Three replicas, one goes down, leaving two replicas. After a while, another one goes down, leaving only one replica. In this case, perform an unsafe recovery.
If data can’t be written or read, is there any data loss?
What is the concept of data loss? Is it loss caused by service interruption, or is it loss caused by issues after data is written? Or something else?
Three nodes, if one goes down, the remaining two maintain strong consistency and can continue to provide service. To replenish the replica, simply add a new node and it will automatically be replenished.
Three nodes, if one goes down, it becomes a two-replica setup. After some time, if another node goes down, the cluster will stop providing service and become inaccessible. To resume service, at least one more node must be added to restore it to a two-replica setup, and unsafe recovery is not needed.
Three nodes, if two go down simultaneously, the cluster cannot provide service and there is a high risk of data loss. In this case, simply adding nodes will not automatically replenish the replicas, and unsafe recovery is required.
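To make the majority rule behind these three scenarios concrete, here is a minimal sketch (illustrative Python, not TiKV code) of the Raft quorum arithmetic that decides whether a Region’s Raft group can keep serving:

```python
# A minimal sketch of the Raft majority rule behind the three scenarios above.
# Replica counts are illustrative; in TiKV, PD schedules replicas per Region.

def can_serve(total_replicas: int, alive_replicas: int) -> bool:
    """A Raft group keeps serving only while a majority of its replicas are alive."""
    return alive_replicas >= total_replicas // 2 + 1

# 3 replicas, 1 store down: 2 alive >= 2 (majority) -> still serving, strongly consistent.
print(can_serve(3, 2))   # True

# 3 replicas, 2 stores down before the replica is replenished: 1 alive < 2 -> unavailable.
print(can_serve(3, 1))   # False

# If PD has already replenished the lost replica on another store before the
# second failure, the group is back to 3 alive replicas and can lose one more.
print(can_serve(3, 3))   # True
```

The second failure is only survivable if the replica lost in the first failure was replenished on some other store in the meantime; otherwise the group drops below a majority and becomes unavailable.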
The difference between these two situations is just how much time passes in between. So how much time is that? Why does the same end state, only one replica left, not need unsafe recovery in one case but need it in the other?
What if: when 3 replicas drop to 2, one of the remaining replicas might not have caught up with the Raft log at that moment. If the more up-to-date replica then goes down, only the one that hasn’t caught up is left. In that case, running unsafe recovery on that replica would lose some data, right?
When I say down, I mean completely gone, not just temporarily down and able to come back up.
With 3 nodes, by the majority principle, either 2 or 3 of them hold the up-to-date data. If two nodes fail at the same time, it’s possible that both up-to-date ones fail, leaving only the lagging one, which would mean data loss. If nodes fail one at a time and the failed one is up to date, then one up-to-date node and one lagging node remain; the up-to-date one first has to bring the lagging one up to date. If that synchronization completes and then another node fails, the remaining one is still up to date. But if synchronization hasn’t completed and the up-to-date one fails, then there’s a problem. In theory synchronization should be very fast, so if one node fails and another fails in quick succession, before there is time to synchronize, it’s almost equivalent to two nodes failing simultaneously.
Either way, regardless of the order in which they crash, you end up having to use unsafe recovery; it’s just a matter of which case has a lower chance of losing data.
With 3 nodes, if one node goes down first, and after 5 years another node goes down, you won’t lose data… The data on the remaining 2 nodes will definitely be consistent… The scenario you mentioned about one node going down and then another one going down mainly depends on whether the remaining two nodes have completed data synchronization in the meantime.
Looking at the code, it roughly works like this:
There are three replicas A, B, and C. If C goes down, only A and B are left.
Assuming A is elected as the leader, when proposing, if B’s progress is always lagging, the commit index cannot move forward.
It’s equivalent to not being able to write.
If B lags too much, A might send a snapshot to B.
So, if there are three replicas and C goes down, leaving two replicas, there’s no guarantee that A and B’s data will be consistent.
If this region can write data, it means A and B are consistent.
In other words: A, B, and C are working normally, then C goes down. If every Region in the cluster can still write a piece of data and commit it successfully, that means A and B are strongly consistent; in that case, if B then goes down, running unsafe recovery on A won’t lose data.
However, there’s no guarantee that every Region in the cluster has written data while there were only two replicas. It’s also possible that after C goes down, A sends a snapshot to B but B never successfully applies it, so B keeps lagging and the Region stays unwritable. If A then goes down and the lagging B performs unsafe recovery, its data won’t be the latest.
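To illustrate the commit rule described above, here is a minimal sketch (illustrative Python, not the actual raft-rs code) of how a leader can only advance the commit index to an entry that a majority of the group has persisted:

```python
# Minimal sketch of the Raft commit rule (not raft-rs code): the leader can only
# commit up to the highest log index that a majority of the group has persisted.

def committed_index(match_index):
    """match_index maps each replica (leader included) to the last log index it holds."""
    indexes = sorted(match_index.values(), reverse=True)
    majority = len(match_index) // 2 + 1
    return indexes[majority - 1]  # highest index held by at least a majority

# A, B, C all healthy: a majority of 3 is 2, so commits follow the second-highest replica.
print(committed_index({"A": 100, "B": 100, "C": 90}))  # 100

# C is gone but still a member of the group, so A and B must both hold an entry
# before it can commit; if B lags, the commit index stalls at B's progress.
print(committed_index({"A": 120, "B": 100, "C": 0}))   # 100
```

So if the Region can still acknowledge new writes after C is gone, A and B must agree up to the commit index; if B never catches up (for example, the snapshot is never applied), the commit index stalls and the Region stops accepting writes, which is exactly the situation described above.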
A region is a very small part of a large cluster. Even if a region is unwritable, it doesn’t necessarily mean the table is unwritable. So there’s no way to know if the two replicas A and B are consistent.
This is just from reading a small piece of the code; it’s hard to say whether there is other logic that produces a different effect from what I described.
Waiting for others to explain this logic further.
No data loss after 5 years, but what about 4 years? 3 years? 2 years? 1 year? 1 month? 1 day? How long counts as long? Without a landmark event that marks this cluster as reliable, it is unreliable. The longer the elapsed time, the lower the probability of data loss, but even events with probability 0 can happen.
The purpose of studying principles is for better control, better application, avoiding pitfalls, and providing better practice paths and experiences.
Based on this goal, it is definitely necessary to set scenarios; empty talk doesn’t have much meaning.
The scenario is that there is a cluster with 10+ TiKV nodes. In February, one TiKV node was taken offline (the operations personnel were inexperienced and directly destroyed the Docker container). Then a few days ago, another TiKV node was taken offline (again, directly destroying the Docker container). As a result, the service became unavailable.
After filtering the regions, we found 1000+ regions that had 2 of their replicas on these two nodes. After performing unsafe recovery, the service is now back to normal, but it is uncertain whether any data was lost. So I am raising this issue for discussion.
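For reference, this kind of filtering can be done from the JSON that `pd-ctl region` prints. Below is a minimal sketch, assuming the output has been saved to regions.json and that the two destroyed stores have IDs 4 and 5 (both values are hypothetical); it lists the Regions whose surviving replicas no longer form a Raft majority, i.e. the ones that need unsafe recovery:

```python
# A minimal sketch: find Regions that lost their Raft majority after two stores
# were destroyed. Assumes `pd-ctl region > regions.json` was run beforehand and
# that the failed store IDs are 4 and 5 -- both hypothetical, for illustration.
import json

FAILED_STORES = {4, 5}

with open("regions.json") as f:
    regions = json.load(f)["regions"]

for region in regions:
    peer_stores = [peer["store_id"] for peer in region["peers"]]
    lost = sum(1 for s in peer_stores if s in FAILED_STORES)
    # A Region loses quorum when its surviving replicas are no longer a majority.
    if len(peer_stores) - lost < len(peer_stores) // 2 + 1:
        print(region["id"], peer_stores)
```

Any Region printed here had lost its majority, which matches the unavailability described above.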
Of course, I definitely wasn’t the one who did this, but it did happen. So is this still just empty talk?
As for why the replicas weren’t replenished during all that time, I’m not sure either. That is what the store information and region information show.
If there is a concrete scenario, please explain that background when posting, to avoid unhelpful answers and speculation…
If it’s a production environment, would you really let a novice handle operations? And if the data is very important, the company wouldn’t dare to operate like this, right?
Moreover, the overall planning and usage methods should definitely be done before going live, including the processes and plans for operations and processing. You wouldn’t wait until a problem is discovered to find a solution, right? (Your clients or users can’t wait that long…)
But commercial subscription users are an exception because they are Vvvvvvvip users…