The more storage nodes a cluster has, the lower the business availability

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 集群存储节点越多, 业务可用性越低

| username: in-han

There is an article about the storage architecture of TiKV that discusses a problem with its multi-group Raft mode: regions are evenly distributed across the cluster, so the larger the cluster, the higher the probability of losing data (for example, when two TiKV nodes fail simultaneously in a three-replica setup).

It emphasizes two points:

  1. Data availability != business availability. For example, losing 20% of the data can hurt the business far more than 20%. If a business operation touches 5 keys, business availability drops to 0.8^5 ≈ 33%, i.e. roughly a 67% impact (see the sketch after this list). The impact is even greater for relational databases.
  2. It implies that the data distribution approach commonly used in the industry, as in TiDB and OceanBase, makes these systems toy-level projects (especially at large cluster scale).
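
To make point 1 concrete, here is a minimal sketch (my own illustration, not from the article) of how per-key availability compounds across a multi-key business operation:

```python
# If 20% of the keys are unavailable, a business operation touching k
# independent keys succeeds only when all k of its keys are still available.
data_availability = 0.80

for k in (1, 3, 5, 10):
    business_availability = data_availability ** k
    print(f"keys per operation: {k:2d} -> "
          f"business availability: {business_availability:.1%}")

# keys per operation:  5 -> business availability: 32.8%
# i.e. a 20% data loss turns into a ~67% business impact.
```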

My question is: since TiKV regions are randomly distributed, will deploying a large cluster reduce overall availability? Is there any plan to address this?

| username: zhanggame1 | Original post link

For TiDB, if the data is indeed very important, you can consider increasing the number of replicas from 3 to 5 to improve availability. The number of replicas can be increased for the entire cluster, or you can use Placement Rules in SQL to flexibly specify the number of replicas for a particular table and their locations.
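
As a back-of-envelope illustration of why going from 3 to 5 replicas helps (a sketch assuming independent replica failures with some probability p in a given window, which is a simplification):

```python
from math import comb

def quorum_loss_probability(replicas: int, p: float) -> float:
    """Probability that a majority of a Raft group's replicas fail,
    assuming each replica fails independently with probability p."""
    majority = replicas // 2 + 1
    return sum(
        comb(replicas, k) * p**k * (1 - p) ** (replicas - k)
        for k in range(majority, replicas + 1)
    )

p = 0.01  # assumed per-replica failure probability in a given window
for r in (3, 5, 7):
    print(f"{r} replicas -> P(quorum lost) ≈ {quorum_loss_probability(r, p):.2e}")
```

Under these assumptions each step from 3 to 5 to 7 replicas cuts the per-region quorum-loss probability by well over an order of magnitude, at the cost of extra storage and write amplification.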

The issue that the larger the cluster, the greater the probability of data loss is also covered in the official TiDB courses.

| username: xfworld | Original post link

Planning depends on how much the business values its data. The replica count is a choice, and so is the node count; they are two knobs to trade off against each other.

Additionally, the scheduling strategy mentioned above (Placement Rules) can express rules, per business scenario, about which nodes data is placed on.

If you have any doubts, it's better to try it out in practice. It's worth emphasizing that even with three replicas on three nodes, if two nodes fail, the data is not necessarily lost…

At that point, effective recovery steps are needed to bring the cluster back up and restore its availability…

| username: WinterLiu | Original post link

This should be avoidable through scheduling and replica-count control, such as pinning a region's replicas to specific hosts and increasing the number of replicas.

| username: zhanggame1 | Original post link

With 3 replicas, the Raft protocol returns success to the client once 2 replicas have persisted the write. If the write never reached the third replica and the 2 replicas that did persist it then go down, data will indeed be lost. Recovery from the surviving replica is possible, but data consistency becomes questionable.
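
A minimal toy walkthrough of that failure sequence (illustrative only, not TiKV code):

```python
# Toy model of a 3-replica Raft group: each replica is represented by the
# index of the last log entry it has persisted.
replicas = {"a": 0, "b": 0, "c": 0}

# A write is acknowledged to the client once a majority (2 of 3) has it.
replicas["a"] = 1  # leader persists entry 1
replicas["b"] = 1  # one follower persists it -> majority, client sees success
# replica "c" is still lagging at index 0

# The two up-to-date replicas fail before "c" catches up:
for node in ("a", "b"):
    del replicas[node]

# The only survivor never saw entry 1, so the acknowledged write is gone.
print(replicas)  # {'c': 0} -- forcing recovery from "c" loses entry 1
```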

| username: dba-kit | Original post link

+1, but indeed, the more machines in the cluster, the higher the probability of two nodes failing simultaneously. You can only evaluate whether cost or availability is more important and decide if you need to increase the number of replicas to solve this issue. However, speaking of which, all distributed databases have this problem. Even if you don’t use a distributed database, if the data volume is really large and you use thousands of MySQL instances to form a large cluster, the probability is actually the same. No architecture can escape this.
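
To put numbers on "the more machines, the higher the probability of two failing simultaneously" (a sketch with an assumed per-node failure probability p and independent failures; real failures are often correlated, which only makes this worse):

```python
def p_at_least_two_down(n: int, p: float) -> float:
    """Probability that at least 2 of n nodes are down at the same time,
    assuming independent failures with per-node probability p."""
    return 1 - (1 - p) ** n - n * p * (1 - p) ** (n - 1)

p = 0.01  # assumed probability that a node is down in a given window
for n in (3, 10, 100, 1000):
    print(f"{n:4d} nodes -> P(>=2 down) ≈ {p_at_least_two_down(n, p):.2%}")
# rises from ~0.03% at 3 nodes to ~26% at 100 nodes to ~100% at 1000 nodes
```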

| username: in-han | Original post link

Thank you all for your replies. The main point of the PolarDB article is how to contain the blast radius of failures in large-scale clusters. It gives an example of how, in a three-replica setup, the simultaneous failure of two machines can be handled so as to minimize or avoid the impact on region availability.

| username: h5n1 | Original post link

If two out of three replicas fail, any system that requires majority consensus cannot keep the region available, but that does not necessarily mean data loss. For example, TiDB has an unsafe recover feature to restore data from a single surviving replica. With five replicas, losing two is not a problem, and the question just shifts: what happens if three out of five replicas fail? Then consider seven replicas, and so on.

| username: in-han | Original post link

Oh, what I want to say is that with three replicas and N TiKV instances, the probability of two instances failing at the same time increases as N increases. When two instances fail simultaneously, the regions that had replicas on both of those machines become unavailable.

From the data-loss perspective, the probability of losing data likewise increases with N.

| username: h5n1 | Original post link

As N increases, the probability of 2 nodes failing increases, and with it the probability of some region becoming unavailable. Data loss is a separate question: if the one remaining region replica was fully synchronized with the leader, or was itself the leader, then no data is lost (assuming the recovery tool works correctly). If it was not fully synchronized, then data will be lost.

| username: zhanggame1 | Original post link

“The possibility of data loss increases as N increases.” Indeed, this is the case; overall reliability decreases as N increases.

| username: tidb菜鸟一只 | Original post link

The probability that 2 nodes fail out of 100 is certainly much higher than 2 failing out of 3. But when 2 of 3 nodes fail, every region loses 2 of its 3 replicas (probability 100%), whereas when 2 of 100 nodes fail, the probability that a given region has 2 of its 3 replicas on exactly those two nodes is much lower… So overall reliability does not necessarily decrease.
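
This trade-off can be quantified (a sketch assuming each region places its 3 replicas on 3 distinct nodes chosen uniformly at random): given that exactly 2 of N nodes are down, the probability that a particular region had 2 of its replicas on those nodes is (N-2)/C(N,3) = 6/(N(N-1)):

```python
from math import comb

def p_region_loses_quorum(n: int, replicas: int = 3, failed: int = 2) -> float:
    """Given exactly `failed` nodes down out of n, the probability that a
    region (with replicas on distinct, uniformly chosen nodes) has lost a
    majority of its replicas."""
    majority = replicas // 2 + 1  # replicas that must sit on the failed nodes
    return sum(
        comb(failed, k) * comb(n - failed, replicas - k)
        for k in range(majority, min(replicas, failed) + 1)
    ) / comb(n, replicas)

for n in (3, 5, 10, 100):
    print(f"{n:3d} nodes -> P(given region unavailable | 2 down) ≈ "
          f"{p_region_loses_quorum(n):.4%}")
# 100% at 3 nodes, 30% at 5, ~6.7% at 10, ~0.06% at 100
```

So the chance that some two nodes fail grows with N, while the fraction of regions any given pair of failures knocks out shrinks as 6/(N(N-1)); overall reliability depends on how these two effects net out.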

| username: Kongdom | Original post link

:thinking: Am I the only one confused? Replica count and node count are two different metrics, right?
If three replicas sit on three nodes and two nodes go down, the data becomes unavailable; but with three replicas spread across five nodes, two nodes down can still leave data available.
Seen that way, availability should be higher, right?

| username: 像风一样的男子 | Original post link

With three replicas on five nodes, if two nodes fail, the regions that had two of their three replicas on those nodes drop to a single replica and lose their majority (under uniform placement, roughly 30% of all regions), so the cluster can be considered half-paralyzed.

| username: Kongdom | Original post link

:joy: So if a node fails, we just ignore it? Even a rich landlord wouldn’t be this wasteful~

| username: 像风一样的男子 | Original post link

It’s hard to say. If the monitoring is inadequate and one node goes down without being noticed, and then another one fails after a while, it could be disastrous.

| username: TiDBer_小阿飞 | Original post link

I see everyone talking about odd-numbered nodes and odd-numbered replicas. What if it’s even-numbered nodes and odd-numbered replicas?

| username: Kongdom | Original post link

:wink: Because the official recommendation is an odd number~

| username: 像风一样的男子 | Original post link

As long as the number of nodes is greater than the number of replicas, it doesn’t matter whether the nodes are odd or even, right?

| username: zhanggame1 | Original post link

If two out of three replicas fail in TiDB, the affected data stops being served immediately. There is no such thing as keeping the business running on the lone surviving replica, because TiDB is strongly consistent: per the CAP theorem, preserving strong consistency under failures means giving up availability.