What is your design philosophy for operating TiDB?

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 大家运营tidb的设计哲学到底是如何的

| username: tidb狂热爱好者

When you deploy TiDB, do you run a separate cluster for each business or one shared cluster for all of them?
For example, with 30 physical machines, would you build 10 separate TiDB clusters, one for each of 10 departments, or one large TiDB cluster shared by everyone, with 10 accounts to manage the 10 businesses?
Refer to Zhen Huan Zhuan

| username: 数据小黑 | Original post link

If you add Placement Rules in SQL to the scenario you describe, can you get physical isolation between tenants? And if so, isn't a single cluster easier to manage? This is somewhat like the difference between virtualization solutions and cloud solutions, and the trend is clearly toward cloud.
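
To make the Placement Rules in SQL idea concrete, here is a minimal sketch that generates one placement policy per tenant and binds each tenant's database to it. It assumes every TiKV store has been started with a hypothetical `dept` label and that each tenant owns a database named `<tenant>_db`; both names are illustrative and not from this thread. Note that placement policies only govern where data lives on TiKV, so PD and the tidb-server layer would still be shared.

```python
def placement_sql(tenants):
    """Return TiDB statements that pin each tenant's database to its own stores.

    Assumption: TiKV stores carry a label such as dept=<tenant>. The label key
    and the <tenant>_db naming scheme are hypothetical.
    """
    statements = []
    for name in tenants:
        policy = f"policy_{name}"
        # One policy per tenant, restricted to stores labelled for that tenant.
        statements.append(
            f'CREATE PLACEMENT POLICY IF NOT EXISTS {policy} '
            f'CONSTRAINTS="[+dept={name}]";'
        )
        # Bind the tenant's database to the policy.
        statements.append(f"ALTER DATABASE {name}_db PLACEMENT POLICY={policy};")
    return statements


if __name__ == "__main__":
    for stmt in placement_sql(["sales", "risk"]):
        print(stmt)
```

The generated statements would still need to be reviewed and run by hand through any MySQL-compatible client.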

| username: realcp1018 | Original post link

Obviously, it’s the former. Physical isolation to mitigate risks is always the most reliable.

| username: tidb狂热爱好者 | Original post link

Please elaborate.

| username: Jellybean | Original post link

If resources are sufficient, meaning cost is not an issue, based on our experience, it is still best to deploy separately.

First, businesses with the same characteristics that can share a cluster should be placed in the same cluster where possible, which saves machine costs and operations effort. Different workloads such as pure AP, pure TP, and HTAP are probably best deployed separately. The main reasons for separate deployment are:

  1. Physical isolation confines any issue or failure to a single cluster, reducing the risk of widespread impact.

  2. Operations and maintenance are more cumbersome, but the risks are isolated, which pays off in the long run.

  3. Each cluster can be tailored to its business. AP, TP, and HTAP clusters can each have their own parameter tuning, upgrades, and other operations as needed. This avoids restarting a shared cluster for the sake of one business, which would affect every business and carry a high risk.

There are other factors, such as monitoring different businesses, cluster sensitivity, data security management, and 404 auditing, which are not detailed here.

Specific operations should be comprehensively evaluated based on business conditions, available resources, and leadership requirements.

| username: xfworld | Original post link

Once you find the balance point, it becomes easy to judge.

For example, on the isolation side, you can't let one business hog resources and affect the operation of the others…
On the stacking side, if certain scenarios use few resources, you can stack several of them together without running short of resources, which actually improves utilization.

What is your scenario?

| username: Kongdom | Original post link

Separate deployment is the best. Previously, we used virtual machines on servers, with one product divided into several virtual machines. Eventually, some product groups were allocated fewer resources than a laptop.
Separate, it must be separate.

| username: tidb狂热爱好者 | Original post link

This guy is being sarcastic.

| username: tidb狂热爱好者 | Original post link

One business split into 100 microservices.
It grinds to a halt, and then after everyone "graduates," their resumes all list microservices experience.

| username: tidb狂热爱好者 | Original post link

100 businesses using 100 TiDB instances
Everyone’s resume looks impressive

| username: TiDBer_pkQ5q1l0 | Original post link

Having multiple clusters increases maintenance costs, but the advantage is that each business does not affect the others. Having one large cluster means you only need to manage one cluster, which reduces maintenance costs. You just need to find a balance point based on the characteristics of the business.

| username: liuis | Original post link

It’s somewhat like our k8s clusters. Previously, each business department had its own k8s cluster, and the entire company had almost 200 clusters. Now, they have all been consolidated, making maintenance much easier.

| username: tidb狂热爱好者 | Original post link

This requires the team handling k8s to ensure the underlying availability. The underlying layer of TiDB does not need any maintenance.

| username: tony5413 | Original post link

The integration of multiple services should be the trend, as it allows for the reasonable utilization of resources.

| username: tony5413 | Original post link

Would you be more worried if 2 out of 3 physical machines in a cluster failed at the same time, or if 2 out of 30 physical machines in a cluster failed at the same time?

| username: realcp1018 | Original post link

Most of the time, it's the latter… The former affects at most 1/10 of the business. In the latter case, if 2 out of 3 PD nodes fail simultaneously, the whole cluster is done for. If you're lucky and it's TiKV instances that fail, some regions will be left with a single replica and unable to serve requests. Although in terms of data volume that also affects 1/10, it touches every service.
The core idea of many algorithms is divide and conquer, and the same applies here. In practice, scenarios suited to centralized deployment are extremely rare. The K8S philosophy fits stateless and share-nothing services better. Strictly speaking, TiDB is not share-nothing, because the cluster requires too much internal interaction.
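
A rough way to quantify the TiKV case is to ask what share of regions lose their Raft majority when 2 nodes fail. The sketch below is only a back-of-the-envelope model: it assumes each region keeps 3 replicas on distinct nodes chosen uniformly at random and that regions are placed independently, which real scheduling does not guarantee. The function names and the region count in the usage lines are illustrative.

```python
from math import comb


def p_region_loses_quorum(nodes, failed=2, replicas=3):
    """Probability that one region has a majority of its replicas on failed nodes.

    Assumption: the region's replicas sit on `replicas` distinct nodes picked
    uniformly at random out of `nodes` (a hypergeometric model).
    """
    majority = replicas // 2 + 1
    hits = sum(
        comb(replicas, k) * comb(nodes - replicas, failed - k)
        for k in range(majority, min(replicas, failed) + 1)
    )
    return hits / comb(nodes, failed)


def p_any_region_loses_quorum(nodes, regions, failed=2, replicas=3):
    """Probability that at least one of `regions` independent regions loses quorum."""
    p = p_region_loses_quorum(nodes, failed, replicas)
    return 1 - (1 - p) ** regions


if __name__ == "__main__":
    print(p_region_loses_quorum(3))    # 3-node cluster
    print(p_region_loses_quorum(30))   # 30-node cluster
    print(p_any_region_loses_quorum(30, regions=100_000))
```

Under these assumptions, a 3-node cluster loses quorum on every region, while in a 30-node cluster only a small fraction of regions is hit. With many regions, though, it is almost certain that some region is affected, and those regions can belong to any business sharing the cluster.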

| username: magic | Original post link

Actually, it still depends on whether the cluster can stably support the current data volume, whether it is highly available, and whether resource isolation is in place. From the perspective of clusters, it is quite normal for big data clusters to have dozens, hundreds, or even thousands of nodes. Ultimately, it comes down to stability…

| username: Jellybean | Original post link

Here’s another extreme example:
Assume there are 3 replicas. If 2 TiKV nodes become unavailable at the same time and the system metadata happens to have replicas on both of them, only one replica remains and that data becomes unavailable. That means the cluster metadata is unavailable and the entire cluster can no longer serve requests, game over…

So the more machines there are, the higher the probability of simultaneous node failures and the greater the risk. It is a low-probability event, but once it happens, it is a major failure.
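
The "more machines, higher probability" point can be sketched with a simple binomial model. This assumes every node is down independently with the same probability `p` at any given moment, which ignores correlated failures such as a shared rack or power feed; the value `p = 0.01` in the usage lines is purely illustrative.

```python
def p_at_least_two_down(nodes, p):
    """Probability that 2 or more of `nodes` machines are down at the same time.

    Assumption: each node is down independently with probability p.
    """
    q = 1 - p
    # 1 - P(no node down) - P(exactly one node down)
    return 1 - q ** nodes - nodes * p * q ** (nodes - 1)


def p_any_small_cluster_hit(clusters, nodes_per_cluster, p):
    """Probability that at least one of several independent small clusters
    has 2 or more nodes down at the same time."""
    single = p_at_least_two_down(nodes_per_cluster, p)
    return 1 - (1 - single) ** clusters


if __name__ == "__main__":
    p = 0.01  # illustrative per-node unavailability
    print(p_at_least_two_down(30, p))         # one 30-node cluster
    print(p_any_small_cluster_hit(10, 3, p))  # ten 3-node clusters
```

In this model, one 30-node cluster is more likely to see a double failure than any of ten 3-node clusters, and a double failure in a small cluster only takes down that one business.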

| username: xingzhenxiang | Original post link

All-in-one is not advisable. If each department only has three machines, it would be more convenient to use a MySQL cluster. Specific problems require specific analysis.

| username: Kongdom | Original post link

I suspect you installed surveillance in our company. Initially, we had many microservices, but later we merged them into two microservices: one for business and one for analysis. :wink: Since then, our resumes have included experience with microservices.