Discussion on the Total Resource Control Limit in TiDB

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB资源管控总量上限讨论

| username: TiDBer_yyy

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] 7.1.0

Question: how can I check the total resource control (RU) limit of a TiDB cluster?
In the Dashboard, the total consumption shown is 226K RU.

In Grafana monitoring, the maximum consumption over the last 30 days is 274K RU.

Questions:

  1. What is the RU limit of the cluster load?
  2. At the TiDB level, with HAProxy providing high availability, is the total RU calculated from the CPU/IO/network resources of all TiDB servers? If so, how should a resource group's RU be configured when the machines have different specifications?

Background:

  1. Need to plan resource isolation for production and investigate how to determine the total production resources.
  2. Using the official [Load Estimation Method], the total comes out to only 164K RU, which differs considerably from the figures above. Which one should be treated as the standard?
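For context, both estimation approaches mentioned here are exposed as SQL in later TiDB versions via the `CALIBRATE RESOURCE` statement (not available in v7.1, where the same estimation lives in the Dashboard). A sketch; the time window below is a made-up example:

```sql
-- Estimate cluster RU capacity from the actual load over a past window
-- (the window below is a hypothetical example).
CALIBRATE RESOURCE START_TIME '2024-01-01 10:00:00' DURATION '30m';

-- Estimate cluster RU capacity from hardware, under a workload profile.
CALIBRATE RESOURCE WORKLOAD OLTP_READ_WRITE;
```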
| username: dba远航 | Original post link

This depends on the actual production situation. First, the machine configuration, and second, the production load.

| username: TIDB-Learner | Original post link

I’ve been wanting to study this resource management recently, so I followed this thread.

| username: forever | Original post link

Shouldn’t we first estimate the concurrency, then conduct a stress test to evaluate the total resources, and add more if it’s not enough?

| username: TiDBer_yyy | Original post link

It shouldn’t work that way. To implement this, you need a proposal, and a proposal requires calculating the total capacity first. Doing it the other way around, not even a miracle could save it.

| username: TiDBer_yyy | Original post link

Yes, but without a total amount, it’s impossible to allocate resources to each business.

| username: forever | Original post link

You should first estimate the resources based on production, then perform a stress test on the current resources according to the expected production concurrency. If it meets the requirements, it’s okay; if not, add more resources.

| username: TiDBer_yyy | Original post link

When making a plan, there must be a basis; we can’t just make a wild guess.

| username: forever | Original post link

How is stress testing a wild guess? :sweat_smile:

| username: Jellybean | Original post link

Initially, the cluster's resource capacity can only be estimated automatically by the system from the hardware configuration to obtain a total RU figure. Based on that estimate, resource groups are created and allocated to tenants.

After the cluster has run for a while, an accurate total RU and consumption figure can be derived from the actual load over that period. At that point the administrator needs to step in and re-adjust the allocation across the tenants' resource groups.

You can first conduct some tests and observations in the test environment, then analogously estimate the machine conditions in the production environment and infer the required total RU amount.
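The allocation-and-adjustment cycle above looks roughly like this in SQL (the group name and RU values are made-up examples):

```sql
-- Create a resource group with an assumed quota of 10000 RU/sec.
CREATE RESOURCE GROUP IF NOT EXISTS rg_oltp RU_PER_SEC = 10000;

-- Bind a tenant/application user to the group.
ALTER USER 'app_user'@'%' RESOURCE GROUP rg_oltp;

-- After observing real consumption for a while, adjust the quota.
ALTER RESOURCE GROUP rg_oltp RU_PER_SEC = 15000;
```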

The good news is that the official team takes this seriously and is continuously pushing optimizations, covering the issues you mentioned as well as the scope and methods of resource control.

Let’s take it step by step.

| username: TiDBer_yyy | Original post link

Understood. Being “unable to determine the total capacity” during pre-launch planning has become a project bottleneck, blocking this feature from going live.

| username: oceanzhang | Original post link

Has there been any statistics on the maximum amount of data so far?

| username: TiDBer_yyy | Original post link

Data volume or resource capacity?

| username: TiDBer_yyy | Original post link

I hope the official team can provide an accurate answer.

| username: 连连看db | Original post link

I feel that OB’s resource management is a bit stronger than TiDB’s.

| username: TiDBer_yyy | Original post link

At present, the first step [assessing the total amount] cannot be carried out, let alone implementation.

| username: TiDBer_yyy | Original post link

Version 7.2 supports the Remaining RU View API, which can be viewed in Grafana.

Conclusion: Currently, version 7.1 does not support viewing Remaining RU or Total RU.

| username: 有猫万事足 | Original post link

  1. Estimating based on hardware resources is generally reliable. An RU value calculated from load can fluctuate significantly with load conditions over time; in my own practice, configuring RU based on the latter could crash TiKV.
  2. Note that even when you choose hardware-based estimation, selecting OLTP_READ_ONLY or OLTP_WRITE_ONLY as the evaluation method on that page yields RU estimates that differ significantly. These two values can roughly be read as the maximum read RU and the maximum write RU.
    From my observation, the former depends on the specification of a single TiDB server, while the latter depends on the number of TiKV instances.
    In other words, in most cases a TiDB cluster's maximum read RU and maximum write RU are not the same. In Grafana you can further see separate monitoring data for RRU and WRU.
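In SQL terms (on versions where the statement is available; on v7.1 the equivalent options are on the Dashboard estimation page), the two evaluation methods correspond roughly to:

```sql
-- Rough upper bound on read RU; mainly tied to a single TiDB server's spec.
CALIBRATE RESOURCE WORKLOAD OLTP_READ_ONLY;

-- Rough upper bound on write RU; mainly tied to the number of TiKV nodes.
CALIBRATE RESOURCE WORKLOAD OLTP_WRITE_ONLY;
```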
| username: TiDBer_yyy | Original post link

Thank you very much. I roughly understand.

In our case, we have a single cluster with 6 TiDB servers. When estimating the RU value from hardware, is only a single TiDB server taken into account?

Currently, we observe that when the result of the load-based evaluation is high, the TiKV CPU load is also relatively high.

In our case, the hardware-based estimate is much higher than the load-based one. Yet when actual RU consumption gets close to 2/3 of the hardware estimate, the TiKV cluster load is already quite high, around 90%.

| username: 有猫万事足 | Original post link

The read and write RU limits in the same cluster are generally inconsistent. From practical experience, based on hardware estimation → oltp_read_only, it roughly represents the upper limit of read RUs. Hardware estimation → oltp_write_only, can roughly be considered the upper limit of write RUs.

For example, in my cluster, the RU calculated by hardware estimation → oltp_read_only is only 4,000, while hardware estimation → oltp_write_only reaches 30,000. With batch imports it is easy to observe RU hitting 30,000; with only selects and no imports, RU hits 4,000 and struggles to go any higher without throttling.

Whether the user's load is read-heavy or write-heavy is therefore a crucial question, especially when the two limits differ this much: if I allocate only 4,000 RU to a write-heavy user, the import speed will certainly suffer. For mixed loads, it is essential to check Grafana → TiDB-Resource-Control → Resource Unit, where you can see RRU and WRU separately.
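When comparing the RRU/WRU panels against your current quotas, you can also check how the groups are configured from SQL; a sketch (the group name and value are made-up examples):

```sql
-- List all resource groups with their RU_PER_SEC / BURSTABLE settings.
SELECT * FROM information_schema.resource_groups;

-- Loosen a group for a write-heavy tenant (hypothetical name/value);
-- BURSTABLE lets it exceed its quota when spare capacity exists.
ALTER RESOURCE GROUP rg_batch RU_PER_SEC = 20000 BURSTABLE;
```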

My tests show the same thing. If I set the actual RU limit based on the load-based evaluation, the workload might indeed run that high, but at the cost of instability in TiDB or TiKV. So I did not adopt that evaluation method.