The test results of resource control management background tasks do not meet expectations

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 资源管控管理后台任务测试效果不符合预期

| username: Jellybean

[TiDB Usage Environment]
Test Environment

[TiDB Version]
v7.5.0
Local cluster started with tiup playground: 1 tidb + 1 pd + 1 tikv, only for functionality testing, not involving performance testing.

[Reproduction Path]
[Encountered Issue: Problem Phenomenon and Impact]
Expected test goal:
Both foreground tasks and background tasks use the default resource group.

When a task is marked as a background task, TiKV should dynamically limit that task's resource usage so that its impact on foreground tasks is minimized while it runs. In other words, background tasks should be recognized automatically and their resource consumption reduced.

  1. Start the local cluster with tiup; resource control is enabled.

    mysql> SHOW VARIABLES LIKE 'tidb_enable_resource_control';
    +------------------------------+-------+
    | Variable_name                | Value |
    +------------------------------+-------+
    | tidb_enable_resource_control | ON    |
    +------------------------------+-------+
    1 row in set (0.01 sec)

The TiKV parameter resource-control.enabled is left unmodified, at its default value true.
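For reference, the corresponding TiKV setting lives in the TiKV configuration file and defaults to enabled in v7.5, so nothing needs to be changed here; a minimal sketch of the relevant section (assuming the standard tikv.toml layout):

```toml
# tikv.toml — resource control is enabled by default; shown only for reference
[resource-control]
enabled = true
```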

  2. Prepare test data, in order to verify with lightning whether the background task control is effective.

    tiup bench tpcc -H 127.0.0.1 -P 4000 --user app_oltp -D tpcc --warehouses 100 --threads 20 prepare
    tiup dumpling -u root -h 127.0.0.1 -P 4000 -B tpcc --filetype sql -t 8 -o ./tpcc_data -r 200000 -F 256MiB

  3. Import with lightning, and verify the behavior without the background task set.

    mysql> SELECT * FROM information_schema.resource_groups;
    +----------+------------+----------+-----------+-------------+------------+
    | NAME     | RU_PER_SEC | PRIORITY | BURSTABLE | QUERY_LIMIT | BACKGROUND |
    +----------+------------+----------+-----------+-------------+------------+
    | default  | UNLIMITED  | MEDIUM   | YES       | NULL        | NULL       |
    | rg_olap  | 400        | MEDIUM   | YES       | NULL        | NULL       |
    | rg_oltp  | 1000       | HIGH     | YES       | NULL        | NULL       |
    | rg_other | 100        | MEDIUM   | NO        | NULL        | NULL       |
    +----------+------------+----------+-----------+-------------+------------+
    4 rows in set (0.01 sec)

tiup tidb-lightning -config tidb-lightning.toml
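(The tidb-lightning.toml used in the test is not shown in the post; a minimal sketch of such a config for a local physical-mode import, with placeholder paths and addresses, might look like the following.)

```toml
# Hypothetical tidb-lightning.toml for this local test — paths are placeholders
[lightning]
level = "info"

[tikv-importer]
backend = "local"              # physical import mode
sorted-kv-dir = "./sorted-kv"  # local staging directory for sorted KV pairs

[mydumper]
data-source-dir = "./tpcc_data"  # output directory of the dumpling step above

[tidb]
host = "127.0.0.1"
port = 4000
user = "root"
status-port = 10080
pd-addr = "127.0.0.1:2379"
```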
4. After the import completes, wait a few minutes for the cluster to stabilize, then delete the imported data.
5. Mark lightning imports as a background task.
mysql> ALTER RESOURCE GROUP default BACKGROUND=(TASK_TYPES='lightning');
Query OK, 0 rows affected (0.15 sec)

mysql> SELECT * FROM information_schema.resource_groups;
+----------+------------+----------+-----------+-------------+------------------------+
| NAME     | RU_PER_SEC | PRIORITY | BURSTABLE | QUERY_LIMIT | BACKGROUND             |
+----------+------------+----------+-----------+-------------+------------------------+
| default  | UNLIMITED  | MEDIUM   | YES       | NULL        | TASK_TYPES='lightning' |
| rg_olap  | 400        | MEDIUM   | YES       | NULL        | NULL                   |
| rg_oltp  | 1000       | HIGH     | YES       | NULL        | NULL                   |
| rg_other | 100        | MEDIUM   | NO        | NULL        | NULL                   |
+----------+------------+----------+-----------+-------------+------------------------+
4 rows in set (0.01 sec)

tiup tidb-lightning -config tidb-lightning.toml

  6. Wait for the import to complete and observe the resource consumption.

  7. Conclusion: the background task setting did not take effect. With or without it, lightning's resource consumption is almost the same, and its impact on foreground tasks is still considerable.

  8. Other scenarios were verified as well, with similar conclusions.
    mysql> ALTER RESOURCE GROUP default BACKGROUND=(TASK_TYPES='lightning,stats,ddl,br');
    Query OK, 0 rows affected (0.27 sec)

mysql> SELECT * FROM information_schema.resource_groups;
+----------+------------+----------+-----------+-------------+-------------------------------------+
| NAME     | RU_PER_SEC | PRIORITY | BURSTABLE | QUERY_LIMIT | BACKGROUND                          |
+----------+------------+----------+-----------+-------------+-------------------------------------+
| default  | UNLIMITED  | MEDIUM   | YES       | NULL        | TASK_TYPES='lightning,stats,ddl,br' |
| rg_olap  | 400        | MEDIUM   | YES       | NULL        | NULL                                |
| rg_oltp  | 1000       | HIGH     | YES       | NULL        | NULL                                |
| rg_other | 100        | MEDIUM   | NO        | NULL        | NULL                                |
+----------+------------+----------+-----------+-------------+-------------------------------------+
4 rows in set (0.01 sec)

  - DDL: verified via ADD INDEX as a background task; with or without the background task setting, there was no improvement.
  - Stats: verified via ANALYZE on a table with 33 million rows; setting the background task brought no improvement either, and it even squeezed foreground tasks further, causing a sharp drop in the resources available to them. (The supporting screenshot from the original post is not included in the translation.)


Questions:

  1. The actual results do not meet expectations. Is some key configuration missing?
  2. Has anyone else tested this feature? Did you see the same behavior, and how did you solve it?

| username: tidb狂热爱好者 | Original post link

Actually, I don't think resource control usually needs to be enabled.
Why? Because with MySQL-style sharding, CPU usage most of the time doesn't even reach 1%.
It can't auto-scale yet; once auto-scaling is enabled, that will take care of it.

| username: 有猫万事足 | Original post link

My understanding is as follows.
TASK_TYPES='lightning,stats,ddl,br' means these task types end up assigned to the default resource group, and since default's limit is UNLIMITED, this is effectively the same as having no restriction at all.

You might consider assigning these tasks to the rg_olap resource group and then see the results.

ALTER RESOURCE GROUP rg_olap BACKGROUND=(TASK_TYPES='lightning,stats,ddl,br');

| username: Jellybean | Original post link

“Currently, all background tasks for resource groups are by default bound to the default resource group for management. You can globally manage background task types through default. It is not yet supported to bind background tasks to other resource groups.”

The official limitation is that it can only be placed under default.
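For completeness: if someone did set BACKGROUND on a group while experimenting, the documented way to clear the setting is to assign NULL (a sketch, not taken from the thread; `rg_olap` here stands in for whatever group was modified):

```sql
-- Clear the background task configuration from a resource group
ALTER RESOURCE GROUP rg_olap BACKGROUND=NULL;
```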

| username: 裤衩儿飞上天 | Original post link

Haven’t tested it yet. Under the current circumstances, I think it would be reasonable to limit the RU_PER_SEC and PRIORITY of the default.
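A sketch of that suggestion (the values here are illustrative, not from the thread):

```sql
-- Illustrative only: cap the default group's request-unit rate
-- and lower its priority so it yields to other groups
ALTER RESOURCE GROUP default RU_PER_SEC = 2000 PRIORITY = LOW;
```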

| username: lilinghai | Original post link

Are both the foreground tasks and the background tasks using the default resource group?

| username: Jellybean | Original post link

Yes, both foreground and background tasks are under the default resource group. There are no other tasks running in the cluster, and the other resource groups are not in use.

| username: gaolei | Original post link

The background task management feature of Resource Control dynamically adjusts the resource quotas for background tasks based on the CPU/IO resource utilization of TiKV. Therefore, it relies on setting the correct quotas when deploying the cluster. If multiple components or instances need to be deployed on a single node, appropriate quotas should be set for each instance through cgroups. Hence, it is recommended to use the tiup cluster to deploy a standard cluster for testing if possible. Features that are sensitive to cluster resources, such as scheduling, usually do not perform well in a playground environment.
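As a hedged illustration of the per-instance quota idea mentioned above: TiUP's cluster topology file lets each instance be capped through systemd cgroup limits via a `resource_control` block. Hosts and values below are placeholders, not from the thread:

```yaml
# Hypothetical fragment of a tiup cluster topology file:
# cap a co-located TiKV instance so resource-control sees realistic quotas.
tikv_servers:
  - host: 10.0.1.1
    port: 20160
    status_port: 20180
    resource_control:
      memory_limit: "16G"   # maps to systemd MemoryLimit
      cpu_quota: "400%"     # maps to systemd CPUQuota (4 cores)
```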

| username: Jellybean | Original post link

Thank you for your response.

Using the playground was intended for quick functional verification, but the results might indeed be affected by factors like mixed deployment and single-machine interference. For tests that depend on TiKV's CPU/IO resource utilization, it is better to verify on a fully deployed cluster.

| username: dba远航 | Original post link

Since you are simulating everything on a single machine, the test results can hardly reflect the real effect.

| username: Jellybean | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.