Resource Control Failure

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 资源管控失效

| username: TiDBer_61dCWQAh

[TiDB Usage Environment] Production Environment
[TiDB Version] v7.5.1
[Reproduction Path]
Create a resource group
CREATE RESOURCE GROUP rg1 RU_PER_SEC=20000, PRIORITY=MEDIUM, QUERY_LIMIT=(EXEC_ELAPSED='1m0s' ACTION=KILL WATCH=SIMILAR DURATION='5m0s');

SQL that runs longer than 1 minute should be killed. This worked the first few times:
java.sql.SQLException: Query execution was interrupted, identified as runaway query

However, after a few times, resource control stops taking effect.
For example, this insert ran for about 72 seconds without being killed: dwm_robot_supplier_first1 execute insert calc result 717548 cost 72051ms

Moreover, there might be cases where a join query that originally takes only 4 seconds suddenly takes 1 hour and fully utilizes the CPU

Not sure if this is a system bug, please advise

| username: 有猫万事足 | Original post link

The above only creates a resource group; you still need to bind the resource group to the user.

Is it possible that some users are not bound to this resource group, so they are not subject to the restrictions of the configuration you mentioned?
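As a minimal sketch of the binding step described above: the user name 'app_user' is hypothetical, and rg1 is the group from the original post. The statements follow the syntax in the TiDB resource control documentation, so please verify them against your version.

```sql
-- Bind an existing user to the resource group ('app_user' is a placeholder).
ALTER USER 'app_user'@'%' RESOURCE GROUP rg1;

-- List which resource group each user is bound to.
SELECT user,
       JSON_EXTRACT(user_attributes, '$.resource_group') AS resource_group
FROM mysql.user;

-- From a session opened by that user, confirm the group actually in effect.
SELECT CURRENT_RESOURCE_GROUP();
```

Users with no resource group attribute run in the default group, so the QUERY_LIMIT of rg1 would not apply to them.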

Additionally, the documentation shows ways to view the RU consumption corresponding to SQL. It is recommended to share this information. It might help in pinpointing the issue.
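For reference, a rough sketch of two ways to look at RU consumption from the SQL client. The table name t is hypothetical, and the column names are taken from the statement summary documentation as I remember it, so treat them as assumptions to check on v7.5.1.

```sql
-- EXPLAIN ANALYZE really executes the statement; the execution info of the
-- root operator should include the RU consumed by this run.
EXPLAIN ANALYZE SELECT COUNT(*) FROM t;

-- Statement summary: average RU per statement digest, highest first.
SELECT DIGEST_TEXT,
       RESOURCE_GROUP,
       EXEC_COUNT,
       AVG_REQUEST_UNIT_READ,
       AVG_REQUEST_UNIT_WRITE
FROM information_schema.statements_summary
ORDER BY AVG_REQUEST_UNIT_READ + AVG_REQUEST_UNIT_WRITE DESC
LIMIT 10;
```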

| username: Jellybean | Original post link

The resource control function is managed through user accounts, implementing multi-tenant features. Therefore, after the original poster created the resource group, they still need to bind users to the resource group.

You can also use the show command to view the specific content that has been applied.
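A short sketch of the commands meant here, using the group name rg1 from the original post:

```sql
-- Show the definition of the resource group as TiDB stored it.
SHOW CREATE RESOURCE GROUP rg1;

-- The same information as a table, including the QUERY_LIMIT setting.
SELECT * FROM information_schema.resource_groups WHERE NAME = 'rg1';
```

If the QUERY_LIMIT column is empty or differs from what was intended, the runaway rule was not applied as written.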

| username: yiduoyunQ | Original post link

The runaway query feature is not GA yet; it is expected to reach GA in v8.1. You can open a GitHub issue, or wait and try it again after v8.1 is released.

| username: TiDBer_61dCWQAh | Original post link

In fact, the resource group has already been bound, so runaway can take effect initially.

Executed the SQL according to the link, screenshot as follows:

| username: TiDBer_61dCWQAh | Original post link

The resource group has already been bound, and it was effective at the beginning. The RU_PER_SEC=20000 configuration has always been usable, but the QUERY_LIMIT configuration is not effective. I have also looked for the show command, but maybe I didn’t find the correct one.
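To see whether QUERY_LIMIT is still triggering, one option is to look at the runaway records and the watch list. This is only a sketch: the table names are the ones documented for v7.x, and their columns have changed between versions, so SELECT * is used on purpose.

```sql
-- Queries that were identified as runaway for this resource group.
SELECT *
FROM mysql.tidb_runaway_queries
WHERE resource_group_name = 'rg1'
LIMIT 20;

-- Current watch list entries created by WATCH=SIMILAR (v7.3 and later).
SELECT *
FROM information_schema.runaway_watches
WHERE resource_group_name = 'rg1';
```

With DURATION='5m0s', a watch entry expires after five minutes. After that, a similar query has to exceed EXEC_ELAPSED again before it is identified.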

| username: TiDBer_61dCWQAh | Original post link

Sure, thank you for the guidance. Is there a release date for 8.1? Is it an LTS version?

| username: ShawnYan | Original post link

It is expected to be released this month and it is LTS. You can follow the board at release: v8.1.0-LTS · Issue #50784 · pingcap/tidb · GitHub

| username: Jellybean | Original post link

Check the slow query records for this statement and the resource group binding of the tenant (user) that runs it, then post both here so everyone can analyze them together.
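If the Dashboard is not available, the slow query records can also be pulled from the SQL client. A sketch, assuming the resource control fields documented for the slow log (Resource_group, Request_unit_read, Request_unit_write, Time_queued_by_rc) are present in your version:

```sql
-- Statements in rg1 that ran longer than the 60-second limit.
-- slow_query covers only the TiDB node you are connected to;
-- use information_schema.cluster_slow_query to cover all nodes.
SELECT Time,
       Query_time,
       Resource_group,
       Request_unit_read,
       Request_unit_write,
       Time_queued_by_rc,
       Query
FROM information_schema.slow_query
WHERE Resource_group = 'rg1'
  AND Query_time > 60
ORDER BY Query_time DESC
LIMIT 20;
```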

| username: TiDBer_61dCWQAh | Original post link

Nice! Upgrading from 7.5 should be compatible, right?

| username: TiDBer_61dCWQAh | Original post link

Sometimes the query runs so long that the server becomes unreachable and we have to restart it. The resource group binding is shown at the top of the thread.

| username: Jellybean | Original post link

Open the cluster's TiDB Dashboard, check the SQL statement analysis and slow query pages for the period when the problem occurred, and post the screenshots.

| username: TiDBer_61dCWQAh | Original post link

The web page cannot be opened; the cluster can only be accessed from the command line.

| username: ShawnYan | Original post link

Have you mapped the port? You definitely need to check the dashboard.

| username: dba-kit | Original post link

Check the resource group’s monitoring. Were the resource group’s tokens already used up at that time?

| username: dba-kit | Original post link

When the resource group tokens are exhausted, some small queries may indeed take longer to execute because they cannot obtain tokens.
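One way to check for this from the command line is the resource control queue time in the statement summary. A sketch only: AVG_QUEUED_RC_TIME and MAX_QUEUED_RC_TIME are the column names I recall from the documentation, so confirm that they exist on your version.

```sql
-- Statements in rg1 that spent the most time waiting for tokens.
SELECT DIGEST_TEXT,
       EXEC_COUNT,
       AVG_LATENCY,
       AVG_QUEUED_RC_TIME,
       MAX_QUEUED_RC_TIME
FROM information_schema.statements_summary
WHERE RESOURCE_GROUP = 'rg1'
ORDER BY MAX_QUEUED_RC_TIME DESC
LIMIT 10;
```

A large queue time relative to the total latency would suggest that the statement was waiting for RU rather than doing work.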

| username: 有猫万事足 | Original post link

You can see ru_comsumption=0.

If it’s not an issue with the binding between the resource group and the user, then it might be as mentioned above. The resource group’s RU is exhausted, and there’s simply no RU left to allocate, resulting in extremely long wait times.
Check if there is specific information in the execution plan.

| username: TiDBer_61dCWQAh | Original post link

The server strategy is like this.

| username: TiDBer_61dCWQAh | Original post link

If the tokens are exhausted, in theory the SQL should not be executing, so the CPU usage should not be that high, right?

| username: TiDBer_61dCWQAh | Original post link

I have encountered a situation where the tokens were exhausted before, and it took a long time, but it didn’t freeze. I unbound the resource group and re-bound it, and it worked fine.
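For anyone repeating the unbind and re-bind step, it can be sketched like this. The user name 'app_user' is hypothetical, and unbinding is done by moving the user back to the built-in default group.

```sql
-- Unbind: move the user back to the default resource group.
ALTER USER 'app_user'@'%' RESOURCE GROUP `default`;

-- Re-bind the user to rg1.
ALTER USER 'app_user'@'%' RESOURCE GROUP rg1;

-- Confirm the binding.
SELECT user,
       JSON_EXTRACT(user_attributes, '$.resource_group') AS resource_group
FROM mysql.user
WHERE user = 'app_user';
```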