Issues with Runaway Queries in Fault Handling Scenarios

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: Runaway Queries在故障处理情况下存在的问题 (Problems with Runaway Queries in fault-handling scenarios)

| username: 人如其名

From the documentation, Runaway Queries currently only supports identifying queries that exceed an execution-time threshold (EXEC_ELAPSED). However, in failure scenarios with many concurrent slow queries, each of the corresponding ACTION settings has problems:

  1. Set to COOLDOWN: Deprioritized queries run even longer and keep holding their connections, which might overwhelm the application connection pool, so this is generally not used in failure scenarios. It is more useful when only one or two resource-intensive batch queries run during business hours and need to be throttled so they do not disrupt normal operations.
  2. Set to KILL: This is the optimal handling method when a large number of concurrent “slow queries” occur during regular business hours.
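
For reference, here is a minimal SQL configuration sketch showing how the time threshold and ACTION are attached to a resource group. The group name rg_oltp and all values are hypothetical examples, not recommendations:

    -- Sketch: a resource group whose runaway rule triggers purely on elapsed
    -- time (the limitation discussed above); name and thresholds are examples.
    CREATE RESOURCE GROUP IF NOT EXISTS rg_oltp
      RU_PER_SEC = 2000
      QUERY_LIMIT = (EXEC_ELAPSED = '15s', ACTION = KILL);

    -- Switching to the gentler COOLDOWN action instead of KILL:
    ALTER RESOURCE GROUP rg_oltp
      QUERY_LIMIT = (EXEC_ELAPSED = '15s', ACTION = COOLDOWN);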

When ACTION=KILL is selected, consider the available WATCH matching methods:

  • EXACT:
    Only queries that exceed EXEC_ELAPSED are killed, along with subsequent queries whose SQL text is exactly identical to the watched statement.
    Advantages: TP queries without bind variables are mostly not killed during business hours (even if a few of them run long and are killed, the overall workload is not affected).
    Disadvantages: SQL that uses bind variables shares identical text, so once one statement is watched the whole group is killed uniformly, while SQL without bind variables has distinct text and only the individual statements that exceed EXEC_ELAPSED are killed. Therefore, if the front-end application keeps sending transactions, the database will still suffer high latency. This setting is rarely used in failure-handling scenarios.

  • SIMILAR:
    Once a query is caught and recorded in RUNAWAY_WATCHES, all queries with the same SQL fingerprint (digest) are blocked from execution for the specified time period.
    Advantages: It is easy to identify high-latency queries, and subsequent similar SQLs are directly blocked, significantly reducing database resource usage.
    Disadvantages: 1. Normal TP queries are easily caught by RUNAWAY_WATCHES. A common case is a few large queries saturating the unified read pool: a normal TP query that only reads a few rows via an index lookup can then exceed EXEC_ELAPSED because of a long total_suspend_time (queueing wait) and be misjudged as a runaway query, so TP queries get blocked and a slowdown turns into a larger failure. 2. If the slow queries are caused by a missing index, adding the index restores normal execution, yet the SQL is still blocked by its fingerprint; blocking by PLAN is better in this case.

  • PLAN:
    Once a query is caught and recorded in RUNAWAY_WATCHES, all queries with the same execution plan (plan digest) are blocked from execution for the specified time period.
    Advantages: Similar to SIMILAR, but after optimizing the execution plan by adding indexes or collecting statistics, the changed plan will not be blocked (though if system resources are still busy, high latency may still cause RUNAWAY_WATCHES to catch it).
    Disadvantages: Different SQL statements that happen to share the same PLAN might be killed by mistake (e.g., full table scans issued by a downstream extraction system). However, tests show that different SQL statements with the same execution plan still get different plan digests, so this issue should not occur in practice.

Therefore, to avoid query pile-up during a failure, ACTION=KILL with WATCH=SIMILAR or WATCH=PLAN is generally chosen. However, regardless of the WATCH choice, under resource pressure even normally low-latency SQL can jitter above the threshold and be killed repeatedly.
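
As an illustration of that choice, here is a hedged SQL sketch of setting ACTION=KILL with a WATCH and of pruning the watch list when a normal query gets caught by mistake. The resource group name rg_oltp, the thresholds, and the watch ID are hypothetical:

    -- Kill runaway queries and block the same execution plan for 10 minutes
    -- (use WATCH = SIMILAR to block by SQL fingerprint instead).
    ALTER RESOURCE GROUP rg_oltp
      QUERY_LIMIT = (EXEC_ELAPSED = '15s', ACTION = KILL, WATCH = PLAN DURATION = '10m');

    -- Inspect what is currently being blocked; if a normal TP query was caught
    -- only because of temporary resource contention, drop its watch entry by ID.
    SELECT * FROM information_schema.runaway_watches;
    QUERY WATCH REMOVE 1;   -- the numeric ID comes from the query above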

Recommendations:

  1. Improve query observability: break down overall execution time and add metrics such as CPU time and wait time (network wait, TiKV's total_suspend_time, IO wait, etc.). Judging runaway queries by dimensions that reflect actual execution cost, such as CPU time and resource consumption, would be more accurate.
  2. Add auxiliary means for a secondary judgment, such as checking the query's recent average execution time (if WATCH=PLAN, further filter by plan) before deciding whether to add it to RUNAWAY_WATCHES; a rough sketch of such a check is shown below.
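
A rough sketch of what such a secondary check could look like, assuming the suspicious statement is identified by its digest. The resource group name rg_oltp is hypothetical, the '...' placeholders must be filled in, and column names may differ slightly between TiDB versions:

    -- Check the statement's recent average latency per execution plan first:
    -- if it is normally fast, the current slowness is more likely caused by
    -- overall resource pressure than by the statement itself.
    SELECT DIGEST, PLAN_DIGEST, EXEC_COUNT,
           AVG_LATENCY / 1e9 AS avg_latency_s,   -- AVG_LATENCY is in nanoseconds
           AVG_PROCESSED_KEYS
    FROM   information_schema.cluster_statements_summary
    WHERE  DIGEST = '...';                       -- digest of the suspicious statement

    -- Only if a specific plan really is abnormal, watch it manually by plan digest:
    QUERY WATCH ADD RESOURCE GROUP rg_oltp ACTION KILL PLAN DIGEST '...';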

| username: Roger_Song | Original post link

Thank you for the analysis. Regarding the suggestions:

  1. In the next LTS we plan to include resource-oriented evaluation metrics. PROCESSED_KEYS and REQUEST_UNIT reflect resource consumption on the IO side, and in recent versions these two metrics can already be observed in STATEMENTS_SUMMARY and the slow log, so users can make judgments based on them (see the sketch after this list). Going forward, we plan to incorporate CPU metrics into the system tables for observation and then consider providing corresponding runaway judgment criteria.

  2. This is a good direction, similar to automatically maintaining a WATCH list based on policies. After workload persistence, we will think about how to improve this experience.
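
For reference, a hedged sketch of where these metrics can be checked today. The RU-related columns only exist in recent versions, and their exact names may differ between releases:

    -- Per-statement aggregates of scanned keys:
    SELECT DIGEST_TEXT, EXEC_COUNT, AVG_PROCESSED_KEYS, AVG_TOTAL_KEYS
    FROM   information_schema.statements_summary
    ORDER  BY AVG_PROCESSED_KEYS DESC
    LIMIT  10;

    -- Per-execution detail from the slow log, including request units
    -- (Request_unit_read / Request_unit_write are only present in newer versions):
    SELECT Time, Query, Process_keys, Total_keys,
           Request_unit_read, Request_unit_write
    FROM   information_schema.slow_query
    WHERE  Time > NOW() - INTERVAL 30 MINUTE
    ORDER  BY Request_unit_read + Request_unit_write DESC
    LIMIT  10;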

| username: h5n1 | Original post link

Can this feature be integrated into the Dashboard so that it can be operated through a graphical interface?

| username: Roger_Song | Original post link

There are plans to add a Watch option for each SQL statement on the SQL Statements page, similar to the quick binding feature.

| username: 人如其名 | Original post link

From the slow log text, there is RU-related information, but this information is not available in the dashboard.

Could this be considered a small bug in the dashboard?

| username: ziptoam | Original post link

A very in-depth analysis, worth learning from.