TiDB long-running queries

Application environment:

Production

TiDB version: v4.0.14

Problem:

There are situations when queries that take a long time to execute cause high CPU usage and heavy I/O, which tends to become a bottleneck for the TiDB cluster. Is there a way to find such long-running queries, and any way to kill them, in TiDB v4.0.14?

To identify and terminate long-running queries in TiDB v4.0.14 that cause high CPU and I/O usage, follow this detailed action plan:

Step 1: Monitoring Techniques

  1. Enable Slow Query Log: Ensure that the slow query log is enabled. It records SQL statements whose execution time exceeds a configurable threshold (300 milliseconds by default), which you can adjust with the tidb_slow_log_threshold variable (see the sketch after this list).

  2. Use TiDB Dashboard: Access the TiDB Dashboard to analyze slow queries. Navigate to the Slow Queries page to view and filter slow queries based on time range, databases, and SQL keywords.

  3. Query Slow Logs: Use SQL to query the INFORMATION_SCHEMA.SLOW_QUERY table (or CLUSTER_SLOW_QUERY, which covers all nodes). For example, to count slow queries per TiDB node since a given time, use:

    SELECT instance, count(*) FROM information_schema.cluster_slow_query WHERE time >= "2025-02-17 00:00:00" AND time < NOW() GROUP BY instance;
    
  4. Analyze Execution Plans: Use the EXPLAIN ANALYZE statement to see the actual execution plan and identify bottlenecks in query execution (see the sketch after this list).

  5. Use pt-query-digest: Analyze the slow query logs with pt-query-digest to get a detailed report on query performance.
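
A minimal sketch for item 1, checking and adjusting the slow-log threshold from a SQL client; the 200 ms value below is only an illustrative assumption, not a recommendation:

    -- Check the current slow-log threshold (in milliseconds)
    SELECT @@tidb_slow_log_threshold;

    -- Lower the threshold on this TiDB instance, e.g. to 200 ms (illustrative value)
    SET tidb_slow_log_threshold = 200;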
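
A minimal sketch for item 4; the table and column names (orders, customer_id) are hypothetical placeholders:

    -- Executes the statement and reports per-operator execution time and actual row counts
    EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;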

Step 2: Resource Analysis

  1. Monitor CPU and I/O Usage: Use the Grafana dashboards deployed with the cluster to monitor CPU and I/O usage. Check TiDB and TiKV CPU usage and I/O throughput to identify resource bottlenecks.

  2. Identify Hotspots: Look for unbalanced CPU usage or read/write hotspots in the cluster. This can indicate queries that are consuming excessive resources.

  3. Check Key Metrics: Focus on slow-log fields such as Process_time and Wait_time to identify queries that spend the most time on, or waiting for, TiKV (see the sketch after this list).
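
A hedged sketch for item 3, ranking recent statements in the slow log by TiKV processing time; the one-hour window and the LIMIT are arbitrary assumptions:

    -- Top 10 recent statements by coprocessor processing time (window and limit are illustrative)
    SELECT query, query_time, process_time, wait_time
    FROM information_schema.slow_query
    WHERE time > NOW() - INTERVAL 1 HOUR
    ORDER BY process_time DESC
    LIMIT 10;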

Step 3: Termination Methods

  1. Set oom-action: Configure the oom-action parameter in the TiDB configuration file to either log or cancel SQL execution when a query's memory usage exceeds the mem-quota-query limit. This helps prevent long-running queries from consuming excessive memory (see the sketch after this list).

  2. Use ADMIN Commands: Use the ADMIN SHOW SLOW command to identify and analyze the slowest queries. For example:

    ADMIN SHOW SLOW TOP 10;
    
  3. Terminate Problematic Queries: If a query is identified as problematic and consuming excessive resources, you can terminate it with the KILL TIDB statement. Obtain the connection ID from the slow query log or from SHOW PROCESSLIST, and then, while connected to the TiDB node that is running the query, execute:

    KILL TIDB <connection_id>;
    
  4. Adjust Query Execution Plans: If a query is identified as inefficient, consider optimizing it by adding indexes or rewriting the query to improve performance (see the sketch after this list).
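
For item 1, a minimal sketch of the memory-quota side, using the tidb_mem_quota_query session variable; the 1 GiB value is only an illustrative assumption:

    -- Cap per-query memory for the current session at 1 GiB (illustrative value).
    -- Combined with oom-action = "cancel" in the TiDB configuration file, statements
    -- that exceed this quota are cancelled rather than risking an OOM.
    SET tidb_mem_quota_query = 1073741824;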
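
For item 4, a minimal illustration of adding an index; the table and column names (orders, customer_id) are hypothetical placeholders:

    -- Add an index so the filter no longer requires a full table scan
    ALTER TABLE orders ADD INDEX idx_customer_id (customer_id);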

By following these steps, you can effectively monitor, analyze, and terminate long-running queries in TiDB v4.0.14, ensuring optimal performance and resource utilization.

Hi @Pranit_Rahangdale, slow queries will be recorded. You can refer to the archived documentation for v4.0.14 below.

The two links above are for checking queries after they have finished executing; if the SQL is still executing, you need to query INFORMATION_SCHEMA.CLUSTER_PROCESSLIST instead.
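
For instance, a minimal sketch of listing statements that are still running; the 60-second cutoff is an arbitrary assumption:

    -- Currently executing statements that have been running longer than 60 seconds (cutoff is illustrative)
    SELECT instance, id, user, db, time, info
    FROM information_schema.cluster_processlist
    WHERE info IS NOT NULL AND time > 60
    ORDER BY time DESC;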

To terminate: you can use KILL TIDB <process ID>. However, a limitation in v4.0.14 is that you can only terminate queries on the TiDB node you are connected to. For example, if the TiDB nodes are 11, 22, and 33, you can connect to node 11 to kill queries running on node 11, but you cannot connect to node 22 and kill queries running on node 11. This means you cannot connect to the cluster through a load balancer IP; you must connect to the specific TiDB node's IP.

@Hazel Is there any workaround to kill long-running queries across all the TiDB nodes? In our production environment we have 5 TiDB nodes and frequently hit unoptimized read and write queries. We need to be able to kill these queries from anywhere in the TiDB cluster. If there is any guardrail around this, can you please suggest one?

Hi @Pranit_Rahangdale You can check the ‘information_schema.cluster_processlist’ system table section in the PROCESSLIST documentation.
But please note that you still need to connect to the host shown for the query (its INSTANCE) in order to kill it.
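
Putting it together, a hedged sketch of the full workflow under the v4.0 limitation described above; the node address and connection ID are illustrative values:

    -- 1. From any node, find the long-running statement and the TiDB node it runs on
    SELECT instance, id, time, info
    FROM information_schema.cluster_processlist
    WHERE info IS NOT NULL
    ORDER BY time DESC
    LIMIT 10;

    -- Suppose this returns instance = '10.0.0.11:4000' and id = 427 (illustrative values).

    -- 2. Connect directly to that TiDB node (10.0.0.11), not the load balancer,
    --    and terminate the connection there:
    KILL TIDB 427;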