TiKV Anomalies

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv 异常

| username: 表渣渣渣

[TiDB Usage Environment] Production environment
[TiDB Version] 5.4.0
[Reproduction Path] No reproduction steps; nothing abnormal had been done in the production environment
[Encountered Problem: Problem Phenomenon and Impact] All queries are abnormal, and the server CPU is fully occupied by the tikv-server process
[Resource Configuration] 3 servers with 64 GB memory each, 3 TiKV nodes

Shortly after, the data platform reported errors and all interfaces timed out. Running an arbitrary SQL query by hand also timed out.

Checking the server resources showed that only one of the three TiKV servers had its CPU fully occupied; the other two were normal. On that server, the CPU was being consumed entirely by the tikv-server process.

Checked the TiKV error log:
[ERROR] [kv.rs:1167] ["KvService response batch commands fail"] [err="\"SendError(…)\""]

Then checked the queries currently running on TiDB:
SHOW PROCESSLIST;
and killed the TiDB queries, but it had no effect.
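
For reference, a minimal sketch of how such long-running queries can be located and killed cluster-wide; the 60-second threshold is an arbitrary example, and the connection ID in KILL TIDB is hypothetical (the statement must be sent to the TiDB instance that owns the connection unless global kill is enabled):

-- Find queries that have been running for more than 60 seconds across the cluster
SELECT instance, id, user, db, time, LEFT(info, 100) AS query_snippet
FROM INFORMATION_SCHEMA.CLUSTER_PROCESSLIST
WHERE command != 'Sleep' AND time > 60
ORDER BY time DESC;

-- Kill one query by its connection ID (123456 is a placeholder)
KILL TIDB 123456;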

As an emergency measure, restarted the TiKV component on that single node.
After the restart, no further anomalies were observed.

The log file is quite large, and the following errors repeat throughout:
[2023/04/26 15:26:53.404 +08:00] [ERROR] [kv.rs:1167] ["KvService response batch commands fail"] [err="\"SendError(…)\""]
[2023/04/26 15:26:53.404 +08:00] [WARN] [endpoint.rs:606] [error-response] [err="Coprocessor task canceled due to exceeding max pending tasks"]
[2023/04/26 15:26:53.404 +08:00] [ERROR] [kv.rs:1167] ["KvService response batch commands fail"] [err="\"SendError(…)\""]
[2023/04/26 15:26:53.404 +08:00] [WARN] [endpoint.rs:606] [error-response] [err="Coprocessor task canceled due to exceeding max pending tasks"]

tikv.log (46.9 MB)
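
The "Coprocessor task canceled due to exceeding max pending tasks" warning indicates the TiKV coprocessor read pool was saturated. As a rough sketch (assuming the default setting names under readpool.coprocessor), the current limits can be inspected from any TiDB session:

-- Show the coprocessor read pool settings reported by every TiKV instance
SHOW CONFIG WHERE TYPE = 'tikv' AND NAME LIKE 'readpool.coprocessor%';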

| username: tidb狂热爱好者 | Original post link

Stop all TiDB servers.

| username: tidb狂热爱好者 | Original post link

Your cluster was brought down by slow SQL.

| username: tidb狂热爱好者 | Original post link

After restarting TiDB, immediately connect using the MySQL client and analyze the SQL:

SELECT FLOOR(UNIX_TIMESTAMP(MIN(summary_begin_time))) AS agg_begin_time, 
       FLOOR(UNIX_TIMESTAMP(MAX(summary_end_time))) AS agg_end_time, 
       ANY_VALUE(digest_text) AS agg_digest_text, 
       ANY_VALUE(digest) AS agg_digest, 
       SUM(exec_count) AS agg_exec_count, 
       SUM(sum_latency) AS agg_sum_latency, 
       MAX(max_latency) AS agg_max_latency, 
       MIN(min_latency) AS agg_min_latency, 
       CAST(SUM(exec_count * avg_latency) / SUM(exec_count) AS SIGNED) AS agg_avg_latency, 
       CAST(SUM(exec_count * avg_mem) / SUM(exec_count) AS SIGNED) AS agg_avg_mem, 
       MAX(max_mem) AS agg_max_mem, 
       ANY_VALUE(schema_name) AS agg_schema_name, 
       ANY_VALUE(plan_digest) AS agg_plan_digest, 
       query_sample_text, 
       index_names 
FROM `INFORMATION_SCHEMA`.`CLUSTER_STATEMENTS_SUMMARY_HISTORY` 
WHERE index_names IS NULL AND query_sample_text > '' 
GROUP BY schema_name, digest 
ORDER BY agg_sum_latency DESC 
LIMIT 1;

This can help identify the slowest statements that are missing an index so you can add one. I have a tool for automatically adding indexes in TiDB.
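
If the statements summary has already been evicted and the query above returns nothing, a similar check can be run against the slow query log; this is only a sketch, and the one-hour window is an arbitrary example:

-- Most expensive recent statements from the cluster-wide slow query log
SELECT time, instance, query_time, digest, LEFT(query, 100) AS query_snippet
FROM INFORMATION_SCHEMA.CLUSTER_SLOW_QUERY
WHERE time > NOW() - INTERVAL 1 HOUR
ORDER BY query_time DESC
LIMIT 10;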

| username: 表渣渣渣 | Original post link

The query returned no data.