Note:
This topic has been translated from a Chinese forum by GPT and might contain errors. Original topic: TiDB high-TPS table statistics becoming invalid
Background
This week, on the 23rd and 25th, starting at midnight, a log business application's connection pool was exhausted, so new read and write requests failed and the application logged numerous errors. This application both reads from and writes to TiDB; the TiDB version is 5.1.4.
Analysis
- Grafana performance monitoring showed that, starting at midnight, P999 latency soared to around 2 minutes and CPU utilization on the TiKV nodes was nearly saturated. There had been no recent deployments on the application side (the application had been running stably for several months).
- From the TiDB dashboard heatmap, it was found that starting from midnight on the 23rd, the read volume of the xxx_23 and xxx_25 tables soared, with no anomalies at other times.
- In the slow query log, updates to the xxx_25 table were taking 7-8 minutes (normally they complete within tens of milliseconds). The execution plan showed the update doing a full table scan, even though a composite index exists on (xxx_id, xxx2_id) and the table holds millions of rows. Multiple concurrent updates that missed the index effectively locked the whole table, producing lock waits that drove latency up further.
UPDATE
xxxx_25
SET
execute_status = 2,
msg = ''
WHERE
xxx_id = 6342
AND xxx2_id = 'xxxx';
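To verify whether the composite index is being chosen, the plan for the statement above can be inspected directly (a healthy plan should show an index access such as IndexLookUp rather than TableFullScan):

```sql
-- Check the current plan for the problematic UPDATE; with valid
-- statistics it should use the composite index on (xxx_id, xxx2_id)
-- instead of performing a TableFullScan.
EXPLAIN UPDATE xxxx_25
SET execute_status = 2,
    msg = ''
WHERE xxx_id = 6342
  AND xxx2_id = 'xxxx';
```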
- Automatic statistics collection is configured to run from 01:00 to 08:00, a window that does not cover midnight, when the spikes began.
mysql> show global variables like '%tidb_auto_analyze%';
+------------------------------+-------------+
| Variable_name | Value |
+------------------------------+-------------+
| tidb_auto_analyze_end_time | 08:00 +0800 |
| tidb_auto_analyze_ratio | 0.5 |
| tidb_auto_analyze_start_time | 01:00 +0800 |
+------------------------------+-------------+
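Since the traffic spike starts at midnight but the auto-analyze window only opens at 01:00, one option (a sketch; tune the values against your own workload) is to widen the window and lower the trigger ratio so analysis can fire earlier and more often:

```sql
-- Allow auto analyze to run around the clock instead of 01:00-08:00.
SET GLOBAL tidb_auto_analyze_start_time = '00:00 +0800';
SET GLOBAL tidb_auto_analyze_end_time   = '23:59 +0800';
-- Default ratio 0.5 means analyze fires only after ~50% of rows have
-- been modified; 0.3 triggers it after 30% of rows change.
SET GLOBAL tidb_auto_analyze_ratio = 0.3;
```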
- This business application uses a table-per-day scheme (xxx_01 through xxx_31) with heavy update and insert volume, and the auto-increment ID is generated sequentially. TiDB's statistics collection behavior version is 1.
mysql> show global variables like '%tidb_analyze_%';
+----------------------+-------+
| Variable_name | Value |
+----------------------+-------+
| tidb_analyze_version | 1 |
+----------------------+-------+
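TiDB 5.1 also supports statistics version 2 (`tidb_analyze_version = 2`), which uses different algorithms for estimating NDV and top-N values. Whether it helps in this workload should be tested first; note the setting only affects statistics collected after the change, so existing statistics keep their old version until the next ANALYZE:

```sql
-- Switch to statistics version 2 (affects only future ANALYZE runs;
-- validate on a test cluster before applying in production).
SET GLOBAL tidb_analyze_version = 2;
```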
Handling
- After manually collecting statistics for the table, its read/write volume and latency returned to normal. A scheduled task was then added to collect statistics separately for the high-TPS tables.
ANALYZE TABLE xxxx_25;
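To monitor whether statistics are going stale between scheduled runs, table statistics health can be inspected (a score near 100 means fresh statistics; tables whose health falls below `(1 - tidb_auto_analyze_ratio) * 100` become candidates for auto analyze):

```sql
-- Inspect statistics freshness for the hot table.
SHOW STATS_HEALTHY WHERE Table_name = 'xxxx_25';
-- Row counts and modify counts behind the health score.
SHOW STATS_META WHERE Table_name = 'xxxx_25';
```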
- The question: besides scheduling separate statistics collection for high-TPS tables, is there any other way to ensure that statistics are collected automatically and correctly?