Note:
This topic has been translated from a Chinese forum by GPT and might contain errors. Original topic: TiDB high-TPS table statistics becoming invalid
Background
This week, on the 23rd and 25th, starting at midnight, a log business application's connection pool was exhausted, so new read and write requests failed and the application logged numerous errors. This application both reads from and writes to TiDB; the TiDB version is 5.1.4.
Analysis
- Grafana performance monitoring showed that, starting at midnight, P999 latency soared to around 2 minutes and CPU utilization on the TiKV nodes was nearly saturated. There had been no recent deployments on the application side (the application had been running stably for several months).
- From the TiDB dashboard heatmap, it was found that starting from midnight on the 23rd, the read volume of the xxx_23 and xxx_25 tables soared, with no anomalies at other times.
- In the slow query log, updates to the xxx_25 table were taking 7-8 minutes (normally they complete within tens of milliseconds). The execution plan showed the update doing a full table scan, even though a composite index exists on (xxx_id, xxx2_id) and the table holds millions of rows. Multiple concurrent updates that missed the index effectively locked the whole table, producing lock waits that drove latency up further.
UPDATE
xxxx_25
SET
execute_status = 2,
msg = ''
WHERE
xxx_id = 6342
AND xxx2_id = 'xxxx';
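To verify whether the composite index is being chosen, the plan for the statement above can be inspected directly (a healthy plan should show an index access such as IndexLookUp rather than TableFullScan):

```sql
-- Check the current plan for the problematic UPDATE; with valid
-- statistics it should use the composite index on (xxx_id, xxx2_id)
-- instead of performing a TableFullScan.
EXPLAIN UPDATE xxxx_25
SET execute_status = 2,
    msg = ''
WHERE xxx_id = 6342
  AND xxx2_id = 'xxxx';
```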
- Automatic statistics collection is configured to run from 01:00 to 08:00, a window that does not cover midnight, when the spikes began.
mysql> show global variables like '%tidb_auto_analyze%';
+------------------------------+-------------+
| Variable_name | Value |
+------------------------------+-------------+
| tidb_auto_analyze_end_time | 08:00 +0800 |
| tidb_auto_analyze_ratio | 0.5 |
| tidb_auto_analyze_start_time | 01:00 +0800 |
+------------------------------+-------------+
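Since the traffic spike starts at midnight but the auto-analyze window only opens at 01:00, one option (a sketch; tune the values against your own workload) is to widen the window and lower the trigger ratio so analysis can fire earlier and more often:

```sql
-- Allow auto analyze to run around the clock instead of 01:00-08:00.
SET GLOBAL tidb_auto_analyze_start_time = '00:00 +0800';
SET GLOBAL tidb_auto_analyze_end_time   = '23:59 +0800';
-- Default ratio 0.5 means analyze fires only after ~50% of rows have
-- been modified; 0.3 triggers it after 30% of rows change.
SET GLOBAL tidb_auto_analyze_ratio = 0.3;
```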
- This business application uses a table-per-day scheme (xxx_01 through xxx_31) with heavy update and insert volume, and the auto-increment ID is generated sequentially. TiDB's statistics collection behavior version is 1.
mysql> show global variables like '%tidb_analyze_%';
+----------------------+-------+
| Variable_name | Value |
+----------------------+-------+
| tidb_analyze_version | 1 |
+----------------------+-------+
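TiDB 5.1 also supports statistics version 2 (`tidb_analyze_version = 2`), which uses different algorithms for estimating NDV and top-N values. Whether it helps in this workload should be tested first; note the setting only affects statistics collected after the change, so existing statistics keep their old version until the next ANALYZE:

```sql
-- Switch to statistics version 2 (affects only future ANALYZE runs;
-- validate on a test cluster before applying in production).
SET GLOBAL tidb_analyze_version = 2;
```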
Handling
- After manually collecting statistics for the table, its read/write volume and latency returned to normal. A scheduled task was then added to collect statistics separately for the high-TPS tables.
ANALYZE TABLE xxxx_25;
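To monitor whether statistics are going stale between scheduled runs, table statistics health can be inspected (a score near 100 means fresh statistics; tables whose health falls below `(1 - tidb_auto_analyze_ratio) * 100` become candidates for auto analyze):

```sql
-- Inspect statistics freshness for the hot table.
SHOW STATS_HEALTHY WHERE Table_name = 'xxxx_25';
-- Row counts and modify counts behind the health score.
SHOW STATS_META WHERE Table_name = 'xxxx_25';
```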
- The question: besides scheduling separate statistics collection for high-TPS tables, is there any other way to ensure that statistics are collected automatically and correctly?