Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: The TiDB cluster is hit by a large number of query requests, IO read is very high, and restarting the entire cluster does not bring TiKV IO read down
The TiDB cluster is receiving a large number of query requests, which is driving the IO read on the three TiKV nodes very high. Restarting the entire cluster does not bring the TiKV IO read down.
First, look at the slow SQL.
I think the problem might be related to the configuration of the TiDB cluster. You can check the configuration files of PD, TiKV, and TiDB to see if there are any issues, and check the logs of each component for more detailed error information.
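If it helps, on v4.0 and later the configuration can also be inspected from SQL instead of the files; a rough sketch (the filters below are just examples):

-- Inspect component configuration from SQL
mysql> show config where type = 'tikv' and name like 'storage%';
mysql> show config where type = 'tidb' and name like '%slow%';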
I shut down the gateway and the entire TiDB cluster for a few minutes, then restarted the cluster. Once the gateway was reopened, TiDB requests came back and TiKV's IO read started climbing again.
Okay, I’ll check the slow SQL.
Queries are generating a large amount of disk I/O. First, check the slow queries and optimize the indexes. Second, look at the TiKV panels in Grafana to see where the time is going. Check whether there are any configuration issues, and think about whether any other operations are running at the same time.
Looking at the dashboard for slow SQL and optimizing slow SQL is the solution; restarting cannot solve the problem.
Check slow SQL in the dashboard, investigate SQL with a large number of requests one by one.
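As a rough starting point, the heaviest statements can also be pulled out of the slow query log with SQL; a sketch against information_schema.cluster_slow_query (the time window and LIMIT are just examples):

-- Top 10 statement digests from the last hour, by number of executions
mysql> select digest, count(*) as cnt, avg(query_time) as avg_time, max(query) as sample_sql
    -> from information_schema.cluster_slow_query
    -> where time > now() - interval 1 hour
    -> group by digest
    -> order by cnt desc
    -> limit 10;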
The issue from yesterday has been resolved. The cause was a business table with 6 million rows whose statistics health was low, and a large number of queries on the business side were joining against this table. After running ANALYZE TABLE table_name to re-collect statistics on the table, the IO read on TiKV dropped immediately.
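For reference, this is roughly what that looks like (the database and table names below are placeholders); SHOW STATS_HEALTHY is a quick way to spot tables whose statistics have drifted:

-- A health value well below 100 suggests stale statistics
mysql> show stats_healthy where Db_name = 'your_db' and Table_name = 'your_table';
-- Re-collect statistics on the problem table
mysql> analyze table your_db.your_table;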
The parameters that need to be adjusted are as follows:
-- 1. Enable the automatic statistics collection feature:
mysql> show variables like 'tidb_enable_auto_analyze';
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| tidb_enable_auto_analyze | ON    |
+--------------------------+-------+
1 row in set (0.00 sec)
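-- If the variable shows OFF in your environment, it can presumably be enabled with:
mysql> set global tidb_enable_auto_analyze = ON;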
-- 2. Adjust the tidb_auto_analyze_ratio parameter from the original 0.7 to 0.3, so that automatic ANALYZE is triggered once more than 30% of a table's rows have been modified.
mysql> show variables like 'tidb_auto_analyze_ratio';
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| tidb_auto_analyze_ratio | 0.3   |
+-------------------------+-------+
1 row in set (0.00 sec)
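-- The SET statement for this step is not shown above; it would be something like:
mysql> set global tidb_auto_analyze_ratio = 0.3;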
-- 3. Set the auto-analyze window in Beijing time (+0800), so automatic collection runs between 00:00 and 08:00.
mysql> set global tidb_auto_analyze_start_time='00:00 +0800';
Query OK, 0 rows affected (0.01 sec)
mysql> set global tidb_auto_analyze_end_time ='08:00 +0800';
Query OK, 0 rows affected (0.03 sec)
I remember there is a parameter that can automatically kill SQL statements that run past a timeout.
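That's probably max_execution_time; in TiDB the value is in milliseconds and 0 means no limit. A sketch, with the 60-second threshold just an example:

-- Kill any statement that runs longer than 60 seconds
mysql> set global max_execution_time = 60000;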
First, check the slow SQL on the dashboard, as well as the indexes, and the statistics information of the slow SQL table, etc.
A low health score causing the optimizer to pick the wrong index is quite common. ANALYZE is one solution; another is to bind the execution plan, so the table's health score no longer affects index selection.
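A minimal sketch of a binding, assuming a hypothetical table t with an index idx_c on column c:

-- Pin the plan for this query pattern to idx_c regardless of statistics health
mysql> create global binding for
    ->   select * from t where c = 1
    -> using
    ->   select /*+ USE_INDEX(t, idx_c) */ * from t where c = 1;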