How to Handle the Issue of a TiKV Node's IO Being Fully Occupied

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 一个tikv节点的io占满,如何处理这个问题 (How to handle a TiKV node's IO being fully occupied)

| username: TiDBer_Y2d2kiJh

[TiDB Usage Environment] Production Environment
[TiDB Version] v5.4.0 (2 TiDB, 3 PD, 3 TiKV)
[Reproduction Path] One TiKV's IO is constantly full, causing all business queries to slow down; all SELECT statements have been killed.

| username: TiDBer_oHSwKxOH | Original post link

Post the Top SQL view you see on the dashboard.
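If Top SQL isn't showing data, it may need to be enabled first; in v5.4 the feature is experimental. A minimal sketch, assuming TiDB listens on the default port 4000 and you can reach it with the mysql client:

```
# Enable Top SQL data collection (experimental in v5.4), then open
# TiDB Dashboard -> Top SQL and filter by the affected TiKV instance.
mysql -h <tidb-host> -P 4000 -u root -p -e "SET GLOBAL tidb_enable_top_sql = 1;"
```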

| username: TiDBer_oHSwKxOH | Original post link

See, the visual interface (the dashboard) shows you what is triggering it.

| username: TiDBer_Y2d2kiJh | Original post link

Is there any way to reduce the IO of this TiKV first?

| username: TiDBer_Y2d2kiJh | Original post link

Currently, the business query volume is not high.

| username: TiDBer_oHSwKxOH | Original post link

From what I can see, your workload is mostly statistical (analytical). If you also run transactional (TP) settlement business on the same cluster, the two should be separated.

| username: TiDBer_oHSwKxOH | Original post link

The image is not visible.

| username: TiDBer_oHSwKxOH | Original post link

Please post the machine configuration in the thread. You can't get anywhere with a regular mechanical disk, and in the cloud the storage type matters a great deal.

| username: 像风一样的男子 | Original post link

Is only this one TiKV's CPU high while the others are normal? Check this TiKV's logs for errors.
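A quick way to scan that node's recent log for problems; a sketch, assuming a common tiup deployment log path (adjust to your deploy directory):

```
# Scan the tail of the TiKV log for errors and warnings.
# /tidb-deploy/tikv-20160/log/tikv.log is a common tiup default; adjust as needed.
grep -iE '\[ERROR\]|\[WARN\]' /tidb-deploy/tikv-20160/log/tikv.log | tail -n 50
```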

| username: ShawnYan | Original post link

Did Boss Jiang use a secondary account?


The post is missing the machine configuration and a concrete description of the phenomenon.
Is only this one TiKV slow, or are the other TiKVs slow as well? Are you sure it's this SQL causing it?
Is it a typical read hotspot issue?
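One way to check for a read hotspot from SQL is the `INFORMATION_SCHEMA.TIDB_HOT_REGIONS` table, which lists the current hot Regions; a sketch, again assuming the default TiDB port 4000:

```
# List the current hot read Regions and the tables they belong to.
mysql -h <tidb-host> -P 4000 -u root -p -e \
  "SELECT * FROM information_schema.TIDB_HOT_REGIONS WHERE TYPE = 'read' LIMIT 10;"
```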

| username: 有猫万事足 | Original post link

Using Top SQL to see which statements are consuming IO, and then optimizing them, is the reliable approach.

However, it seems you are not concerned with the cause and just want to reduce the IO immediately. :joy:
In that case, evicting the leaders from this TiKV might be more reliable.
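Evicting leaders moves read/write leader traffic off the node without moving any data. A sketch using pd-ctl via tiup, assuming PD is reachable at <pd-host>:2379; get the store id from the first command:

```
# Find the store id of the overloaded TiKV.
tiup ctl:v5.4.0 pd -u http://<pd-host>:2379 store

# Move all Region leaders off that store (replace 1 with the real store id).
tiup ctl:v5.4.0 pd -u http://<pd-host>:2379 scheduler add evict-leader-scheduler 1

# Later, remove the scheduler so leaders can balance back.
tiup ctl:v5.4.0 pd -u http://<pd-host>:2379 scheduler remove evict-leader-scheduler-1
```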

| username: zhanggame1 | Original post link

First, confirm whether the disk I/O is caused by the TiKV process and not by other services running on it. Then check the IOPS and read/write volume to see if the I/O is indeed high and not due to a hard disk issue.
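A sketch of the standard OS-level checks (iostat and pidstat come with the sysstat package):

```
# Per-device IOPS, throughput, and utilization; watch r/s, w/s, and %util.
iostat -x 1 5

# Per-process disk read/write rates, to confirm the IO really belongs to tikv-server.
pidstat -d 1 5
```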

| username: redgame | Original post link

Check the status of the TiKV nodes to confirm if there are any hardware failures or network issues, as these could lead to sustained I/O saturation.
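For a quick health pass over the cluster and the host, something like the following (assuming a tiup-managed cluster named mycluster; dmesg needs root):

```
# Component status (Up / Down / Disconnected) for every node in the cluster.
tiup cluster display mycluster

# Kernel messages often surface failing disks, controller resets, or NIC errors.
dmesg -T | grep -iE 'error|fail|reset' | tail -n 20
```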

| username: TiDBer_vfJBUcxl | Original post link

Use the iotop command to observe which process has a high I/O percentage.
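For example (iotop needs root):

```
# Show only processes currently doing IO, aggregated per process (not per thread).
iotop -oP
```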

| username: TiDBer_oHSwKxOH | Original post link

With a distributed database like TiDB, you can either do addition (improve machine performance, add nodes) or subtraction (reduce business reads and optimize SQL).

| username: TiDBer_Y2d2kiJh | Original post link

Before restarting the cluster, I took a dumpling backup (since the data volume is small). While the dumpling backup was running, the IO on this node dropped back down, and I don't know why.

| username: 有猫万事足 | Original post link

A bit mysterious. :joy: But as long as it solves the problem, it’s good.

| username: TiDBer_Y2d2kiJh | Original post link

The same situation as yesterday has occurred again.

| username: 扬仔_tidb | Original post link

1: First, determine whether it's a system-level failure.
2: Use Top SQL to check whether some execution plan is scanning too many rows.
3: Check whether a large volume of writes or queries is creating hot data; this can be found in the dashboard.
4: See whether a TiKV parameter helps (TiKV Configuration File Description | PingCAP archived documentation); a sketch follows below.
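If background compaction/flush IO turns out to be the culprit, one candidate from that configuration file is the RocksDB rate limiter; a sketch, assuming a tiup-managed cluster named mycluster (the 100MB value is purely illustrative, not a recommendation):

```
# Open the cluster configuration for editing.
tiup cluster edit-config mycluster

# Under server_configs.tikv, cap RocksDB background IO, for example:
#   server_configs:
#     tikv:
#       rocksdb.rate-bytes-per-sec: "100MB"

# Apply the change to the TiKV nodes.
tiup cluster reload mycluster -R tikv
```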

| username: TiDB_C罗 | Original post link

  1. Confirm which process is causing the I/O.
  2. Then address it accordingly.
  3. If it's TiDB, the dashboard heatmap (Key Visualizer) gives a clear view.