Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: 磁盘IO一直很高 (Disk IO has been consistently high)
[TiDB Usage Environment] Production Environment
[TiDB Version] v4.0.12
[Reproduction Path] Operations performed that led to the issue
[Encountered Issue: Problem Phenomenon and Impact]
Currently, IO usage on the TiKV nodes is consistently high, as shown in the first image. As the second image shows, the individual TiKV nodes have varying levels of IO usage. There are over 100 TiKV nodes in total, but only about a dozen consistently show high IO usage. Following the official troubleshooting manual did not clearly reveal any hot-read or hot-write issues.
The business impact is slow write operations, with no reported issues on read operations.
None of the large tables have auto-increment primary keys or sharding. It seems like the issue might be due to concentrated writes on large tables, leading to uneven data distribution. Has anyone encountered a similar issue?
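For reference, one way to check whether writes really are concentrated on a few Regions of a large table is to look at the Region list and its written bytes. This is only a sketch; the table name `db1.big_table` below is a placeholder for one of the actual large tables.

```shell
# Placeholder table db1.big_table; replace with an actual large table.
# SHOW TABLE ... REGIONS lists each Region with its leader store and recent
# written bytes, which makes a single-store write hotspot easy to spot.
mysql -h <tidb-host> -P 4000 -u root -p -e "SHOW TABLE db1.big_table REGIONS\G"

# The same data is available from information_schema, sorted by write volume:
mysql -h <tidb-host> -P 4000 -u root -p -e "
  SELECT REGION_ID, DB_NAME, TABLE_NAME, WRITTEN_BYTES, APPROXIMATE_SIZE
  FROM information_schema.TIKV_REGION_STATUS
  WHERE DB_NAME = 'db1' AND TABLE_NAME = 'big_table'
  ORDER BY WRITTEN_BYTES DESC
  LIMIT 10;"
```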
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]
The TiKV duration is very high.
[Screenshot]
Take a look at TiKV-Details -> RocksDB - kv.
It has compact flow, write flow, and read flow; check what each of these values is. There is also pending_compact_bytes. Together these reflect how much the disk is reading and writing.
There are also write duration and similar metrics.
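For reference, those Grafana panels are drawn from TiKV's Prometheus metrics, so the raw numbers can also be pulled directly. The metric names below are from memory and may differ between versions; if they do not match, copy the exact expression from the panel's Edit view. The Prometheus address is a placeholder.

```shell
PROM=http://<prometheus-host>:9090   # placeholder Prometheus address

# Pending compaction bytes per TiKV instance (metric name assumed, see above):
curl -s "$PROM/api/v1/query" \
  --data-urlencode 'query=tikv_engine_pending_compaction_bytes{db="kv"}'

# Per-instance read/write/compaction flow of the kv RocksDB over the last minute
# (metric name assumed as well):
curl -s "$PROM/api/v1/query" \
  --data-urlencode 'query=sum(rate(tikv_engine_flow_bytes{db="kv"}[1m])) by (instance, type)'
```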
[Screenshot]
That makes it clearer: the scan traffic is relatively high.
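For what it's worth, the statement summary tables in v4.0 can show which SQL digests scan the most keys; the column names below are from memory, so double-check them against your version.

```shell
# Top statements by average keys scanned per execution (column names assumed).
mysql -h <tidb-host> -P 4000 -u root -p -e "
  SELECT DIGEST_TEXT, EXEC_COUNT, AVG_PROCESSED_KEYS, AVG_TOTAL_KEYS
  FROM information_schema.CLUSTER_STATEMENTS_SUMMARY
  ORDER BY AVG_PROCESSED_KEYS DESC
  LIMIT 10;"
```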
Our node deployment differs from the official manual: the file system is XFS, and the mount options do not include nodelalloc and noatime. Would these two options have a significant impact on IO?
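For reference, nodelalloc is an ext4 mount option (the TiDB deployment docs recommend ext4 with nodelalloc,noatime), so it cannot be set on XFS; noatime works on either file system and mainly avoids extra metadata writes on reads. A rough sketch of how to check and set the options, with placeholder devices and mount points:

```shell
# Check which options the TiKV data disk is currently mounted with
# (mount point /data1 is a placeholder).
mount | grep /data1

# Example /etc/fstab entry following the TiDB deployment docs (ext4 case);
# replace the UUID and mount point with your own. On XFS only noatime applies.
# UUID=<your-uuid>  /data1  ext4  defaults,nodelalloc,noatime  0 2

# Remount and verify after editing fstab:
umount /data1 && mount -a
mount | grep /data1
```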
How is the disk read/write speed?
[Screenshot]
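If it helps, a quick way to measure the raw capability of the data disk is fio, run against a file on the TiKV data path, preferably in a low-traffic window since it generates real load. The parameters below are only illustrative, and iostat shows the live per-device utilization.

```shell
# Mixed random read/write test on the data disk (path and size are placeholders).
fio --name=tikv-disk-test \
    --filename=/data1/fio_test_file \
    --size=10G \
    --direct=1 \
    --ioengine=libaio \
    --bs=32k \
    --rw=randrw \
    --rwmixread=70 \
    --iodepth=4 \
    --numjobs=4 \
    --runtime=60 \
    --time_based \
    --group_reporting

# Live device-level utilization, await, and throughput, refreshed every second:
iostat -x 1 10
```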
It looks like this is caused by reads. You can analyze the slow logs of the tidb-server instances and the slow logs of TiKV. By correlating the slow queries with the corresponding Regions, you can basically determine which ones are responsible.
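For example, the slow logs of all tidb-server instances can be queried centrally through information_schema (CLUSTER_SLOW_QUERY exists in v4.0); the columns mirror the slow-log fields, so adjust the names if your version differs.

```shell
# Slowest statements in the last hour, with the number of keys they touched,
# which helps map them back to hot tables and Regions.
mysql -h <tidb-host> -P 4000 -u root -p -e "
  SELECT Time, Query_time, Process_keys, Total_keys, DB, Query
  FROM information_schema.CLUSTER_SLOW_QUERY
  WHERE Time > NOW() - INTERVAL 1 HOUR
  ORDER BY Query_time DESC
  LIMIT 20;"
```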
If the disk I/O of TiKV is consistently high, it may be due to the following reasons (a command sketch for these checks follows the list):
- Hotspot reads/writes: if the data on a TiKV node is read or written very frequently, disk I/O on that node will be high. Use the pd-ctl tool to check hotspot information with `pd-ctl hot read` and `pd-ctl hot write`. If a particular Region turns out to be a read/write hotspot, consider splitting that Region or adjusting the replica distribution to reduce the load on that node.
- Large table writes: if writes are concentrated on a TiKV node, its disk I/O will be high. Check the status of a suspect store with `pd-ctl store <store_id>` and its write flow with `pd-ctl hot write`. If one TiKV node carries much more write traffic than the others, consider scaling out or sharding the data to reduce its load.
- Insufficient disk space: if a TiKV node is running low on disk space, disk I/O can also rise. Check disk usage with `df -h`. If a node is short on space, consider scaling out or cleaning up unnecessary data on the disk.
- Insufficient disk performance: if the disk on a TiKV node is too slow, it will show up as high disk I/O utilization. Check the disk's I/O performance with `iostat`. If a node's disks are underpowered, consider upgrading that node or replacing the disks with faster ones.
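A minimal sketch of the checks above, assuming default ports; the PD address, store ID, and data path are placeholders.

```shell
# Hotspot Regions, broken down by store:
pd-ctl -u http://<pd-host>:2379 hot read
pd-ctl -u http://<pd-host>:2379 hot write

# Status of a suspect store (capacity, available space, leader/region counts):
pd-ctl -u http://<pd-host>:2379 store <store_id>

# Disk space and live I/O on the suspect TiKV host:
df -h /data1
iostat -x 1 5
```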
Thank you all for your help, the issue has been identified.
It was caused by the upstream Flink usage. We had been using stream processing for data ingestion; after switching to batch processing, disk IO has dropped significantly and write performance has improved by more than a hundredfold.
The next step is to optimize how we use Flink: if we keep using stream processing directly, the disk IO pressure on the TiDB side is still too high.