How to Identify What TiKV is Doing When TiKV IO Utilization is Fully Loaded

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv IO utilization 满载,如何定位tikv在干什么

| username: TiDB_C罗

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version]
[Reproduction Path] What operations were performed that caused the issue
[Encountered Issue: Problem Phenomenon and Impact]
[Resource Configuration]
IO utilization is close to 100%


[Screenshots: monitoring panels for tidb duration, slowlog, gc, grpc, rocksdb compact, and scheduling tasks]

| username: TiDB_C罗 | Original post link

I turned off GC

set global tidb_gc_enable=off;

but there is still no downward trend.
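
A quick way to double-check the GC state from SQL (a sketch; mysql.tidb and the tikv_gc_* keys are standard in TiDB, but the exact rows returned depend on the version):

-- confirm GC is disabled and see the last run time / safe point
SELECT variable_name, variable_value
FROM mysql.tidb
WHERE variable_name LIKE 'tikv_gc%';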

| username: 像风一样的男子 | Original post link

Is the overall database latency high? Are there many slow queries?

| username: TiDB_C罗 | Original post link

The duration and slowlog graphs do not move in step with the TiKV IO curve.

| username: zhanggame1 | Original post link

Let’s check the actual IO speed.

| username: 像风一样的男子 | Original post link

In the dashboard, there is a TopSQL feature that allows you to see the SQL resource usage of each node.
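
If you prefer SQL to the dashboard, a rough equivalent is to sort the statement summary by accumulated latency; a sketch (information_schema.cluster_statements_summary is a standard table, and the LIMIT here is arbitrary):

-- statements that consumed the most total time in the current summary window
SELECT digest_text, exec_count, sum_latency, avg_latency
FROM information_schema.cluster_statements_summary
ORDER BY sum_latency DESC
LIMIT 10;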

| username: wzf0072 | Original post link

Check the slow queries on the dashboard to see which SQL statements the system is executing and whether there are large tables being analyzed.
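
To see whether any analyze jobs are running at the moment, a quick check (SHOW ANALYZE STATUS is standard TiDB syntax):

-- lists running and recently finished ANALYZE jobs with their progress
SHOW ANALYZE STATUS;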

| username: TiDB_C罗 | Original post link

Analyze has been ruled out; I also added a begin/end execution time range.

| username: TiDB_C罗 | Original post link

The result of perf top

| username: 小龙虾爱大龙虾 | Original post link

Based on my personal experience, I generally don’t pay much attention to that metric: on SSDs it is often pegged at 100%. Pay more attention to throughput, IOPS, read/write response time, and similar metrics. If you want to see what kind of IO TiKV is writing, you can check TiKV-detail → IO breakdown.

| username: h5n1 | Original post link

Looking at the QPS, it has increased, but strangely, the latency has decreased.

| username: 裤衩儿飞上天 | Original post link

Is this node running other services?

| username: 路在何chu | Original post link

Check the disk latency; sometimes IO util is not accurate.

| username: buptzhoutian | Original post link

Even if it is 100% for a long time, it does not directly indicate that the disk is “fully loaded.”

There is a dedicated Dashboard for displaying disk metrics called Disk-Performance. This Dashboard also has a graph for Disk IO Utilization, and the official documentation provides an explanation:

Shows disk Utilization as percent of the time when there was at least one IO request in flight. It is designed to match utilization available in iostat tool. It is not a very good measure of true IO Capacity Utilization. Consider looking at IO latency and Disk Load Graphs instead.

The metrics used in this graph come from node_exporter’s node_disk_io_time_seconds_total, and the expression used by TiDB is:

rate(node_disk_io_time_seconds_total[$interval])

This result corresponds to the %util column in the iostat tool, as described in the manual:

man iostat

%util
    Percentage of elapsed time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100% for devices serving requests serially. But for devices serving requests in parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits.
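
If %util is misleading, per-device IO latency can be derived from the same node_exporter metrics; a sketch in the same PromQL style (standard node_exporter metric names; the 1m range is arbitrary):

# average time per completed read / write request, in seconds
rate(node_disk_read_time_seconds_total[1m])  / rate(node_disk_reads_completed_total[1m])
rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m])
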
| username: 路在何chu | Original post link

We are using AWS disks here, and the IO utilization is almost always full.

| username: 小龙虾爱大龙虾 | Original post link

That can still be normal. When QPS is high, the 999 line (P999 latency) is no longer enough to show how the slower SQL statements are doing, but those slow-running queries may not have disappeared. :grinning:

| username: dba远航 | Original post link

Check what the top IO consumers are doing.

| username: zhanggame1 | Original post link

This metric is not very accurate; you need to check whether there is actually an IO bottleneck.

| username: 随缘天空 | Original post link

Try setting raftstore.sync-log to false and observe if this affects the IO situation.

| username: Kongdom | Original post link

  1. Check the slow queries on the dashboard.
  2. Use show full processlist to check the current processes (see the sketch after this list).
  3. This is something we often encounter with our mechanical hard drives; they inherently have low I/O read and write speeds and are often fully utilized.
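
For point 2, a cluster-wide alternative to show full processlist (information_schema.cluster_processlist is a standard table; the 30-second threshold is just an example):

-- long-running statements across all TiDB instances
SELECT instance, id, user, db, time, info
FROM information_schema.cluster_processlist
WHERE time > 30
ORDER BY time DESC;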