High IO Utilization on a Specific TiDB Node

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb某一个节点io util利用率很高

| username: 张小凉1

[TiDB Usage Environment] Production Environment
[TiDB Version] v7.1.2
[Encountered Problem: Phenomenon and Impact] The IO utilization (io util) of one node in the TiDB cluster has suddenly become very high
[Resource Configuration] The cluster has 3 nodes in total; each node hosts 1 PD, 1 TiKV, and 1 tidb-server
[Attachment: Screenshot/Log/Monitoring]
[Image: Enterprise WeChat Screenshot_16989807706621]

| username: 像风一样的男子 | Original post link

Generally, tidb-server itself does very little disk reading or writing. Check whether the logs in its log directory have grown significantly.
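A minimal sketch of that check, assuming Python is available on the node. The helper name `summarize_log_dir` and the ".log" name filter are illustrative assumptions; pass in whatever log directory your deployment actually uses, and run it on each node to compare.

```python
import os


def summarize_log_dir(path, marker=".log"):
    """Return (file_count, total_bytes, entries) for log files directly in `path`.

    `entries` is a list of (name, size_bytes, mtime) sorted newest first.
    `marker` is an assumed filter: only file names containing it are counted.
    """
    entries = []
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.isfile(full) and marker in name:
            st = os.stat(full)
            entries.append((name, st.st_size, st.st_mtime))
    entries.sort(key=lambda e: e[2], reverse=True)
    total = sum(size for _, size, _ in entries)
    return len(entries), total, entries


if __name__ == "__main__":
    import sys

    # The log directory is supplied by the caller; no path is assumed here.
    count, total, entries = summarize_log_dir(sys.argv[1])
    print(f"{count} log files, {total / 1024 / 1024:.1f} MiB in total")
    for name, size, _ in entries[:10]:
        print(f"{size:>12d}  {name}")
```

A node with noticeably more files or a much larger total than its peers is a hint that log writing contributes to the disk load.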

| username: h5n1 | Original post link

Which disk shows the high utilization, and are you looking at the node exporter or the disk performance dashboard? Which component is deployed on that disk?

| username: 张小凉1 | Original post link

This node has about 9 more log files than the other two nodes.

| username: 张小凉1 | Original post link

This node has an NVMe disk attached, and it runs 1 PD, 1 tidb-server, and 1 TiKV.

| username: h5n1 | Original post link

Can you tell from the monitoring graph when the spike started? Also, check whether “thread cpu → unified pool” in the TiKV-Details dashboard has increased.

| username: 张小凉1 | Original post link

It has probably been like this for a while; I can’t find the exact point in time when it spiked.

| username: h5n1 | Original post link

Are there any export or backup tasks running?

| username: 张小凉1 | Original post link

Yes, and they do connect to this node.

| username: 张小凉1 | Original post link

However, they only run on a schedule.

| username: Fly-bird | Original post link

Check which service is consuming high CPU.

| username: 像风一样的男子 | Original post link

I misread it: I thought it was a dedicated TiDB node, but it turns out to be a mixed deployment.

| username: 张小凉1 | Original post link

tidb-server is occupying a high amount of resources.

| username: tidb菜鸟一只 | Original post link

Check whether it is caused by automatic statistics collection. Search the tidb-server logs for auto analyze entries.
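The log search could be sketched like this in Python. The substring "auto analyze" (matched case-insensitively) is an assumption about how the relevant entries are worded, and the helper names are illustrative; the only format detail relied on is the bracketed timestamp at the start of each TiDB log line.

```python
import re
from collections import Counter

# Assumed search phrase; adjust it if your log entries are worded differently.
AUTO_ANALYZE = re.compile(r"auto analyze", re.IGNORECASE)
# TiDB log lines begin with a timestamp such as [2023/11/03 10:15:02.123 +08:00].
HOUR_PREFIX = re.compile(r"^\[(\d{4}/\d{2}/\d{2} \d{2})")


def find_auto_analyze(lines):
    """Return the log lines that mention auto analyze."""
    return [line.rstrip("\n") for line in lines if AUTO_ANALYZE.search(line)]


def count_by_hour(lines):
    """Count matching lines per 'YYYY/MM/DD HH' bucket."""
    counts = Counter()
    for line in find_auto_analyze(lines):
        m = HOUR_PREFIX.match(line)
        if m:
            counts[m.group(1)] += 1
    return dict(counts)


if __name__ == "__main__":
    import sys

    # The tidb-server log file path is supplied by the caller.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        for hour, n in sorted(count_by_hour(f).items()):
            print(hour, n)
```

If the hourly counts line up with the periods of high IO utilization, statistics collection is a plausible cause.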

| username: zhanggame1 | Original post link

In a mixed deployment, you can use the iotop command to see which component is using the disk the most and its actual IO figures.
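If iotop is not installed, a rough equivalent can be sketched in Python by sampling the Linux `/proc/<pid>/io` counters twice. This is only an approximation of what iotop reports; the function names are illustrative, the process names are assumed to be `tidb-server`, `tikv-server`, and `pd-server`, and reading another user's `/proc/<pid>/io` usually requires root.

```python
import os
import time


def parse_proc_io(text):
    """Parse the contents of /proc/<pid>/io into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[key.strip()] = int(value.strip())
    return counters


def snapshot(names):
    """Sum read_bytes/write_bytes per process name over all matching PIDs."""
    totals = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                name = f.read().strip()
            if name not in names:
                continue
            with open(f"/proc/{entry}/io") as f:
                io = parse_proc_io(f.read())
        except OSError:
            continue  # process exited, or permission denied
        r, w = totals.get(name, (0, 0))
        totals[name] = (r + io.get("read_bytes", 0), w + io.get("write_bytes", 0))
    return totals


def compute_rates(before, after, interval):
    """Return {name: (read_bytes_per_s, write_bytes_per_s)} between two snapshots."""
    rates = {}
    for name, (r1, w1) in after.items():
        r0, w0 = before.get(name, (r1, w1))
        rates[name] = ((r1 - r0) / interval, (w1 - w0) / interval)
    return rates


if __name__ == "__main__":
    names = ("tidb-server", "tikv-server", "pd-server")  # assumed process names
    interval = 5.0
    first = snapshot(names)
    time.sleep(interval)
    for name, (r, w) in compute_rates(first, snapshot(names), interval).items():
        print(f"{name}: read {r / 1024:.1f} KiB/s, write {w / 1024:.1f} KiB/s")
```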

| username: 张小凉1 | Original post link

The tikv-server process is writing logs to tikv.log. Most of the time the rate is tens of KB/s, and at peak it is over 100 KB/s.

| username: zhanggame1 | Original post link

That is the normal level, not what you see when the problem occurs, right?

| username: 有猫万事足 | Original post link

Is it a TiDB Lightning import? Does it have to connect directly to a single node?

| username: 张小凉1 | Original post link

It’s not a TiDB Lightning import.

| username: 有猫万事足 | Original post link

Check the Top SQL page in TiDB Dashboard to see what is being executed.