TiKV IO Fully Utilized

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: Tikv IO 跑满

| username: 普罗米修斯

[TiDB Usage Environment] Production Environment
[TiDB Version] v5.2.4
[Reproduction Path] Because of an incorrect TiKV label configuration, we recently scaled three TiKV nodes out and back in; that scaling work is now complete. The last TiKV node (192.168.80.212) was taken offline and shows as tombstone at the PD layer. We executed store remove-tombstone, and then found that tiup cluster prune xx reported that the already-offline TiKV node could not be found, so we used tiup cluster scale-in xxx --force to remove it (see: "How to clean up the cluster's offline-TiKV cache when the TiKV decommission steps were done incorrectly" - TiDB Q&A community);
[Encountered Problem: Phenomenon and Impact] While checking the monitoring today, we found that IO utilization on the online TiKV nodes was close to 100%. iostat showed consistently high IO usage on the TiKV hosts, and iotop showed the TiKV process as the top consumer. The TiKV logs are full of attempts to contact the already-offline TiKV node (192.168.80.212). I'm not sure whether this is what is driving the high IO, and I'd like to know how to completely remove this node, since the machine is already physically gone. (A sketch of these checks follows at the end of this post.)
[Resource Configuration]






[Attachments: Screenshots/Logs/Monitoring]
Store 115162392 is the already-offline node 192.168.80.212.
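For reference, a rough sketch of the checks described above; device names, log paths, and the store ID are examples taken from this thread and will differ per deployment:

```shell
# Overall device utilization and per-request latency (the await columns)
iostat -x 1 5

# Per-process IO, showing only processes actually doing IO; look for tikv-server
iotop -oP

# How often TiKV is still trying to reach the removed node / store
grep -c "192.168.80.212" /path/to/deploy/log/tikv.log   # adjust the log path to your deployment
grep -c "115162392" /path/to/deploy/log/tikv.log        # store ID of the removed node
```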


| username: 普罗米修斯 | Original post link

I checked and found that there are no regions on this node.

| username: WalterWj | Original post link

For NVMe drives, don't rely on IO util; it isn't accurate for them. Check disk performance and see whether the response time is high; normally it should be in the microsecond range. You can use fio to test the drive's IOPS and read/write throughput, and by comparing that with the current usage you can get an idea of how much of the disk's performance is actually being used.
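A minimal fio sketch along these lines, assuming a throwaway test file on the NVMe data disk (do not point it at the live TiKV data directory); the flags and sizes below are illustrative, not a recommendation:

```shell
# 4k random-read test: reports IOPS and completion latency for the disk under /data
fio --name=randread --filename=/data/fio-test --rw=randread --bs=4k \
    --ioengine=libaio --direct=1 --iodepth=32 --numjobs=4 \
    --size=2G --runtime=60 --time_based --group_reporting

# While it runs, watch r_await/w_await (per-request latency) on the device
iostat -x 1
```

Comparing the measured IOPS and throughput with what TiKV is currently driving gives a better sense of saturation than the util column alone.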

| username: zhanggame1 | Original post link

Are you referring to this latency for disk performance?


Or this one?

| username: 普罗米修斯 | Original post link


| username: 普罗米修斯 | Original post link

How can I truly remove the offline TiKV node so that it is no longer accessed? I think the frequent access attempts and log writes are related to this problem; the I/O wasn't this high before.
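Before deciding how to clean it up, it may help to confirm whether PD still tracks the removed store. A sketch using pd-ctl through tiup; the PD address is a placeholder and the v5.2.4 tag matches the version in this thread:

```shell
# List the stores PD still knows about
tiup ctl:v5.2.4 pd -u http://<pd-ip>:2379 store

# Look at the specific store ID seen in the TiKV logs (115162392 in this thread)
tiup ctl:v5.2.4 pd -u http://<pd-ip>:2379 store 115162392

# Clean up any remaining tombstone records in PD
tiup ctl:v5.2.4 pd -u http://<pd-ip>:2379 store remove-tombstone
```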

| username: WalterWj | Original post link

Is the data directory where you installed TiKV on an NVMe drive? Why does it look like the disk capacity for each TiKV in your monitoring is over 900 GB?

| username: 普罗米修斯 | Original post link

They're NVMe drives, 1 TB each.

| username: WalterWj | Original post link

Try increasing the low-space-ratio and high-space-ratio settings :thinking:.
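Assuming "low and high" refers to PD's space thresholds (low-space-ratio and high-space-ratio), a sketch of checking and raising them with pd-ctl; the values below are examples only:

```shell
# Show the current scheduling config, including low-space-ratio and high-space-ratio
tiup ctl:v5.2.4 pd -u http://<pd-ip>:2379 config show

# Example: raise the thresholds so stores are treated as "nearly full" later
tiup ctl:v5.2.4 pd -u http://<pd-ip>:2379 config set high-space-ratio 0.75
tiup ctl:v5.2.4 pd -u http://<pd-ip>:2379 config set low-space-ratio 0.85
```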

| username: 普罗米修斯 | Original post link

Maximum disk usage is around 60%, and there are currently no issues with region migration. The problem now is how to exclude the offline TiKV and stop it from being accessed so frequently.

| username: WalterWj | Original post link

In older versions, I remember the low ratio was 0.6. An offline TiKV will no longer be accessed unless the offline process did not complete successfully.

| username: 普罗米修斯 | Original post link

The offline process has been completed, and my situation is the same as that issue, but I'm not sure how it was handled afterwards.

| username: WalterWj | Original post link

It feels like half of the recent posts are about this issue. Normally, per the official website, you just scale in and then execute the prune operation once the node is tombstone. There's no need to manually delete the information in PD…

| username: zhanggame1 | Original post link

I'm also puzzled; I'm not sure whether the old version behaves differently from the new one.

Normally you run scale-in, then use tiup cluster display to check the status. Once the node can be removed, display prompts you to execute the prune command, and the exact command is printed at the end of the display output.
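A sketch of that normal flow, with a placeholder cluster name and the default TiKV port:

```shell
# 1. Scale in the TiKV node
tiup cluster scale-in <cluster-name> --node 192.168.80.212:20160

# 2. Watch the node status; it should go Pending Offline -> Tombstone once regions have migrated away
tiup cluster display <cluster-name>

# 3. When display says the node can be cleaned up, prune removes the tombstone node
tiup cluster prune <cluster-name>
```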

| username: WalterWj | Original post link

So where did this operation come from? It’s a bit ridiculous…

| username: 普罗米修斯 | Original post link

SSD capacity is sufficient, and in version 5 the low and high space ratios are 0.8 and 0.7 respectively, so capacity is not a concern. The TiKV logs keep reporting an "invalid store ID" error and printing it constantly, so the investigation into the high IO has shifted to this error and the frequent log writes.

The store remove-tombstone operation mentioned above was run after the scale-in, because Grafana was still showing the decommissioned node as abnormal in monitoring. This approach is also described on the official website; you can look it up.

| username: zhanggame1 | Original post link

Executing store remove-tombstone only modifies PD; it does not update the monitoring, so Grafana will still report an error. Normally this should be handled through tiup.

| username: 普罗米修斯 | Original post link

Bro, don't get hung up on this. We've run into this situation many times and always handle it this way: after the node goes offline, its TiKV status changes to tombstone but Grafana still shows it. Running store remove-tombstone in pd-ctl makes the abnormal node disappear from the display, because that data is read from PD on port 2379.
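A sketch of that check; as noted above, the dashboards ultimately reflect what PD on 2379 reports (the PD address is a placeholder):

```shell
# Inspect the store list PD exposes over its HTTP API
curl -s http://<pd-ip>:2379/pd/api/v1/stores

# Clean up tombstone records so they stop being reported
tiup ctl:v5.2.4 pd -u http://<pd-ip>:2379 store remove-tombstone
```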

| username: WalterWj | Original post link

That command only cleans up the TiKV tombstone information in PD.
The "fully offline" state I mean here is after both the scale-in and prune operations have been executed.

| username: WalterWj | Original post link

:thinking: If prune has not been executed, check the node you scaled in… see whether the tikv-server process is still running there…
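A sketch of that last check, with placeholder names; if the process is gone but the node still appears in the topology, prune is what clears it:

```shell
# On the scaled-in host: is tikv-server still running?
ps -ef | grep "[t]ikv-server"

# From the control machine: what does tiup still think the topology is?
tiup cluster display <cluster-name>

# If the node shows as Tombstone, clean it up
tiup cluster prune <cluster-name>
```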