[TiDB Version] 5.4.1
[Encountered Issue] Around 3 PM, TiDB queries suddenly slowed down and DM replication also stopped. We later found that CPU usage on one of the seven TiKV nodes was very high. After stopping that TiKV node, the cluster returned to normal.
The attachment includes the TiKV monitoring data and the TiKV log files of the faulty node (192.168.10.70) (the full logs were too large, so INFO-level entries have been filtered out).
Fault window: the 25th, from 15:00 to 17:27. During this period we restarted the TiKV, TiFlash, and TiDB nodes once, with no effect. SQL was very slow for both reads and writes (SQL QPS was lower than normal because part of the traffic had already been diverted). After the faulty node was stopped at 17:27, the cluster's SQL processing returned to normal.
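For what it's worth, one way to confirm from SQL that the slowness was concentrated on the .70 node is to group the slow queries from the fault window by the TiKV address that spent the longest coprocessor processing time on them. This is only a sketch using the standard cluster_slow_query system table in 5.x; the date below is a placeholder for the actual incident day.

```sql
-- A sketch: count slow queries in the fault window by the TiKV
-- instance with the longest coprocessor processing time.
-- Replace the placeholder date with the actual incident day.
SELECT
    Cop_proc_addr             AS tikv_instance,
    COUNT(*)                  AS slow_query_cnt,
    ROUND(AVG(Query_time), 3) AS avg_query_time_s
FROM information_schema.cluster_slow_query
WHERE Time BETWEEN '2022-01-25 15:00:00' AND '2022-01-25 17:27:00'  -- placeholder date
GROUP BY Cop_proc_addr
ORDER BY slow_query_cnt DESC;
```

If most of the slow queries in that window point at 192.168.10.70 as Cop_proc_addr, that at least confirms the slowdown was driven by that store rather than by the TiDB layer.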
This analysis doesn't explain why the cluster returned to normal after this TiKV node was stopped, so it doesn't seem to be the root cause.
Additional information:
This node actually has a better CPU configuration than the other TiKV nodes. During the failure, memory usage was normal, disk usage was below 60%, and IO utilization, IO throughput, and network traffic all decreased (which I attribute to the drop in traffic). The anomaly is that its CPU usage was far higher than on the other six, healthy TiKV nodes.
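If this imbalance shows up again, current per-instance load can also be compared straight from SQL rather than only in Grafana. A minimal sketch using the standard cluster_load system table is below; the exact device/metric row names ('cpu', 'load1', ...) may differ slightly by version, and it only reflects the moment the query runs, so the Grafana TiKV panels remain the reference for the fault window itself.

```sql
-- A sketch: compare current CPU load rows across TiKV instances.
-- cluster_load is a standard system table in TiDB 5.x; row names
-- under device_type = 'cpu' may vary by version.
SELECT instance, name, value
FROM information_schema.cluster_load
WHERE type = 'tikv' AND device_type = 'cpu'
ORDER BY instance, name;
```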
Looking at the monitoring around 15:00, Raft write requests were very high and write latency also increased. Was there any change in your application workload at that time?
Also, did the IO behavior on that machine change at the same time?
Read requests on the .70 machine are also very high and clearly unbalanced compared with the other stores. Is there a read hotspot?
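To check from SQL whether the leaders of the hot read regions are piling up on the .70 store, something like the following sketch should work; it joins the standard information_schema hot-region, region-peer, and store-status tables, and you can swap 'read' for 'write' to look at write hotspots as well. Note that these views only reflect the current state, so for a post-mortem the PD hot-region panels in Grafana during the fault window are the better reference.

```sql
-- A sketch: see whether the leaders of the hottest read regions
-- are concentrated on one store (e.g. the 192.168.10.70 instance).
SELECT s.address          AS store_address,
       COUNT(*)           AS hot_read_region_leaders,
       SUM(h.flow_bytes)  AS total_flow_bytes
FROM information_schema.tidb_hot_regions   h
JOIN information_schema.tikv_region_peers  p
  ON p.region_id = h.region_id AND p.is_leader = 1
JOIN information_schema.tikv_store_status  s
  ON s.store_id = p.store_id
WHERE h.type = 'read'
GROUP BY s.address
ORDER BY total_flow_bytes DESC;
```

If one table or index dominates the hot-region list, checking its region distribution with SHOW TABLE ... REGIONS and then splitting/scattering the hot regions would be the usual next step.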