[TiDB Usage Environment] Production Environment
[TiDB Version] 5.4.1
[Resource Configuration] 32c 180G 1T SSD
[Attachment: Screenshot/Log/Monitoring]
TiKV CPU usage suddenly maxes out, and TiKV read/write volume jumps from 1G-2G to around 7G, causing a large number of business services to restart.
This is generally caused by a heavy SQL statement, such as a full table scan. Check the expensive SQL recorded in the TiDB logs for that period. Alternatively, search tikv.log for the “slow” keyword to see whether any large tasks were running. The monitoring also shows that network read traffic has increased.
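The log check described above could be scripted roughly as follows. This is only a sketch: it assumes TiDB marks expensive statements with an [expensive_query] tag plus [cost_time=...s] and [sql=...] fields, and that both tidb.log and tikv.log lines start with a [YYYY/MM/DD HH:MM:SS.mmm +08:00] timestamp. The function names and the time-window arguments are made up for illustration, so adjust the patterns to what your logs actually contain.

```python
import re

# Assumed field layout of a TiDB expensive-query log line; verify against your own logs.
EXPENSIVE_MARKER = "[expensive_query]"
COST_TIME_RE = re.compile(r"\[cost_time=([0-9.]+)s\]")
SQL_RE = re.compile(r"\[sql=(.*)\]\s*$")


def _timestamp(line):
    # Text right after the leading "[", e.g. "2022/08/01 10:15:30.123" (23 characters).
    return line[1:24]


def find_expensive_queries(lines, start, end):
    """Return (timestamp, cost_seconds, sql) tuples for expensive-query lines
    whose timestamp falls in [start, end], slowest first."""
    hits = []
    for line in lines:
        if EXPENSIVE_MARKER not in line:
            continue
        ts = _timestamp(line)
        if not (start <= ts <= end):
            continue
        cost = COST_TIME_RE.search(line)
        sql = SQL_RE.search(line)
        hits.append((
            ts,
            float(cost.group(1)) if cost else 0.0,
            sql.group(1).strip('"') if sql else "",
        ))
    return sorted(hits, key=lambda h: h[1], reverse=True)


def find_slow_lines(lines, start, end):
    """Return tikv.log lines in [start, end] that mention "slow" (case-insensitive)."""
    return [
        line for line in lines
        if "slow" in line.lower() and start <= _timestamp(line) <= end
    ]
```

For example, reading tidb.log with open(...) and calling find_expensive_queries(f, "2022/08/01 10:00:00.000", "2022/08/01 11:00:00.000") would list the statements logged in that hour, slowest first. The timestamps are compared as strings, which only works because the assumed format is fixed-width.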
As suggested in the reply above, check the expensive SQL first. The MBps panel for TiKV shows that read traffic has reached the GiB level, so focus on read requests and slow queries.
Refer to the slow query troubleshooting documentation.
This is caused by slow SQL. In my experience, right joins have also performed poorly, so you might try rewriting them as left joins (A RIGHT JOIN B can be written as B LEFT JOIN A). If possible, I would also suggest killing SQL that exceeds a duration or memory threshold and analyzing it afterwards. That way it does not affect the services that the production environment provides.
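The kill-by-duration-and-memory idea could look roughly like this. It is a sketch under assumptions: the rows are taken to come from INFORMATION_SCHEMA.PROCESSLIST with its ID, TIME (seconds), MEM (bytes) and INFO (SQL text) columns, the thresholds are placeholders rather than recommendations, and the function names are hypothetical. It only builds KILL TIDB statements and keeps the SQL text for later analysis. It does not connect to the cluster.

```python
def pick_kill_candidates(rows, max_seconds=300, max_mem_bytes=4 * 1024**3):
    """Select running statements that exceed a duration or memory threshold.

    rows: list of dicts with keys ID, TIME, MEM, INFO (assumed to mirror
    INFORMATION_SCHEMA.PROCESSLIST). Thresholds are illustrative defaults.
    """
    candidates = []
    for row in rows:
        if not row.get("INFO"):
            continue  # idle connection, no statement is running
        reasons = []
        if row["TIME"] >= max_seconds:
            reasons.append("duration")
        if row["MEM"] >= max_mem_bytes:
            reasons.append("memory")
        if reasons:
            candidates.append({
                "id": row["ID"],
                "sql": row["INFO"],       # keep the text so it can be analyzed later
                "reasons": reasons,
            })
    return candidates


def kill_statements(candidates):
    """Build one KILL TIDB statement per candidate connection."""
    return [f"KILL TIDB {c['id']};" for c in candidates]
```

Before running the generated statements, review the list. On this version a KILL TIDB statement typically needs to be sent to the TiDB instance that owns the connection, so check how your deployment routes connections. Saving the SQL text first means the statement can still be examined with EXPLAIN after it has been killed.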