TiKV CPU Spikes to Full Usage

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiKV CPU瞬间拉满 (TiKV CPU instantly maxes out)

| username: 重启试试

[TiDB Usage Environment] Production Environment
[TiDB Version] 5.4.1
[Resource Configuration] 32c 180G 1T SSD
[Attachment: Screenshot/Log/Monitoring]
TiKV's CPU suddenly spikes to 100%, and TiKV's read/write throughput jumps from 1-2 GB straight to around 7 GB, forcing a large number of application services to restart.

This is accompanied by a large number of slow queries. Has anyone encountered the same phenomenon?

| username: 重启试试 | Original post link

Raftstore errors also increase during the same period.

| username: 我是咖啡哥 | Original post link

Check slow SQL on the dashboard.

| username: 重启试试 | Original post link

Queries that normally take 60-300 ms suddenly take 10+ seconds, or even 100+ seconds, when the problem occurs.

| username: 重启试试 | Original post link

The TiKV logs often show leader election failures.

| username: 重启试试 | Original post link

When IO util drops, TiKV's MBps rises; the two metrics move inversely.

| username: WalterWj | Original post link

This is generally caused by heavy, full-scan SQL. Check the expensive SQL in the TiDB logs during that period. Alternatively, grep for the "slow" keyword in tikv.log to see whether there are any large tasks. According to the monitoring, network reads have increased.
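The two log checks above can be sketched as shell commands. TiDB tags runaway statements with the `expensive_query` keyword in tidb.log; TiKV marks long-running tasks with `slow`. The log line below is an illustrative sample written by the script itself, not real output from this cluster; on a real deployment, point the grep at your actual log path.

```shell
# Illustrative sample of a tidb.log so the grep below is runnable;
# the field names mirror TiDB's expensive-query log format.
cat > /tmp/tidb.log <<'EOF'
[2023/05/10 22:40:12.345 +08:00] [WARN] [expensivequery.go:145] [expensive_query] [cost_time=63.2s] [conn_id=12345] [sql="SELECT * FROM big_table"]
[2023/05/10 22:41:02.000 +08:00] [INFO] [server.go:100] [some normal message]
EOF

# Expensive statements logged during the incident window:
grep 'expensive_query' /tmp/tidb.log

# Same idea on the TiKV host: grep 'slow' tikv.log
```

The `cost_time` and `sql` fields in the matching lines point directly at the offending statements.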

| username: tidb菜鸟一只 | Original post link

This is generally a chain reaction triggered by slow SQL. Check the logs around 22:40 for any particularly large SQL statements.

| username: 裤衩儿飞上天 | Original post link

Eliminating slow SQL can solve 80% of the problems.

| username: ohammer | Original post link

Did the application launch a new feature? There is a sudden spike in read requests.

| username: DBRE | Original post link

Check the slow SQL and see if the execution plan has changed.
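One way to check for plan changes: TiDB's `INFORMATION_SCHEMA.SLOW_QUERY` table records both a statement fingerprint (`Digest`) and a plan fingerprint (`Plan_digest`); a statement digest that maps to more than one plan digest had its execution plan change. A minimal sketch over exported (digest, plan_digest) pairs; the sample data below is made up for illustration:

```shell
# Sample export (columns: statement_digest  plan_digest); in production
# this would come from a query like
#   SELECT Digest, Plan_digest FROM INFORMATION_SCHEMA.SLOW_QUERY;
cat > /tmp/slow_digests.txt <<'EOF'
digestA plan1
digestA plan2
digestB plan3
digestB plan3
EOF

# Print any statement digest seen with more than one distinct plan:
awk '!seen[$1" "$2]++ { plans[$1]++ } END { for (d in plans) if (plans[d] > 1) print d }' /tmp/slow_digests.txt
```

Here only `digestA` is printed, because it ran with two different plans.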

| username: Lucien-卢西恩 | Original post link

Following the suggestion above, check for expensive SQL. Judging from TiKV's MBps panel, the main issue is that read traffic has reached the GiB level, so focus on read requests and slow queries.

Refer to the slow query troubleshooting documentation.

| username: xingzhenxiang | Original post link

This is caused by slow SQL. Additionally, I have found that right joins perform poorly in practice, so you might try rewriting them as left joins. If possible, consider killing SQL sessions by duration and memory usage and analyzing them afterwards, so that the service the production environment provides is not affected.
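The kill-by-duration/memory idea can be sketched like this: pull session ID, running time, and memory from `INFORMATION_SCHEMA.PROCESSLIST` (TiDB exposes per-session memory there) and emit `KILL TIDB` statements for offenders. The data and thresholds below are illustrative stand-ins, not output from this cluster:

```shell
# Sample processlist export (columns: ID  USER  TIME_s  MEM_bytes);
# in production this would come from a query like
#   SELECT ID, USER, TIME, MEM FROM INFORMATION_SCHEMA.PROCESSLIST;
cat > /tmp/processlist.txt <<'EOF'
12345 app_user 305 8589934592
12346 app_user 2 1048576
EOF

# Emit a KILL for any session running >60 s or holding >4 GiB of memory
# (TiDB 5.x uses the "KILL TIDB <id>" syntax):
awk '$3 > 60 || $4 > 4*1024*1024*1024 { print "KILL TIDB " $1 ";" }' /tmp/processlist.txt
```

Review the generated statements before feeding them to a SQL client; killing a session rolls back its in-flight work.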

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.