Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: TIKV CPU 爆了
【TiDB Usage Environment】Production Environment
【TiDB Version】v6.5.1
【Encountered Problem: Problem Phenomenon and Impact】
The CPU usage of the TiKV nodes is constantly above 95%.
【Resource Configuration】
5 TiKV nodes, 16C 32G
【Attachments: Screenshots/Logs/Monitoring】
What could be the reason?
Please supplement with the TiKV logs, the Overview monitoring, the detailed TiKV CPU monitoring, the slow query page on the Dashboard, and the TiDB monitoring. Common causes include hotspots, a large number of slow SQL statements, load caused by automatic analyze, and so on; the specific situation needs further analysis.
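If automatic analyze is a suspect, one quick check is whether any analyze jobs are currently running. A minimal sketch, assuming a MySQL client such as pymysql (TiDB speaks the MySQL protocol) and placeholder connection parameters:

```python
# Minimal sketch: check whether automatic analyze jobs are running right now.
# Connection parameters are placeholders; adjust host/port/user/password.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=4000, user="root", password="")
try:
    with conn.cursor() as cur:
        # SHOW ANALYZE STATUS lists recent and ongoing analyze tasks;
        # rows whose State column is "running" indicate analyze load right now.
        cur.execute("SHOW ANALYZE STATUS")
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()
```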
Were any operations performed around the time the resource shortage appeared?
If TiKV CPU usage is high while every SQL statement is being executed, possible reasons include:
- The data scale and the resource allocation are mismatched.
- The cluster is not being used in the intended way; optimization should be done by referring to the best practices.
- There are too many slow queries, occupying resources that cannot be released for a long time.
It is recommended to assess the business requirements and data scale to determine if the current resource allocation is sufficient. Then, use the Top N method in the Dashboard to identify and optimize the queries that consume the most resources.
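If you prefer SQL to the Dashboard UI, roughly the same Top N information can be pulled from the statements summary tables. A minimal sketch, with placeholder connection parameters:

```python
# Minimal sketch: list the statements with the highest accumulated latency,
# roughly what the Dashboard "SQL Statements" Top N page shows.
# Connection parameters are placeholders.
import pymysql

TOP_N_SQL = """
SELECT DIGEST_TEXT, EXEC_COUNT, AVG_LATENCY, SUM_LATENCY
FROM INFORMATION_SCHEMA.CLUSTER_STATEMENTS_SUMMARY
ORDER BY SUM_LATENCY DESC
LIMIT 10
"""

conn = pymysql.connect(host="127.0.0.1", port=4000, user="root", password="")
try:
    with conn.cursor() as cur:
        cur.execute(TOP_N_SQL)
        for digest_text, exec_count, avg_latency, sum_latency in cur.fetchall():
            # Latency columns in the statements summary are in nanoseconds.
            print(f"{sum_latency:>15} ns total | {exec_count:>8} exec | {digest_text[:80]}")
finally:
    conn.close()
```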
In Grafana, under TiKV-Details, check the Thread CPU panels to see which thread pool is consuming the most CPU.
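For reference, the Thread CPU panels are built on the `tikv_thread_cpu_seconds_total` metric, so the same numbers can be pulled from the Prometheus HTTP API. A minimal sketch, assuming Prometheus listens on a placeholder address:

```python
# Minimal sketch: per-thread-pool CPU usage of each TiKV over the last 5 minutes,
# the same metric behind the TiKV-Details -> Thread CPU panels.
# The Prometheus address is a placeholder.
import requests

PROM = "http://127.0.0.1:9090"
QUERY = "sum(rate(tikv_thread_cpu_seconds_total[5m])) by (instance, name)"

resp = requests.get(f"{PROM}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    labels = series["metric"]
    cores = float(series["value"][1])  # CPU cores used by this thread group
    print(f'{labels.get("instance", "?"):<22} {labels.get("name", "?"):<20} {cores:.2f} cores')
```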
The CPU usage of TiKV nodes consistently staying above 95% could be due to various reasons. Here are some possible causes and solutions:
- A large number of tombstone keys have not yet been compacted, so many SQL statements scan a lot of useless data. This can drive TiKV's Storage ReadPool CPU to full utilization, making TiKV busy and slow to process requests. Check the TiKV instance's CPU usage and the Storage ReadPool CPU in the TiKV-Details monitoring panel, and query the cluster metadata through the TiDB server to confirm whether the Regions accessed by the erroring application sit on the TiKV instance with abnormal CPU usage. If this is the case, it can be resolved with a manual Compaction (see the first sketch after this reply).
- A "hotspot" caused by a hardware (memory) fault. Run hardware diagnostics to confirm whether a memory fault exists. If it does, add a new server to the cluster first and then remove the faulty one: scale out a TiKV instance on the new server, then gradually scale in the TiKV instances on the faulty server. Afterwards the cluster's query duration returns to its previous level and CPU usage across the TiKV servers becomes more balanced.
- A hotspot caused by abnormal SQL. Analyze the slow logs to see whether there are abnormal SQL statements; if there are, optimize them or adjust the TiKV cluster's configuration parameters. The current hot regions can also be listed directly (see the second sketch after this reply).
- Other problems with CPU, memory, IO, or the network. Check the CPU, memory, IO, and network status of each component's server through the monitoring panels; if such problems exist, optimize or adjust the TiKV cluster's configuration parameters accordingly.
I hope the above information can help you resolve the high CPU usage issue of TiKV nodes.
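Regarding the tombstone/compaction point above (the first sketch mentioned in that list): a tell-tale sign in the slow log is `Total_keys` being much larger than `Process_keys`, i.e. many deleted MVCC versions are scanned for each useful key. The sketch below uses placeholder connection details and TiKV address; the manual compaction at the end uses the standard `tikv-ctl compact` command, so verify the flags against `tikv-ctl compact --help` for your version before running it:

```python
# Minimal sketch: find slow queries whose Total_keys is far larger than
# Process_keys (a hint that many deleted MVCC versions are being scanned),
# then optionally trigger a manual compaction of the write CF on the busy TiKV.
# Connection details, the TiKV address and the tikv-ctl binary are placeholders.
import subprocess
import pymysql

SCAN_WASTE_SQL = """
SELECT Digest, Query_time, Process_keys, Total_keys, Query
FROM INFORMATION_SCHEMA.CLUSTER_SLOW_QUERY
WHERE Total_keys > 10 * GREATEST(Process_keys, 1)
ORDER BY Total_keys DESC
LIMIT 10
"""

conn = pymysql.connect(host="127.0.0.1", port=4000, user="root", password="")
try:
    with conn.cursor() as cur:
        cur.execute(SCAN_WASTE_SQL)
        for digest, qt, pk, tk, query in cur.fetchall():
            print(f"process_keys={pk} total_keys={tk} query_time={qt}s {query[:60]}")
finally:
    conn.close()

# Optionally compact the write CF on one TiKV instance (placeholder address).
# This is I/O heavy, so run it deliberately; on a tiup-managed cluster the same
# command is usually invoked as: tiup ctl:v6.5.1 tikv ...
subprocess.run(
    ["tikv-ctl", "--host", "10.0.0.1:20160", "compact", "-d", "kv", "-c", "write"],
    check=True,
)
```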
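Regarding the hotspot point (the second sketch mentioned above): the current hot regions can be listed straight from `information_schema`, which quickly shows which tables or indexes the hot traffic lands on. A minimal sketch with placeholder connection parameters:

```python
# Minimal sketch: list the current read/write hot regions and the tables or
# indexes they belong to. Connection parameters are placeholders.
import pymysql

HOT_REGIONS_SQL = """
SELECT TYPE, DB_NAME, TABLE_NAME, INDEX_NAME, MAX_HOT_DEGREE, REGION_COUNT, FLOW_BYTES
FROM INFORMATION_SCHEMA.TIDB_HOT_REGIONS
ORDER BY FLOW_BYTES DESC
LIMIT 20
"""

conn = pymysql.connect(host="127.0.0.1", port=4000, user="root", password="")
try:
    with conn.cursor() as cur:
        cur.execute(HOT_REGIONS_SQL)
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()
```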
There should be slow SQL, right?