TiDB Service Process CPU Usage Suddenly Spikes Continuously

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB tidb服务进程CPU使用率突然连续暴增

| username: Curiouser

To improve efficiency, please provide the following information. A clear problem description can help resolve the issue faster:

[TiDB Related Information]
TiDB V5.1.0 with two PDs, two TiDBs, and three TiKVs
image

[Phenomenon Problem]

On October 14th at 2 PM, one of the TiDB nodes reported that the CPU exceeded the threshold and has remained high since then.

[Problem Alert Screenshot]

Node CPU Monitoring Chart


Node Top Real-time Chart

Grafana TiDB Monitoring Information Chart

[Problem Assistance]

How to troubleshoot the CPU surge? How to find the root cause of the CPU surge?

| username: db_user | Original post link

The sudden increase in TiDB’s CPU usage is generally caused by slow queries. You can check if there are any slow queries or locks on the current node.

Additionally, you need to check if there are any scenarios where automatic analyze has failed and how it is configured.

| username: Curiouser | Original post link

This is the slow SQL query result.

SELECT
	DB AS 'Database',
	`Query` AS 'SQL Statement',
	Query_time AS 'Execution Time Query_time',
	Time AS 'Execution Time',
	`User` AS 'Execution User',
	`Host` AS 'Host',
	Total_keys AS 'Total Scanned Keys Total_keys',
	Index_names AS 'Involved Index Names'
FROM
	INFORMATION_SCHEMA.SLOW_QUERY 
ORDER BY
	Query_time DESC 
	LIMIT 100;

| username: db_user | Original post link

  1. Use SHOW VARIABLES LIKE 'analyze%'; to check the results. If the issue is caused by analyze, you can adjust the start and end time range of the analyze.

  2. Additionally, use SHOW ANALYZE STATUS; or SELECT * FROM information_schema.ANALYZE_STATUS; to check if there are any failed automatic analyze tasks.

  3. Check the heatmap on the dashboard to see if there are any read hotspot issues.

| username: Curiouser | Original post link

1:

show variables like '%ANALYZE%' shows no related variables and configurations.
image

2:

SELECT job_info, state, start_time FROM information_schema.ANALYZE_STATUS LIMIT 500; query results, all jobs are finished.
image

3: Heatmap

| username: db_user | Original post link

  1. “analyze” should be lowercase, not uppercase.
  2. It currently doesn’t look like the failure was caused by “analyze”.
  3. Change the byte write amount to either “all” or the two related to reading.
| username: db_user | Original post link

Here are two suggestions you can try adjusting:

  1. Change tidb_analyze_version to 1. This is because version 2 is an experimental feature in version 5.1 and has some issues. I remember there were a few issues related to this.

  2. Shorten the end_time and start_time. You can set it to one to two hours during off-peak business hours. Check the results after making the changes.

| username: Curiouser | Original post link

Sure, thanks.

| username: db_user | Original post link

You’re welcome. Once it’s done, you can observe if there are any effects.

| username: tidb狂热爱好者 | Original post link

The essence is tidboom. The reason is data issues. The SQL is slow SQL.

| username: Curiouser | Original post link

@db_user @tidb Enthusiast

I found a lot of slow SQL queries. I want to manually kill these processes to see if the CPU usage can decrease, but I encountered another problem. Using kill tidb + process ID doesn’t work. :joy: Related issue link: show processlist的进程号杀不掉,且sleep时间超过time_out的限制时间,为什么不断开 - #3,来自 Hi70KG - TiDB 的问答社区

image

| username: db_user | Original post link

If it happens on the corresponding instance (if you can’t kill it on TiDB using SQL), then it’s a bug. I resolved it by restarting the TiDB node.

| username: 近墨者zyl | Original post link

To handle slow SQL, you need to optimize it by adding indexes or changing the logic and writing of the SQL. In short, try to minimize the amount of data returned from TiKV to the TiDB server, or implement flow control on the frontend.

| username: Curiouser | Original post link

OK, I’ll try restarting that node tonight.

| username: Curiouser | Original post link

As long as the data operation has a slightly higher computational complexity, developers who can solve it with SQL resolutely do not want to write code to solve it. They will only say, why is the database performance so poor :joy:

| username: 近墨者zyl | Original post link

Ask the leadership to communicate more with the developers. If operations and development are disconnected and each is busy with their own tasks, it will cause issues for both the application system and the database. Establish development standards, and ensure that development testing, stress testing, and DBA involvement are all part of the process.

| username: alfred | Original post link

Haha, everyone wants to take the easy way out.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.