SQL Occasional Primary Key Query Speed is Particularly Slow

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: SQL 偶发主键查询速度特别慢

| username: keithyzk

【TiDB Usage Environment】Production environment with 3 TiDB, 3 PD, and 7 TiKV
【TiDB Version】v5.0.4
【Reproduction Path】Primary key query is occasionally very slow, taking several minutes to return results
【Problem Encountered: Phenomenon and Impact】
SELECT ... FROM ... WHERE &lt;primary key&gt; = &lt;value&gt;; is occasionally very slow, taking several minutes.

【Attachments: Screenshots/Logs/Monitoring】
SQL statement (folder_id is the primary key):

SELECT
  dmsfolderd0_.folder_id AS folder1_58_0_,
  dmsfolderd0_.instance_id AS instance2_58_0_,
  dmsfolderd0_.area_id AS area3_58_0_,
  dmsfolderd0_.folder_parentFolderId AS folder4_58_0_,
  dmsfolderd0_.folder_guid AS folder5_58_0_,
  dmsfolderd0_.folder_code AS folder6_58_0_,
  dmsfolderd0_.folder_sortOrder AS folder7_58_0_,
  dmsfolderd0_.folder_name AS folder8_58_0_,
  dmsfolderd0_.folder_path AS folder9_58_0_,
  dmsfolderd0_.folder_size AS folder10_58_0_,
  dmsfolderd0_.folder_maxFolderSize AS folder11_58_0_,
  dmsfolderd0_.folder_alertSize AS folder12_58_0_,
  dmsfolderd0_.folder_maxFileSize AS folder13_58_0_,
  dmsfolderd0_.folder_forbiddenFileExtensions AS folder14_58_0_,
  dmsfolderd0_.folder_childFoldersCount AS folder15_58_0_,
  dmsfolderd0_.folder_childFilesCount AS folder16_58_0_,
  dmsfolderd0_.securityLevel_id AS securitylevel17_58_0_,
  dmsfolderd0_.folder_lockCount AS folder18_58_0_,
  dmsfolderd0_.folder_permStrategy AS folder19_58_0_,
  dmsfolderd0_.folder_verStrategy AS folder20_58_0_,
  dmsfolderd0_.folder_defaultViewId AS folder21_58_0_,
  dmsfolderd0_.folder_state AS folder22_58_0_,
  dmsfolderd0_.folder_ownerId AS folder23_58_0_,
  dmsfolderd0_.folder_createType AS folder24_58_0_,
  dmsfolderd0_.folder_createTime AS folder25_58_0_,
  dmsfolderd0_.folder_createOperator AS folder26_58_0_,
  dmsfolderd0_.folder_modifyTime AS folder27_58_0_,
  dmsfolderd0_.folder_modifyOperator AS folder28_58_0_,
  dmsfolderd0_.folder_type AS folder29_58_0_,
  dmsfolderd0_.folder_offlinePermission AS folder30_58_0_,
  dmsfolderd0_.folder_ExtensionVerifyType AS folder31_58_0_,
  dmsfolderd0_.folder_remark AS folder32_58_0_,
  dmsfolderd0_.folder_modifyOperatorName AS folder33_58_0_,
  dmsfolderd0_.folder_createOperatorName AS folder34_58_0_,
  dmsfolderd0_.folder_deleteTime AS folder35_58_0_,
  dmsfolderd0_.folder_deleteOperator AS folder36_58_0_,
  dmsfolderd0_.folder_purgeTime AS folder37_58_0_,
  dmsfolderd0_.folder_purgeOperator AS folder38_58_0_,
  dmsfolderd0_.folder_isDeleted AS folder39_58_0_,
  dmsfolderd0_.folder_isCascadeDelete AS folder40_58_0_,
  dmsfolderd0_.folder_deleteName AS folder41_58_0_,
  dmsfolderd0_.folder_deleteOperatorName AS folder42_58_0_,
  dmsfolderd0_.folder_purgeOperatorName AS folder43_58_0_,
  dmsfolderd0_.folder_rootFolderId AS folder44_58_0_
FROM
  dms_folder dmsfolderd0_
WHERE
  dmsfolderd0_.folder_id = 28145829;

Analysis during slow occurrence

	id:             Point_Get_1
	task:           root
	estRows:        1
	operator info:  table:dms_folder, handle:28145829
	actRows:        0
	execution info: time:11m21.7s, loops:1, Get:{num_rpc:32, total_time:10m40s},
	                tikvRPC_backoff:{num:31, total_time:41.6s}, regionMiss_backoff:{num:1, total_time:2ms}
	memory:         N/A
	disk:           N/A
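
For reference, other executions of this statement and how much of their time went to backoff can also be pulled from the slow query tables; a minimal sketch, assuming the v5.x INFORMATION_SCHEMA.SLOW_QUERY columns and a hypothetical midnight window (CLUSTER_SLOW_QUERY covers all TiDB instances):

-- Slow executions of this statement in a given window, with their backoff time
SELECT
  Time,
  Query_time,
  Backoff_time,
  Query
FROM
  INFORMATION_SCHEMA.SLOW_QUERY
WHERE
  Query LIKE '%dms_folder%folder_id%'
  AND Time BETWEEN '2024-01-01 23:30:00' AND '2024-01-02 01:30:00'  -- hypothetical window
ORDER BY
  Query_time DESC
LIMIT 20;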

Analysis during normal operation

| username: tidb狂热爱好者 | Original post link

TiDB is already up to 7.5. When TiDB and TiKV are busy, the RPC time can be very long; the 11 minutes you mentioned is likely due to queuing. See if you can optimize it.

| username: ShawnYan | Original post link

Could you also share the table definition? (with sensitive information removed)

| username: 像风一样的男子 | Original post link

Could you provide a screenshot so we can see at which specific stage the slow query’s execution time is being spent?

| username: ShawnYan | Original post link

Consider upgrading to 7.5 in 2024.

| username: keithyzk | Original post link

The table definition should not matter much.
Sorry, I can’t copy it, so I can only take screenshots and piece them together for you to see.

| username: keithyzk | Original post link

Do you mean that 7.5 includes optimizations for long RPC queue times, so upgrading might resolve this situation?

| username: WalterWj | Original post link

It looks like… it encountered Region scheduling, but it doesn’t seem reasonable for that to take such a long time :thinking:.

| username: h5n1 | Original post link

Are the conditions for these slow SQL queries the same? What happens if you execute them during the day?

| username: 小龙虾爱大龙虾 | Original post link

It’s definitely not an execution plan issue; there are many backoffs. Is there something wrong with the overall cluster?

| username: Kongdom | Original post link

:thinking: Did a drop in table health cause the primary key not to be used? Or is there a read/write hotspot on the table?
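
If that’s the suspicion, both points can be checked directly; a sketch, assuming SHOW STATS_HEALTHY and INFORMATION_SCHEMA.TIDB_HOT_REGIONS behave as in v5.x:

-- Statistics health of the table (values close to 100 mean fresh statistics)
SHOW STATS_HEALTHY WHERE Table_name = 'dms_folder';

-- Whether the table currently shows up among the hot Regions (read or write)
SELECT
  DB_NAME,
  TABLE_NAME,
  TYPE,
  MAX_HOT_DEGREE,
  FLOW_BYTES
FROM
  INFORMATION_SCHEMA.TIDB_HOT_REGIONS
WHERE
  TABLE_NAME = 'dms_folder';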

| username: tidb狂热爱好者 | Original post link

This person is from the official team.

| username: tidb狂热爱好者 | Original post link

It’s definitely not an execution plan issue; there are a lot of backoffs. Is there something wrong with the overall cluster? I suspect they have a lot of bad SQL queries, and everything is backing off.

| username: keithyzk | Original post link

The slow SQL queries have the same statements except for different primary key values in the WHERE clause. This SQL is part of a scheduled task in our program that runs only at midnight. Therefore, the slow logs only contain records from midnight. However, the performance is inconsistent at midnight, sometimes fast and sometimes slow. When we execute this query during the day, it is very fast.

| username: keithyzk | Original post link

Yes, the query time for this primary key is far too long. In addition, I have a scheduled task that runs at 1 AM every day, analyzes all tables with a health score below 80, and writes an analyze log. From those logs, the task runs every day and finishes by 1:30 AM. Where exactly should I look to investigate the cause of this issue? And does version 7.5 include targeted optimizations that might resolve it after an upgrade?
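
One way to check whether the 1 AM analyze job (or TiDB’s own auto-analyze) overlaps with the midnight task is to look at the analyze history and the auto-analyze window; a sketch, assuming v5.x behavior of SHOW ANALYZE STATUS:

-- Recent analyze jobs for this table: job info, processed rows, start time, state
SHOW ANALYZE STATUS WHERE Table_name = 'dms_folder';

-- Time window in which TiDB is allowed to run auto-analyze on its own
SHOW GLOBAL VARIABLES LIKE 'tidb_auto_analyze%';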

| username: 有猫万事足 | Original post link

How was the overall resource utilization during that time?

It feels like some operations occupied the CPU, IO, and bandwidth. Otherwise, it’s a bit hard to understand why a point query would take 10 minutes.
The information provided so far is still too limited.
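
Besides the Grafana TiDB/TiKV dashboards for that time range, the cluster-wide slow query table can give a rough picture of what else was running; a sketch, assuming the v5.x INFORMATION_SCHEMA.CLUSTER_SLOW_QUERY columns and a hypothetical midnight window:

-- What else was slow cluster-wide around midnight, grouped by statement digest
SELECT
  Digest,
  COUNT(*) AS cnt,
  ROUND(SUM(Query_time), 1) AS total_query_time,
  ROUND(MAX(Query_time), 1) AS max_query_time
FROM
  INFORMATION_SCHEMA.CLUSTER_SLOW_QUERY
WHERE
  Time BETWEEN '2024-01-01 23:30:00' AND '2024-01-02 01:30:00'  -- hypothetical window
GROUP BY
  Digest
ORDER BY
  total_query_time DESC
LIMIT 20;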

| username: tidb狂热爱好者 | Original post link

Backoff means that TiKV is overwhelmed and asks the client to wait and retry. If backoff shows up in monitoring, it means you have too many bad SQL queries.

| username: tidb狂热爱好者 | Original post link

There are too many bad SQLs, so TiKV first returns a backoff. It may also be related to your table statistics job at 1 AM; the table health could be very poor around midnight.

| username: xingzhenxiang | Original post link

Let’s first take a look at the CPU and IO situation.

| username: 这里介绍不了我 | Original post link

There are too many slow SQL queries, right?