TiDB Performance Suddenly Drops Significantly at a Certain Moment, All Select Request Latencies Increase

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB性能在某一时刻突然急剧下降,所有Select请求延时变高

| username: TiDBer_2tTuI9A5

[TiDB Usage Environment] Production Environment
[TiDB Version] 4.0.9
[Reproduction Path]
It’s not easy to reproduce because, according to the logs, the task volume of select requests at the time of the issue is similar to other times, with no sudden surge in business activity.

[Encountered Issue: Problem Phenomenon and Impact]
Impact: Select performance drops sharply, even though the selects are indexed. Generally, scanning 100 rows with an index takes about 0.5 seconds.
Phenomenon: Monitoring shows that the CPU of one TiKV node spikes sharply, the qps of kv_batch_get increases sharply, and the tikv_unified_read_pool_running_tasks metric on the high-CPU TiKV node suddenly increases.

After about 5 minutes, the cluster returns to normal, and the monitoring metrics return to normal without any other operations in between.

[Resource Configuration]
Physical machine, ample resources
[Attachments: Screenshots/Logs/Monitoring]

If more detailed monitoring is needed for further investigation, please reply to me promptly. Thank you, everyone.

| username: h5n1 | Original post link

First, check the slow SQL during this period.

| username: WalterWj | Original post link

The version is very outdated, I recommend upgrading.

| username: TiDBer_jYQINSnf | Original post link

Is your business SQL fixed? Is there any new business going online? Has it brought new unoptimized SQL?

| username: TiDBer_2tTuI9A5 | Original post link

There was no upgrade, it just happened suddenly this morning.

| username: TiDBer_2tTuI9A5 | Original post link

Hello, I checked the slow SQL queries, and they were all normal SQL queries before, with no issues at other times… But during the problematic period, the Cop time was unusually high. It usually takes 0.5s to resolve, but it can spike to 18-20s.

| username: Kongdom | Original post link

Take a look at the statement analysis on the dashboard to see if there are any statements that have been executed an unusually high number of times within a certain time period. It seems like there might be a hotspot.

| username: TiDBer_jYQINSnf | Original post link

Take a look at the overall QPS, has it increased or decreased? One of the TiKV’s gRPC has increased a lot. If the overall QPS hasn’t increased, that’s a bit abnormal.

| username: h5n1 | Original post link

If the SQL hasn’t changed, check if the execution plan has changed. Post some SQL information and check the slow.log.

| username: TiDBer_2tTuI9A5 | Original post link

The overall QPS on the monitoring has increased, but it should be due to the slower processing speed leading to the increase in metrics. According to the business logs, the QPS hasn’t changed much.

| username: TiDBer_jYQINSnf | Original post link

It is probably due to a change in the SQL execution plan that caused the table scan. Seeing that the coprocessor response is so slow, it is likely due to the large scan range. Check the SQL that you said was usually fast but is now slow, and see if the execution plan meets expectations. If it is caused by a change in the execution plan, look for the binding method in the official manual.

| username: ffeenn | Original post link

Sometimes it may not necessarily be caused by slow SQL. Is there any anomaly in the Overview panel?

| username: 裤衩儿飞上天 | Original post link

Your QPS spiked at that time :face_with_raised_eyebrow:

| username: db_user | Original post link

Take a look at analyze-related;
show analyze status;
Check if there are any failure statuses.
show variables like ‘%analyze%’ to see the time period for automatic analyze;
Search for the keyword “data too long” in tidb.log to see if there are any tables with such occurrences.

| username: tony5413 | Original post link

I didn’t see the logs.