Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: tidb主键查询时快时慢 (TiDB primary key query is sometimes fast, sometimes slow)
【TiDB Usage Environment】Production environment
【TiDB Version】v4.0.8
【Encountered Problem: Phenomenon and Impact】
In the production TiDB cluster, the same SQL statement (a primary key lookup with only one possible execution plan) normally finishes in milliseconds, but it sometimes spikes to 5-6 seconds, and occasionally to more than 10 seconds. The table health is 97.
【Resource Configuration】56-core CPU, 192 GB memory, NVMe disks, 70 TiKV nodes
【Attachments: Screenshots/Logs/Monitoring】
Normal execution plan:
Execution plan when execution time is long:
Please advise, experts.
In the dashboard, look for the actual execution plan, execution time, and transaction statistics for slow SQL queries.
This cluster does not have the dashboard enabled. The execution plan shown in the image is the actual execution plan taken from the slow query log.
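Since the dashboard is off, the slow query log can also be queried through SQL. A minimal sketch, assuming the v4.0 information_schema.cluster_slow_query table; the time range and table-name filter are placeholders:

```sql
-- Pull recent slow executions of the statement from the slow query table.
-- The time window and the LIKE filter are placeholders to adjust.
SELECT time, query_time, cop_time, process_time, wait_time,
       backoff_time, backoff_types, plan
FROM information_schema.cluster_slow_query
WHERE time BETWEEN '2022-01-01 10:00:00' AND '2022-01-01 11:00:00'
  AND query LIKE '%your_table%'        -- placeholder table name
ORDER BY query_time DESC
LIMIT 10;
```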
Check the cluster monitoring for the time window when the query slows down and see whether anything stands out. First determine whether the problem is on the TiDB side or the TiKV side, then keep narrowing it down from there.
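If it is acceptable to re-run the statement, EXPLAIN ANALYZE is a quick way to tell the two sides apart: it executes the query and reports per-operator timing, so you can see whether the time is spent inside TiDB or in the TiKV request. A sketch with a placeholder table and key value:

```sql
-- Runs the statement and prints actual execution info (time, loops, cop/RPC
-- details). Table `t` and the key value are placeholders.
EXPLAIN ANALYZE SELECT * FROM t WHERE id = 123456;
```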
Looking at the execution plan you posted, TiKV went through multiple backoff retries during execution (the loops:3 step in the plan), which pushed the latency from 194 ms up to 5 s. That is where the extra access delay comes from, so the troubleshooting direction is to look at what TiKV was doing at that time: check the cluster for network delays, GC activity, lock conflicts, write hotspots, and so on.
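Of the items above, GC state is easy to rule out from SQL: the mysql.tidb table records the GC leader, the last run time, and the safe point. A small sketch:

```sql
-- GC bookkeeping kept by TiDB: leader, last run time, safe point, life time.
SELECT variable_name, variable_value
FROM mysql.tidb
WHERE variable_name LIKE 'tikv_gc%';
```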
tikvRPC_backoff: after an RPC request fails, TiDB waits out the backoff time and retries. The execution info records the backoff type (such as regionMiss or tikvRPC), the total backoff wait time (total_time), and the total number of backoffs (num).
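To see which backoff types dominate for this statement, the slow query table can be aggregated on those fields. A sketch, assuming v4.0 column names; the digest value is a placeholder taken from the slow log:

```sql
-- Group slow executions of one statement by backoff type.
-- backoff_time and query_time are reported in seconds.
SELECT backoff_types,
       COUNT(*)          AS executions,
       AVG(backoff_time) AS avg_backoff_sec,
       MAX(query_time)   AS max_query_sec
FROM information_schema.cluster_slow_query
WHERE time > NOW() - INTERVAL 1 DAY
  AND digest = '<statement_digest>'    -- placeholder digest
GROUP BY backoff_types;
```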
You can check the TiDB and TiKV logs to see whether there were any read-write conflicts during the corresponding time period.
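If grepping the log files on every node is inconvenient, the cluster_log table in v4.0 can search them from SQL. A sketch; the time window and keyword are placeholders:

```sql
-- Search TiDB logs across the cluster for write-conflict messages in the
-- window when the query slowed down.
SELECT time, instance, level, message
FROM information_schema.cluster_log
WHERE type = 'tidb'
  AND time BETWEEN '2022-01-01 10:00:00' AND '2022-01-01 10:30:00'
  AND message LIKE '%write conflict%';
```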
You can also check the I/O; I have encountered issues caused by I/O fluctuations before.
In this situation, I think there might be packet loss in the network. Are there similar phenomena with other statements?
Another possibility is that the data distribution is uneven.
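The table's region distribution can be checked directly; a sketch with a placeholder table name:

```sql
-- Lists the table's regions and their leader stores; if most leaders sit on
-- a handful of stores, the data (or at least the leaders) is skewed.
SHOW TABLE t REGIONS;
```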
Compare the execution plans, compare resource usage, and check I/O and locks (QPS and overall database load) between the fast and slow periods.
If machine performance is not the issue, check whether it is a data hotspot problem. Is the ID random or auto-incrementing? The table's regions might not be evenly distributed across the TiKV nodes.
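With an auto-increment primary key, new writes keep landing on the table's last region, and reads on recent rows can hit the same hot region. The regions PD currently considers hot can be checked from information_schema; a sketch with a placeholder table name:

```sql
-- Regions currently reported as hot for reads/writes; check whether any of
-- them belong to this table. `t` is a placeholder.
SELECT db_name, table_name, index_name, type, max_hot_degree, region_count
FROM information_schema.tidb_hot_regions
WHERE table_name = 't';
```

If a hotspot is confirmed, common mitigations are splitting/scattering the hot region or, for new tables, avoiding a purely sequential key (for example AUTO_RANDOM).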
This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.