TiDB Query Slow

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb查询慢

| username: TiDBer_iQ93gqWn

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version]
V5.0.0
[Reproduction Path] Operations performed that led to the issue
[Encountered Issue: Issue Phenomenon and Impact]
An issue occurred where business SQL that used to return in milliseconds now takes around 3 seconds. System query SQL is slow, with tp999 generally exceeding 2 seconds.
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page

[Attachments: Screenshots/Logs/Monitoring]
Internal System SQL


Business SQL

| username: 小龙虾爱大龙虾 | Original post link

Internal SQL does not need to be managed.

| username: DBAER | Original post link

System table queries are slow, generally you don’t need to worry about it. I remember that higher versions have optimized this, so system table queries are slow.

| username: tidb菜鸟一只 | Original post link

Are both TiDB-server nodes slow? It looks like your load balancing is very uneven…

| username: TiDBer_iQ93gqWn | Original post link

Business SQL situation


The host’s performance indicators also seem fine
image

| username: TiDBer_iQ93gqWn | Original post link

Using HAProxy to load balance between two TiDB instances, I restarted one TiDB node the day before yesterday. The node that was restarted, tp999, became normal, but the other tp999 node that wasn’t restarted started having issues again. I’m wondering if the requests were redirected to the other node due to the restart, and the restart didn’t solve the problem.

After restarting TiDB:

TiDB that wasn’t restarted:

| username: TiDBer_嘎嘣脆 | Original post link

You should test whether your proxy load balancing is configured evenly.

| username: TiDBer_iQ93gqWn | Original post link

Using the HAProxy configuration from the official documentation

listen tidb-cluster # Configure database load balancing.
bind 0.0.0.0:3306 # Floating IP and listening port.
mode tcp # HAProxy should use the 4th layer transport layer.
balance leastconn # The server with the fewest connections receives the connection first. leastconn is recommended for long-session services such as LDAP, SQL, TSE, etc., rather than short-session protocols like HTTP. This algorithm is dynamic, and the server weight will be adjusted during operation for slow-starting servers.
server tidb-1 192.168.1.2:4000 check inter 2000 rise 2 fall 3 # Check port 4000, with a check frequency of once every 2000 milliseconds. If 2 checks are successful, the server is considered available; if 3 checks fail, the server is considered unavailable.
server tidb-2 192.168.1.3:4000 check inter 2000 rise 2 fall 3

| username: tidb菜鸟一只 | Original post link

But from the execution plan of your SQL, it doesn’t seem to have any major issues. On which TiDB server did you execute this? You can find a slow SQL from the dashboard and then execute it on both TiDB servers separately to see if it’s a problem with the TiDB server.

| username: 友利奈绪 | Original post link

Apart from the system SQL, everything else is fine.

| username: TiDBer_iQ93gqWn | Original post link

There are other issues as well :joy: but they only occur occasionally. After restarting one of the tp999 over 2s alarms, the situation has somewhat improved.

| username: TiDBer_iQ93gqWn | Original post link

I tried executing with 2 TiDB instances manually, and both were very fast.

| username: Jasper | Original post link

What is the overall QPS of the cluster? Are there fluctuations in the 99th and 80th percentile durations?

| username: RyanHowe | Original post link

Could you provide a more complete execution info of the business SQL’s execution plan? Then check if there is any significant time consumption in the coprocessor reads and transactions under the dashboard. Also, ensure that the statistics of your TiDB cluster are up-to-date and try executing analyze.

| username: miya | Original post link

If it only happens occasionally, is it possible that it is caused by a sudden spike in traffic, creating a hotspot? You can analyze the pattern, check the specific business scenarios where the SQL is used, and conduct a detailed analysis.