The same query has a significant difference in execution time on different TiDB server nodes

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 同样的查询在不同的tidb server节点上查询时间差异很大

| username: 重启试试

[TiDB Usage Environment] Production Environment
[TiDB Version] 6.5.2
[Reproduction Path] What operations were performed when the issue occurred
[Encountered Issue: Issue Phenomenon and Impact]
[Resource Configuration] Enter TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]

Currently, each of the 3 physical machines is deployed with one TiDB, one PD, and one TiKV


Resource Configuration Information

Instance Information


The current node queries are much slower than the other two nodes


Checked the execution plan, and it is the same. How should I follow up to troubleshoot and locate the issue?

| username: h5n1 | Original post link

Check the explain analyze and trace results of different nodes.

| username: caiyfc | Original post link

You need to look at the actual execution plan. If 201 is executed for the first time, the result might be cached, and then when 202 and 203 are executed, they directly query from the cache, so they are much faster. The execution plan shown by explain analyze includes cache hit rates, which you can check.

| username: 我是咖啡哥 | Original post link

Are the results from multiple executions?
Also, check the leader region distribution for this table?

| username: 重启试试 | Original post link

Trace results



| username: 重启试试 | Original post link

The execution plan, if 201 is executed for the first time, the result might be cached, and then when 202 and 203 are executed, it will be direct.

Executed many times in no particular order. 201 is much slower than nodes 202 and 203.

| username: 重启试试 | Original post link

Yes, I executed it many times and the results are the same.

| username: h5n1 | Original post link

Looking at the trace, the parsing time for 201 is longer, and looking at the explain analyze, the index lookup operator time for 201 is longer. The time difference between the two results is quite significant. The hardware configuration of these nodes should be the same, right? Model type, whether NUMA binding is enabled, and whether CPU power-saving mode is turned off.

| username: 重启试试 | Original post link

The hardware configuration is the same, including the model and type, and whether NUMA binding is the same. How do you check the status of the CPU power-saving mode?

| username: redgame | Original post link

It’s the cross-node query that’s slow, right?

| username: h5n1 | Original post link

| username: 重启试试 | Original post link

All three machines are performance.

| username: zhanggame1 | Original post link

How about restarting the slow TiDB?

| username: h5n1 | Original post link

Is the current cluster version upgraded?

| username: 重启试试 | Original post link

Do you mean the TiDB server node that is slow to restart?

| username: 重启试试 | Original post link

Yes, upgraded from version 5.

| username: h5n1 | Original post link

select version(); Check the version of each TiDB.
| username: 重启试试 | Original post link

The results on the 3 tidbservers are the same.

| username: h5n1 | Original post link

Compare the differences between the three TiDB variables (show variables) and parameters (information_schema.CLUSTER_CONFIG type=‘tidb’).

| username: 重启试试 | Original post link

Apart from information like hostname and IP, everything else is the same.