TiDB Server connections are balanced, but CPS distribution and CPU usage are uneven

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB Server 连接均匀,但是CPS分布不均匀以及CPU不均匀

| username: 开心大河马

【TiDB Usage Environment】Production Environment
【TiDB Version】
【Reproduction Path】No obvious variables changed, just an increase in access volume
【Encountered Problem: Phenomenon and Impact】
Only one TiDB server's CPU usage is noticeably high; it keeps spiking and runs about 20% higher than the other nodes.
【Resource Configuration】
All TiDB servers are 16-core, 32 GB virtual machines; the TiKV nodes are physical machines.

HAProxy monitoring: TiDB server 37's connection count looks similar to the other nodes.

The CPS of the TiDB server shows a difference:

The CPU shows a significant difference, almost double:

I just want to ask: how should I interpret this, and why is there such a big difference?

| username: 有猫万事足 | Original post link

You can try enabling Top SQL to see whether the Top SQL of this TiDB instance differs significantly from the other instances. If there is a significant difference, look at it from the HAProxy load-distribution side; if there is no significant difference, then you need to look at the hardware of this instance.
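
For reference, Top SQL is toggled with a system variable and then viewed per TiDB instance in TiDB Dashboard; a minimal sketch, assuming a TiDB version that supports the tidb_enable_top_sql variable:

```sql
-- Enable Top SQL data collection (assumes a version that ships tidb_enable_top_sql).
SET GLOBAL tidb_enable_top_sql = ON;
-- The per-instance breakdown is then available under TiDB Dashboard -> Top SQL.
```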

| username: 开心大河马 | Original post link

In Top SQL, node 37's CPU usage is much higher than the other nodes. Even the normally running SQL statements consume more CPU there than elsewhere, and there is one additional SQL statement accounting for 9%.

At 37:

Other nodes:

| username: 开心大河马 | Original post link

Is the distribution inherently uneven? The underlying commands also seem to show a disparity, with this node handling more than the others.

| username: 有猫万事足 | Original post link

The SQL types on server 37 don't seem to differ much from the other servers; the top 3 are almost the same. However, the cumulative CPU time is longer than on the other servers, so I still lean towards server 37 itself being slower.

For each selected SQL type, the following information is also shown below it:

Check the call/sec and latency/call for the same type of SQL and compare them with the other servers. If the call/sec is similar but the latency/call is longer, it means server 37 is slower at executing the same work.

If the call/sec is already significantly higher than on the other servers, it suggests that HAProxy is sending more SQL to this machine. Alternatively, consider whether some application is accessing this TiDB instance directly without going through HAProxy.
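
To compare these figures outside the Top SQL page, here is a rough sketch against the cluster statement summary tables (assuming statement summary is enabled on this version; the digest is one of the two discussed later in this thread, and latencies in these tables are reported in nanoseconds):

```sql
-- Per-instance execution count and average latency for one SQL digest.
-- Substitute the digest of the statement you want to compare.
SELECT instance,
       SUM(exec_count)                  AS total_execs,
       ROUND(AVG(avg_latency) / 1e6, 2) AS avg_latency_ms
FROM information_schema.cluster_statements_summary
WHERE digest = '7b78200c38238d611ea8bfc1be449427d69e8360d832127fdab452f6ffb4cc3d'
GROUP BY instance
ORDER BY instance;
```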

| username: 开心大河马 | Original post link

Currently, it looks like this node has some issue. The SQL being run and the execution plans are consistent with the other nodes, but its call/sec is significantly higher, with roughly 1 ms of extra latency per call. It's not clear yet whether there's a problem with request distribution; HAProxy appears to distribute connections evenly rather than by active load. We tried restarting the TiDB server on this node yesterday, but the situation remains the same, with this node still showing higher values. We will check the host layer to see if there are any issues.

Node 37 (highest CPU usage):
- First SQL, sqlid 7b78200c38238d611ea8bfc1be449427d69e8360d832127fdab452f6ffb4cc3d: call/sec 266.6, latency/call 6.2 ms
- Second SQL, sqlid 3a725d4f20843ff3eb0ff8fdeb3b7df463071ba5e31db40b26c2cc6c04b68cb7: call/sec 266.8, latency/call 6.1 ms

Node 35 (average CPU usage):
- First SQL, sqlid 7b78200c38238d611ea8bfc1be449427d69e8360d832127fdab452f6ffb4cc3d: call/sec 206.4, latency/call 5.1 ms
- Second SQL, sqlid 3a725d4f20843ff3eb0ff8fdeb3b7df463071ba5e31db40b26c2cc6c04b68cb7: call/sec 195.2, latency/call 5.3 ms

Node 36 (lower CPU usage):
- First SQL, sqlid 7b78200c38238d611ea8bfc1be449427d69e8360d832127fdab452f6ffb4cc3d: call/sec 127.4, latency/call 5.0 ms
- Second SQL, sqlid 3a725d4f20843ff3eb0ff8fdeb3b7df463071ba5e31db40b26c2cc6c04b68cb7: call/sec 116.3, latency/call 5.1 ms

| username: 有猫万事足 | Original post link

From this perspective, it really seems like 37 has too many tasks assigned.

You can check the INFORMATION_SCHEMA.CLUSTER_PROCESSLIST table to see which hosts are connecting to each server and whether that changes over time.
If all connections come through HAProxy and the PROXY protocol is not enabled on HAProxy, the HOST field should always show HAProxy's own address.

If that holds, it can basically be confirmed that this is an HAProxy allocation issue, or that some other application is connecting to 37 directly without going through HAProxy.
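
A minimal sketch of such a check is below (the original poster shares their own, more specific version of this query in a later reply):

```sql
-- Count sessions per client address on each TiDB instance.
-- If everything really goes through HAProxy (without the PROXY protocol),
-- each instance should show only HAProxy's address here.
SELECT instance,
       SUBSTRING_INDEX(host, ':', 1) AS client_host,
       COUNT(*) AS sessions
FROM information_schema.cluster_processlist
GROUP BY instance, client_host
ORDER BY instance, sessions DESC;
```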

| username: 开心大河马 | Original post link

At present, 37 has been moved to another host machine. Its CPU consumption is indeed higher than the other nodes. After moving it twice, ending up on the host with the lowest load, the CPU usage has dropped, but the CPS has not. When checking Top SQL afterwards, its call/sec is still higher than the other nodes.

That might be because it's an off-peak period, so the difference is not very pronounced right now. Also, although the call/sec is higher, the latency/call is consistent with the other nodes, around 5 ms.

As for HAProxy, only the overall entry VIP is exposed externally. Currently, each application basically has connections on every node. What we can see is that the total number of connections HAProxy reports for each TiDB server is consistent, but ad-hoc checks of concurrent activity on each node differ from time to time; 37 is still relatively high, and it's unclear why. There is only one application user (besides root).

```sql
select instance,
       substring(host, 1, length(host) - length(substring_index(host, ':', -1)) - 1) as host,
       count(substring(host, 1, length(host) - length(substring_index(host, ':', -1)) - 1)) as countsum
from information_schema.cluster_processlist
where info is not null and user = 'xxxx'
group by instance, substring(host, 1, length(host) - length(substring_index(host, ':', -1)) - 1)
order by instance
limit 100;
```

| username: 有猫万事足 | Original post link

Judging from the HAProxy best practices, this balancing strategy mainly ensures that the number of connections is balanced.
My guess is that TiDB has many connections, but the connections coming from 64, 63, and 90 are relatively busy, while the connections from the other hosts may be relatively idle.
So under this allocation strategy the connections are balanced, but the load is not.
If resources permit, you can try putting the busier services behind one set of HAProxy instances and the idle ones behind another set; that may make things more balanced.

I've also attached the HAProxy documentation; you can look for a better balancing strategy there as well.
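
As one example of an alternative strategy, here is a minimal, hypothetical sketch of a TiDB backend using leastconn instead of roundrobin (the listener port, server names, and addresses are placeholders, not this cluster's actual configuration). Note that leastconn only balances the number of active connections at connect time; it still cannot see how busy each individual connection is:

```
listen tidb-cluster
    bind 0.0.0.0:3390
    mode tcp
    # leastconn sends each new connection to the backend server with the
    # fewest active connections, instead of strictly rotating.
    balance leastconn
    server tidb-35 10.0.0.35:4000 check inter 2000 rise 2 fall 3
    server tidb-36 10.0.0.36:4000 check inter 2000 rise 2 fall 3
    server tidb-37 10.0.0.37:4000 check inter 2000 rise 2 fall 3
```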

| username: 开心大河马 | Original post link

Thank you very much. I will keep observing. Your explanation is very clear and has addressed my problem well. :+1::+1::+1:

We are using HAProxy 2.6, and the relevant balance configuration section is here; I will review and test it myself.
https://docs.haproxy.org/2.6/configuration.html#4.2

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.