Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: tidb负载不均衡 (TiDB load imbalance)
During high-concurrency stress testing, the load across three TiDB instances is unbalanced. One instance's CPU usage stays close to 100%, while the other two sit between 50% and 70%.
[TiDB Environment] Production, Testing, Research
[TiDB Version]
[Encountered Problem]
[Reproduction Steps] What operations were performed that led to the issue
[Problem Phenomenon and Impact]
[Attachments]
Please provide the version information of each component, such as cdc/tikv, which can be obtained by executing cdc version/tikv-server --version.
Is the imbalance in TiDB or in PD? The title says PD, but the post describes TiDB.
If it's a TiDB imbalance, check whether the connection counts on each instance are the same. This is most likely related to the load balancer in front of TiDB.
The title was wrong; the issue is TiDB load imbalance.
Regarding the number of connections, think about it: if one TiDB node has 10 connections and another has 5, of course one will be busier than the other. The imbalance comes purely from the different requests each node is handling. Check the load balancing between the application and TiDB.
Are you suggesting checking the load balancing in front of PD in TiDB?
TiDB is not in front of PD.
Application ---> Load Balancer ---> TiDB ---> TiKV
                                     |
                                     v
                                     PD
This is the relationship. The key is to look at your load balancing. PD is only the component that TiDB nodes ask for Region locations and TSO allocation; it has nothing to do with which TiDB node a connection lands on.
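To make the point about the load balancer concrete, here is a rough simulation of how the balancing strategy alone decides how connections spread over the TiDB nodes. The node names, client addresses, and pool sizes are made up for illustration, and the two strategies are simplified stand-ins rather than the behavior of any particular load balancer product.

```python
import hashlib
from collections import Counter
from itertools import cycle

TIDB_NODES = ["tidb-1", "tidb-2", "tidb-3"]  # hypothetical instance names


def round_robin(clients, nodes):
    """Assign each new connection to the next node in turn."""
    picker = cycle(nodes)
    return Counter(next(picker) for _ in clients)


def source_hash(clients, nodes):
    """Assign each connection by hashing the client address (sticky routing)."""
    counts = Counter({node: 0 for node in nodes})
    for client in clients:
        digest = hashlib.md5(client.encode()).digest()
        index = int.from_bytes(digest[:4], "big") % len(nodes)
        counts[nodes[index]] += 1
    return counts


# Two hypothetical application servers, each opening 15 pooled connections.
clients = ["10.0.0.1"] * 15 + ["10.0.0.2"] * 15

print("round robin:", dict(round_robin(clients, TIDB_NODES)))
print("source hash:", dict(source_hash(clients, TIDB_NODES)))
```

With only two client addresses, source hashing can reach at most two of the three nodes, so at least one node receives no connections at all, while round robin spreads the same 30 connections evenly. A stress-test tool running from one or two machines is exactly this kind of setup.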
First, check if the number of connections is balanced. If the connections are balanced, it might be that some large SQL queries are hitting this TiDB node, causing high CPU usage. If the connections are not balanced, check if the load balancing configuration strategy is correct.
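The check described above can be sketched as a small script. It assumes the connection list comes from the `INFORMATION_SCHEMA.CLUSTER_PROCESSLIST` table (one row per session, with an `INSTANCE` column), which should be verified against your TiDB version. The instance names, sample rows, and the 1.2 threshold are illustrative assumptions, not recommended values.

```python
from collections import Counter

# Assumed query; run it on any TiDB node and feed the INSTANCE values in.
QUERY = "SELECT INSTANCE FROM INFORMATION_SCHEMA.CLUSTER_PROCESSLIST"


def connection_skew(rows, expected_instances):
    """Count connections per instance and return (counts, max/mean ratio)."""
    # Start from zero so an instance with no sessions is still counted.
    counts = Counter({name: 0 for name in expected_instances})
    counts.update(rows)
    mean = sum(counts.values()) / len(counts)
    skew = max(counts.values()) / mean if mean else 0.0
    return dict(counts), skew


def next_step(skew, threshold=1.2):
    """Map the skew to the next thing to look at (threshold is arbitrary)."""
    if skew > threshold:
        return "connections unbalanced: review the load balancer strategy"
    return "connections balanced: look for heavy SQL on the hot node"


# Hypothetical sample: one node holds twice as many sessions as the others.
nodes = ["tidb-1", "tidb-2", "tidb-3"]
rows = ["tidb-1"] * 10 + ["tidb-2"] * 5 + ["tidb-3"] * 5

counts, skew = connection_skew(rows, nodes)
print(counts, round(skew, 2), next_step(skew))
```

Note that the connection count is only a first filter: with long-lived pooled connections, two nodes can hold the same number of sessions while one of them serves far more queries.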
First, check which processes are consuming the most CPU on the machine whose CPU is close to 100%. Then compare the number of connections on each TiDB server and analyze the SQL. Tune step by step to improve business concurrency and throughput.
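For the SQL analysis step, one possible approach is to rank the statements that ran on the hot node by total latency. The sketch below assumes the rows come from `INFORMATION_SCHEMA.CLUSTER_STATEMENTS_SUMMARY` with its `INSTANCE`, `DIGEST_TEXT`, and `SUM_LATENCY` columns; check that these exist in your version. The sample rows and instance names are invented, and total latency is used only as a rough proxy for CPU cost.

```python
from collections import defaultdict

# Assumed query; the column names should be checked against your TiDB version.
QUERY = (
    "SELECT INSTANCE, DIGEST_TEXT, SUM_LATENCY "
    "FROM INFORMATION_SCHEMA.CLUSTER_STATEMENTS_SUMMARY"
)


def top_statements(rows, hot_instance, limit=3):
    """Rank statements on hot_instance by total latency.

    Returns (digest_text, total_latency, share_of_node_latency) tuples.
    """
    per_statement = defaultdict(int)
    for instance, digest_text, sum_latency in rows:
        if instance == hot_instance:
            per_statement[digest_text] += sum_latency
    total = sum(per_statement.values())
    if total == 0:
        return []
    ranked = sorted(per_statement.items(), key=lambda kv: kv[1], reverse=True)
    return [(text, latency, latency / total) for text, latency in ranked[:limit]]


# Hypothetical sample rows: (instance, digest_text, sum_latency).
sample_rows = [
    ("tidb-1", "select * from `orders` where `user_id` = ?", 600),
    ("tidb-1", "select count ( ? ) from `orders`", 300),
    ("tidb-1", "update `stock` set `qty` = ? where `id` = ?", 100),
    ("tidb-2", "select * from `orders` where `user_id` = ?", 200),
]

for text, latency, share in top_statements(sample_rows, "tidb-1"):
    print(f"{share:.0%}  {latency}  {text}")
```

If one or two statements account for most of the latency on the busy node but barely appear on the others, the imbalance is more likely caused by where those statements are routed than by the raw connection count.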