TiDB Load Imbalance

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb负载不均衡

| username: TiDBer_2CPYnSMQ

During high-concurrency stress testing, the load across three TiDB instances is unbalanced. One instance consistently runs at close to 100% CPU usage, while the other two stay between 50% and 70%.


| username: TiDBer_jYQINSnf

Is the imbalance in TiDB or in PD? The title says PD, but the content describes TiDB.
If it is TiDB, check whether the number of connections on each node is the same. The imbalance is most likely related to the load balancer in front of TiDB.

| username: TiDBer_2CPYnSMQ

The title was wrong. The problem is TiDB load imbalance.

| username: TiDBer_jYQINSnf

About the number of connections, think of it this way: if one TiDB node has 10 connections and another has 5, the first will naturally be busier than the second. The imbalance comes purely from the different amount of requests each node is handling. Check the load balancer that sits in front of TiDB, and check the application.
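
To make the arithmetic concrete, here is a minimal Python sketch that turns per-node connection counts into shares of the total. The node names and the `connection_shares` helper are made up for illustration, and the counts are the 10 and 5 from the example above. In a real cluster the counts would have to come from your own monitoring.

```python
def connection_shares(counts):
    """Return each node's fraction of the total connections.

    `counts` maps a node name (hypothetical labels here) to its
    current number of client connections.
    """
    total = sum(counts.values())
    if total == 0:
        # No connections anywhere: every node has a zero share.
        return {node: 0.0 for node in counts}
    return {node: n / total for node, n in counts.items()}


# The example from this reply: one node with 10 connections, another with 5.
shares = connection_shares({"tidb-1": 10, "tidb-2": 5})
for node, share in sorted(shares.items()):
    print(f"{node}: {share:.0%} of connections")
```

If every connection sends roughly the same traffic, the node holding two thirds of the connections receives about twice the work of the other one, which is consistent with one node being much busier.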

| username: TiDBer_2CPYnSMQ

Are you suggesting that I check the load balancing in front of PD in TiDB?

| username: TiDBer_jYQINSnf

TiDB is not in front of PD.

Application ---> Load Balancer ---> TiDB ---> TiKV
                                     |
                                     v
                                     PD

This is the relationship. The key is to look at your load balancer. PD is only the component that TiDB nodes ask for Region locations and that allocates TSO. It has nothing to do with which TiDB node a connection lands on.

| username: 啦啦啦啦啦

First, check whether the number of connections is balanced across the TiDB nodes. If it is, some large SQL queries may be hitting the busy TiDB node and causing its high CPU usage. If it is not, check whether the load balancer's strategy is configured correctly.
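
The order of checks described here can be sketched as a small decision function. This is only an illustration: the `triage` function, the 20% tolerance, and the sample numbers are assumptions, not TiDB settings or measurements from this thread.

```python
def triage(connections, tolerance=0.2):
    """Suggest the next thing to check, given connections per TiDB node.

    `tolerance` is an arbitrary illustrative threshold: counts are treated
    as balanced when every node is within 20% of the mean.
    """
    if not connections:
        raise ValueError("need at least one node")
    mean = sum(connections.values()) / len(connections)
    balanced = all(abs(n - mean) <= tolerance * mean for n in connections.values())
    if balanced:
        return "connections balanced: look for large SQL on the busy node"
    return "connections unbalanced: check the load balancer strategy"


# Hypothetical counts for three TiDB nodes.
print(triage({"tidb-1": 100, "tidb-2": 98, "tidb-3": 103}))
print(triage({"tidb-1": 200, "tidb-2": 60, "tidb-3": 55}))
```

The first call reports balanced connections and the second reports unbalanced ones. The threshold is a judgment call and would need tuning for a real workload.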

| username: alfred

First, on the machine where CPU usage is close to 100%, check which processes are consuming the most CPU. Then compare the number of connections on each TiDB server, and after that analyze the SQL. Adjust step by step to improve the application's concurrency and throughput.