TiDB cluster load is uneven and all of it goes to one machine: which parameters can be adjusted to spread it out?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb集群负载均衡不平均,会全部往一台机器上负载,如何调整参数能分摊呢?

| username: TiDBer_vJpITQ5J

Load in my TiDB cluster is unevenly balanced: all of it goes to one machine. Which parameters can I adjust to spread it out?

The cluster architecture is 3 PD, 3 TiDB, 3 TiKV, and 1 TiFlash. In production, all the load suddenly shifted to the third machine, and its CPU spiked to around 90% while the CPUs of the other two servers stayed normal. All three TiKV instances restarted automatically. Should I put the ports of all three TiDB instances in the configuration file the application uses to connect to the database, so that the load is distributed?

| username: magic | Original post link

We use HAProxy for TiDB load balancing.
Documentation: Best Practices for Using HAProxy in TiDB | PingCAP Docs
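
For illustration, here is a minimal Python sketch of the least-connections idea that a proxy such as HAProxy can apply (HAProxy calls this balance mode `leastconn`). It is not HAProxy's implementation. The `Backend` class, the addresses, and the starting connection counts are all made up for the example; 4000 is TiDB's default SQL port.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """One TiDB server behind the proxy (addresses here are placeholders)."""
    address: str
    active: int = 0  # connections currently open to this backend


def pick_least_loaded(backends):
    """Return the backend with the fewest active connections.

    min() returns the first minimum, so ties go to the earliest entry.
    """
    return min(backends, key=lambda b: b.active)


def open_connection(backends):
    """Route one new client connection and record it on the chosen backend."""
    chosen = pick_least_loaded(backends)
    chosen.active += 1
    return chosen


# Hypothetical starting state: the third server already holds most connections.
backends = [
    Backend("10.0.1.1:4000", active=5),
    Backend("10.0.1.2:4000", active=5),
    Backend("10.0.1.3:4000", active=50),
]

# Route 30 new connections; none should reach the already busy third server.
for _ in range(30):
    open_connection(backends)

counts = {b.address: b.active for b in backends}
print(counts)
```

In this toy run the 30 new connections are split between the two idle servers, which is why least-connections tends to suit long-lived database connections better than a fixed rotation.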

| username: wzf0072 | Original post link

Data has only just started to be inserted, so the cluster is still warming up. The load may be unbalanced at this stage.

| username: TiDBer_vJpITQ5J | Original post link

I created 2000 login accounts, which directly caused the three TiKV nodes to restart and CPU usage to spike. I'm not sure whether the HAProxy setup suggested above will work, but I'll set it up and give it a try.

| username: TiDBer_vJpITQ5J | Original post link

I’ll give it a try and see the effect.

| username: wzf0072 | Original post link

TiDB High Concurrency Write Best Practices: Methods to Avoid Hotspot Issues

| username: 考试没答案 | Original post link

You can also set up an SLB in a cloud environment, or use a hardware load balancer such as F5.

| username: 考试没答案 | Original post link

Even hardware may not achieve perfectly even load balancing, and with software you should not aim for a perfectly even distribution either.

| username: maokl | Original post link

Nginx can achieve balanced connections to computing nodes using round-robin scheduling.
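
As a rough picture of what round-robin scheduling does, the Python sketch below hands out new connections to three TiDB servers in a fixed rotation and counts the result. It is only a simulation, not Nginx configuration. The addresses are placeholders, and the figure of 2000 connections is borrowed from the 2000 logins mentioned earlier in the thread.

```python
from collections import Counter
from itertools import cycle

# Placeholder addresses for the three TiDB servers (4000 is TiDB's default SQL port).
TIDB_ENDPOINTS = ["10.0.1.1:4000", "10.0.1.2:4000", "10.0.1.3:4000"]


def round_robin(endpoints):
    """Return an iterator that yields endpoints in a repeating fixed order."""
    return cycle(endpoints)


def assign(num_connections, endpoints):
    """Count how many of num_connections each endpoint receives."""
    picker = round_robin(endpoints)
    return Counter(next(picker) for _ in range(num_connections))


distribution = assign(2000, TIDB_ENDPOINTS)
print(distribution)
```

The per-server counts differ by at most one connection. Note that this only evens out connection counts on the TiDB layer; it does not by itself remove a hotspot on the storage layer.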

| username: tidb菜鸟一只 | Original post link

This is a mixed deployment, right? It looks like you have only three servers hosting 3 PD, 3 TiDB, 3 TiKV, and 1 TiFlash nodes.

| username: TiDBer_vJpITQ5J | Original post link

What is a mixed deployment?

| username: tidb菜鸟一只 | Original post link

Which nodes have you deployed on each server?

| username: ljluestc | Original post link

Adjust TiDB settings: TiDB server nodes are stateless and do not rebalance client connections among themselves, so connections have to be spread by a load balancer or by the client. You can still limit how much work each node accepts by modifying the config.toml configuration file for each TiDB node. Specifically, max-server-connections caps the number of client connections per node, and tikv-client.max-batch-size controls the maximum number of RPC requests sent to TiKV in one batch.

Adjust TiKV settings: Balancing of data and leaders across TiKV nodes is handled by PD scheduling. Per-node read concurrency can be tuned by modifying the tikv.toml configuration file for each TiKV node. Specifically, the readpool.coprocessor.xxx parameters control the concurrency of the coprocessor thread pool and the number of tasks allowed per worker thread.

Monitor cluster metrics: You can use monitoring tools such as Prometheus and Grafana to watch cluster metrics and identify imbalances or bottlenecks. Specifically, monitor CPU and memory usage, disk I/O, network traffic, and other key metrics for each node to determine which nodes are under the most load.
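
Once per-node readings are available, spotting the imbalance can be as simple as comparing each node with the cluster average. The sketch below assumes the CPU percentages have already been collected into a dictionary. The node names, the sample values, and the 1.5x threshold are illustrative assumptions, not TiDB recommendations.

```python
from statistics import mean


def find_overloaded(cpu_by_node, ratio=1.5):
    """Return nodes whose CPU usage exceeds `ratio` times the cluster average.

    `ratio` is an arbitrary illustrative threshold, not a TiDB recommendation.
    """
    avg = mean(cpu_by_node.values())
    return sorted(node for node, cpu in cpu_by_node.items() if cpu > ratio * avg)


# Made-up readings (percent CPU) shaped like the situation in this thread.
sample = {"server-1": 20.0, "server-2": 25.0, "server-3": 90.0}
overloaded = find_overloaded(sample)
print(overloaded)
```

With these sample values the average is 45%, so only server-3 crosses the 67.5% cutoff.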

Consider scaling the cluster: If adjusting load balancing settings and parameters does not resolve the issue, you may need to consider scaling the cluster by adding more nodes or increasing the resources of existing nodes. You can use tools like TiUP to manage the scaling process and ensure it is done safely and efficiently.

In addition to these steps, you may also want to investigate why all the traffic suddenly shifted to one machine, causing the CPU spike. This could be due to various factors such as network issues, software bugs, or unexpected traffic surges. Investigating the root cause of the issue can help prevent it from happening again in the future.

| username: dba-kit | Original post link

It seems that it might be caused by a hotspot in TiKV. Check if the table structure design is unreasonable and if the queries are concentrated on a few regions?
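
To show why table design can concentrate traffic on a few regions, here is a toy Python model of range-based data placement. TiKV splits data into Regions by key range, so monotonically increasing keys (for example an auto-increment primary key) keep landing in the same range, while keys with scattered high bits spread out. TiDB's SHARD_ROW_ID_BITS and AUTO_RANDOM options scatter row IDs in a broadly similar way. The 16-bit key space, the split points, and the function names below are invented for the example and do not reflect TiKV's real key encoding.

```python
import bisect
import random
from collections import Counter

KEY_BITS = 16                          # toy key space: 0 .. 65535
SPLIT_POINTS = [16384, 32768, 49152]   # hypothetical boundaries -> 4 ranges


def region_of(key):
    """Return the index of the key range ("Region") that owns `key`."""
    return bisect.bisect_right(SPLIT_POINTS, key)


def sequential_keys(n, start=50000):
    """Monotonically increasing keys, like an auto-increment primary key."""
    return [start + i for i in range(n)]


def sharded_keys(n, shard_bits=2, seed=0):
    """Same sequence numbers, with random bits placed in the top of the key."""
    rng = random.Random(seed)
    low_bits = KEY_BITS - shard_bits
    return [(rng.randrange(1 << shard_bits) << low_bits) | i for i in range(n)]


def writes_per_region(keys):
    """Count how many keys fall into each range."""
    return Counter(region_of(k) for k in keys)


seq = writes_per_region(sequential_keys(1000))
shard = writes_per_region(sharded_keys(1000))
print("sequential:", dict(seq))
print("sharded:   ", dict(shard))
```

In the sequential case all 1000 writes fall into the last range, which is the hotspot pattern described above; with the scattered prefix the writes are spread over all four ranges.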
