High CPU Usage on a Single TiDB Server Node

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidbserver单个节点cpu使用很高

| username: Jolyne

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version]
[Reproduction Path] What operations were performed that caused the issue
The CPU usage of a single tidb-server in the cluster is very high. We have HAProxy in front of the cluster, which in theory should balance the load.
[Encountered Issue: Problem Phenomenon and Impact]
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]

| username: Jasper | Original post link

Are the connections balanced? First, confirm whether HAProxy is working properly and whether the connections are evenly distributed across the two TiDB servers.
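One way to check this is to count sessions per tidb-server. The sketch below assumes the rows come from `INFORMATION_SCHEMA.CLUSTER_PROCESSLIST` (its `INSTANCE` and `HOST` columns); the instance names, the helper functions, and the 20% tolerance are hypothetical and only for illustration.

```python
from collections import Counter

# Assumed query; run it with any MySQL-compatible client against TiDB.
CONNECTIONS_SQL = """
SELECT INSTANCE, HOST
FROM INFORMATION_SCHEMA.CLUSTER_PROCESSLIST
"""

def connections_per_instance(rows):
    """Count sessions per tidb-server from (instance, client_host) rows."""
    return Counter(instance for instance, _client_host in rows)

def is_balanced(counts, tolerance=0.2):
    """True if every instance is within `tolerance` (a fraction) of the mean."""
    if not counts:
        return True
    mean = sum(counts.values()) / len(counts)
    return all(abs(n - mean) <= tolerance * mean for n in counts.values())

# Hypothetical rows, as they might be fetched with CONNECTIONS_SQL.
rows = [
    ("tidb-1:10080", "haproxy-host:40001"),
    ("tidb-1:10080", "haproxy-host:40002"),
    ("tidb-2:10080", "haproxy-host:40003"),
    ("tidb-2:10080", "haproxy-host:40004"),
]
counts = connections_per_instance(rows)
print(dict(counts), is_balanced(counts))
```

If every client host in the result is the HAProxy machine, no connection is bypassing the proxy.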

| username: 像风一样的男子 | Original post link

Top SQL shows it clearly: these few SQL statements are slow.

| username: Jolyne | Original post link

It is balanced.

| username: Jolyne | Original post link

I have a question: why do the slow SQL queries all run on one of the machines? Logically, the other machine should also see slow SQL queries with long execution times. I extended the time range, but still only one machine shows high usage.

| username: Jasper | Original post link

Most of the SQL is probably similar on both nodes, but these few SELECT queries scan a large amount of data, use more CPU, and run only on the 55 tidb-server. You can use the slow query log to find out where this SQL comes from, and whether there is a separate connection that bypasses HAProxy and connects directly to the 55 tidb-server to run some large SQL queries.

| username: Jolyne | Original post link

The number of slow SQL queries found is about the same.

Grouped by digest, there are 186 similar slow queries. Is that a lot?

| username: 大飞哥online | Original post link

Let’s start with handling slow SQL.

| username: 有猫万事足 | Original post link

Connections are balanced, but the workload within those connections may not be the same. For example, if several services share one cluster, services a and b might each execute one SQL query every 10 minutes, while service c executes 500 SQL queries per second. After load balancing through HAProxy, each TiDB instance has the same number of connections, but all of service c's connections may still end up on one instance.

You need to take this into account when using HAProxy for load balancing. Services with similar workloads can share one HAProxy, but if the workloads differ significantly, you may need to split them across two HAProxy instances to achieve balance. For instance, set up two HAProxy instances with the same backend configuration, then point services a and b at the first and service c at the second.
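The effect described above can be reproduced with a small simulation. This is only a sketch: it uses plain round-robin assignment as a stand-in for the proxy, treats the per-service rates as per-connection rates for simplicity, and the arrival order and backend names are made up.

```python
from itertools import cycle

def round_robin(connections, backends):
    """Assign each new connection to the next backend in turn."""
    assignment = {b: [] for b in backends}
    for conn, backend in zip(connections, cycle(backends)):
        assignment[backend].append(conn)
    return assignment

# Hypothetical arrival order: (service, queries per 10 minutes on that connection).
# 500 queries per second equals 300000 queries per 10 minutes.
arrivals = [("a", 1), ("c", 300_000), ("b", 1), ("c", 300_000)]
backends = ["tidb-1", "tidb-2"]  # placeholder names

assignment = round_robin(arrivals, backends)
conn_counts = {b: len(conns) for b, conns in assignment.items()}
query_load = {b: sum(q for _service, q in conns) for b, conns in assignment.items()}

print(conn_counts)  # connections per backend
print(query_load)   # queries per 10 minutes per backend
```

Both backends end up with two connections each, yet both of service c's connections land on the second backend, so it carries almost all of the query load.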

| username: Jasper | Original post link

Can you check, by digest, whether the corresponding SQL is all executed on 55?
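A rough way to do this check is to pull the slow-query rows for one digest and group them by tidb-server and client host. The sketch assumes the rows come from `INFORMATION_SCHEMA.CLUSTER_SLOW_QUERY` (columns `INSTANCE`, `Host`, `Query_time`, `Digest`); the helper function, host names, and sample values are hypothetical.

```python
from collections import defaultdict

# Assumed query; pass the digest as a bound parameter from your MySQL client.
# Depending on the TiDB version, a time-range filter may be needed to cover
# older slow log files.
SLOW_BY_DIGEST_SQL = """
SELECT INSTANCE, Host, Query_time
FROM INFORMATION_SCHEMA.CLUSTER_SLOW_QUERY
WHERE Digest = %s
"""

def summarize_slow_queries(rows):
    """Group (instance, client_host, query_time) rows by instance and client."""
    summary = defaultdict(lambda: {"count": 0, "total_time": 0.0})
    for instance, client_host, query_time in rows:
        entry = summary[(instance, client_host)]
        entry["count"] += 1
        entry["total_time"] += query_time
    return dict(summary)

# Hypothetical rows, as they might be fetched with SLOW_BY_DIGEST_SQL.
rows = [
    ("tidb-1:10080", "haproxy-host", 1.5),
    ("tidb-1:10080", "haproxy-host", 2.5),
    ("tidb-1:10080", "batch-host", 8.0),
]
for (instance, client_host), stats in summarize_slow_queries(rows).items():
    print(instance, client_host, stats["count"], stats["total_time"])
```

A client host other than the HAProxy machine would suggest that something connects to that tidb-server directly.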

| username: zhanggame1 | Original post link

HAProxy only balances the number of connections; it does not balance the load. If everything lands on one machine, check whether the application is using long-lived (persistent) connections.

| username: 昵称想不起来了 | Original post link

It looks like all the long-lived connections running slow queries are connected to that node.