Connection Anomalies Under High Concurrency

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 大并发下的连接异常

| username: smily

The underlying database of an online teaching system is TiDB (6.1). Before class, students scan a code to check in, which requires accessing multiple databases. Connections were taking excessively long, and many logins timed out. Full-link analysis showed that the time was being spent connecting to TiDB. Because the situation was urgent, some databases were migrated to MySQL. After the system was reopened, the databases migrated to MySQL connected normally, but those remaining in TiDB still experienced connection timeouts.

The following checks were performed on TiDB:

  1. Host resources and system parameters are configured according to official recommendations.
  2. Load balancing strategy is set to least connections.
  3. The number of compute nodes was expanded to 5.
  4. There are no issues with network firewalls, database auditing, or bandwidth.
  5. Unnecessary test data has been cleaned up.
  6. Optimized TiDB performance parameters:
    • tidb_mem_quota_query is set to 4G;
    • max_execution_time is set to 0;
    • tidb_mem_oom_action is set to CANCEL;
    • server-memory-quota is set to 32G;
    • tidb_replica_read is set to leader-and-follower;
    • max-server-connections is set to 0;
    • token-limit is set to 1000;
    • max-procs is set to 0;
  7. The database disk throughput reaches 4000 or above; database read/write performance was tested using sysbench.
  8. There are no relevant errors in the TiDB and TiKV logs.
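
The checks above are mostly about steady-state capacity. Connection setup latency under a burst of new clients can be probed separately. The following is a minimal Python sketch using only the standard library, not a replacement for a real load test. It assumes that TiDB, as a MySQL-protocol server, sends its greeting packet first, and it measures only the TCP connect plus the wait for that greeting, without authenticating. The names `connect_once` and `burst`, the default client count, and any host or port passed in are illustrative and do not come from the original post.

```python
import socket
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def connect_once(host, port, timeout=5.0):
    """Return seconds until the server's first bytes arrive, or None on failure."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # A MySQL-protocol server speaks first; an empty read means it hung up.
            if not sock.recv(4):
                return None
    except OSError:  # refused, reset, or timed out
        return None
    return time.perf_counter() - start


def burst(host, port, clients=200, timeout=5.0):
    """Open `clients` connections at roughly the same time and summarize latency."""
    with ThreadPoolExecutor(max_workers=clients) as pool:
        results = list(pool.map(lambda _: connect_once(host, port, timeout),
                                range(clients)))
    ok = sorted(r for r in results if r is not None)
    summary = {"attempted": clients, "failed": clients - len(ok),
               "median": None, "p95": None, "max": None}
    if ok:
        summary["median"] = statistics.median(ok)
        summary["p95"] = ok[min(len(ok) - 1, int(0.95 * len(ok)))]
        summary["max"] = ok[-1]
    return summary

# Example call with a hypothetical address (4000 is TiDB's default SQL port):
# print(burst("10.0.0.10", 4000, clients=200))
```

Running `burst` once against the load balancer address and once against a single tidb-server would show whether failures or a long tail appear only on one path.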

What other directions should I investigate? Thank you!

| username: forever | Original post link

  1. When the issue occurs, is the database resource usage very high?
  2. Have you tried connecting from the tiup client to the load balancer IP and to a TiDB IP separately, to determine whether the load balancer is the problem?
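
The comparison in point 2 can also be scripted. The sketch below, in Python with only the standard library, times the TCP connect and the wait for the server greeting separately for each endpoint. It relies on the MySQL protocol behavior that the server sends the first packet, and it does not log in. The function names `time_handshake` and `compare`, and the addresses in the usage comment, are hypothetical placeholders.

```python
import socket
import time


def time_handshake(host, port, timeout=5.0):
    """Return (tcp_connect_seconds, greeting_wait_seconds) for one attempt."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        connected = time.perf_counter()
        # Wait for the first bytes of the server's initial handshake packet.
        first = sock.recv(4)
        greeted = time.perf_counter()
    if not first:
        raise ConnectionError(f"{host}:{port} closed the connection before greeting")
    return connected - start, greeted - connected


def compare(endpoints, attempts=5, timeout=5.0):
    """endpoints maps a label to (host, port); returns the worst timings per label."""
    report = {}
    for label, (host, port) in endpoints.items():
        samples = [time_handshake(host, port, timeout) for _ in range(attempts)]
        report[label] = (max(s[0] for s in samples), max(s[1] for s in samples))
    return report

# Hypothetical addresses; replace with the real load balancer and tidb-server IPs.
# compare({"load balancer": ("10.0.0.10", 4000), "tidb-1": ("10.0.0.11", 4000)})
```

If the slow part shows up only through the load balancer, the load balancer is the likelier suspect. If it also shows up when connecting to a tidb-server directly, the investigation moves to the TiDB side.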

| username: alfred | Original post link

  1. Please post the application-layer timeout error logs so we can take a look.
  2. Please also post the system load status, such as CPU, memory, and IO monitoring.

| username: smily | Original post link

Checked the hotspots, and they seem fine.

The number of DB connections is not excessive.

The instantaneous traffic is still relatively large.

CPU usage is significantly higher on TiDB.
[image: CPU usage monitoring]

| username: smily | Original post link

Sorry, my reply is a bit disorganized. :sweat_smile:

| username: smily | Original post link

Please give me some guidance~

| username: yilong | Original post link

  1. It looks like there are two tidb-servers. Why is the CPU usage of one significantly higher than the other? Is the front-end load balancing configured properly?
  2. For TiKV, the max usage also seems higher on two specific instances. It’s unclear whether this happens at the same time or at different times. If one TiKV instance is higher at a specific time, it could be a hotspot. You can check whether it’s a read hotspot or a write hotspot and see whether it can be scattered or split.
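
To put a number on the unevenness described in point 1, a small helper can compare the busiest instance with the average. This is only an illustrative Python sketch. The function name `imbalance` is made up, and the figures in the example call are invented placeholders, not values from the monitoring in this thread. The input can be any per-instance metric read off the dashboards, such as CPU usage or connection count.

```python
from statistics import mean


def imbalance(usage):
    """usage maps instance name -> load metric (CPU % or connection count).

    Returns (ratio of the busiest instance to the mean, busiest instance name).
    A ratio near 1.0 means the load is spread evenly.
    """
    if not usage:
        raise ValueError("usage must contain at least one instance")
    avg = mean(usage.values())
    busiest = max(usage, key=usage.get)
    ratio = usage[busiest] / avg if avg else 0.0
    return ratio, busiest


# Invented sample values, for illustration only.
ratio, busiest = imbalance({"tidb-1": 90.0, "tidb-2": 30.0})
print(f"{busiest} carries {ratio:.2f}x the average load")
```

A ratio that stays well above 1.0 during the check-in peak would support looking at the load balancer configuration before anything else.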

| username: smily | Original post link

I also suspected a hotspot and checked it, but it was fine. However, what you mentioned about the uneven load on the two tidb-servers reminded me, thank you.
