Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: TiDB cluster suddenly appears to "go down"; all requests have extremely high latency
【TiDB Usage Environment】Production Environment
【TiDB Version】7.1.0
【Reproduction Path】
Occasional, unable to reproduce stably
【Encountered Problem: Phenomenon and Impact】
Between 8:00 and 9:42 the cluster suddenly experienced extremely high latency on all requests, making it almost unusable
Before the "crash", one of the tidb-server instances had significantly higher memory usage than the other nodes
Also observed a PD anomaly:
【Resource Configuration】
【Attachments: Screenshots/Logs/Monitoring】
【Recovery Method】
Recovered after restarting all tidb-server instances
Please provide detailed cluster configuration information and the specific error context…
Are there any logs or error messages?
Could you please provide the overall monitoring?
Please send a screenshot from the dashboard, focusing on TopSQL to see which SQL statements are consuming the most resources.
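If grabbing a dashboard screenshot is inconvenient, roughly the same information can also be pulled from the statements summary system tables. A minimal sketch, assuming the pymysql driver, the default SQL port 4000, and placeholder credentials:

```python
# Sketch: list the statement digests consuming the most latency/memory cluster-wide.
# Host, user and password are placeholders -- adjust to your environment.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=4000, user="root", password="", charset="utf8mb4")
try:
    with conn.cursor() as cur:
        cur.execute("""
            SELECT INSTANCE, DIGEST_TEXT, EXEC_COUNT, AVG_MEM, MAX_MEM, SUM_LATENCY
            FROM INFORMATION_SCHEMA.CLUSTER_STATEMENTS_SUMMARY
            ORDER BY SUM_LATENCY DESC
            LIMIT 10
        """)
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()
```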
It is probably a resource utilization issue.
This… there’s nothing here, you’d need to be a fortune teller to solve the problem. 
The images didn't get posted before; please take another look now.
Try deploying PD and TiDB separately; it looks like an OOM is affecting PD.
Check the TiDB logs, did it restart due to OOM?
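A rough way to rule out an OOM-triggered restart is to look for the startup banner in tidb.log. A minimal sketch; the log path is an assumption, point it at your actual deploy directory:

```python
# Sketch: detect tidb-server restarts by finding the startup banner in tidb.log.
# LOG_PATH is an assumption -- adjust to your deployment's log directory.
LOG_PATH = "/tidb-deploy/tidb-4000/log/tidb.log"

with open(LOG_PATH, encoding="utf-8", errors="replace") as f:
    for line in f:
        # TiDB prints a "Welcome to TiDB" banner every time the process starts.
        if "Welcome to TiDB" in line:
            print(line.strip())
```

If a restart shows up in the incident window, it is also worth checking dmesg on that host for oom-killer records.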
TiDB did not encounter an OOM (Out of Memory) issue. The system has 512GB of memory, and there are no memory restrictions on TiDB.
Originally PD was co-located with TiKV; after the issue occurred it was migrated to be co-located with tidb-server. In any case, CPU, memory, and disk resources are more than sufficient.
Could it be that TiDB couldn't handle the load?
Doesn’t it show here that one node is connecting and disconnecting intermittently? Check if most of the load is on this node…
The first image in the top left corner shows that a TiDB node is frequently restarting. Focus on checking the situation of that machine.
My initial speculation is that PD and TiDB are mixed on the same machine, and this TiDB instance is squeezing the machine's resources or otherwise interfering with it, which in turn affects the PD service on that node. As it happens, that PD was the original PD leader, so the whole cluster was ultimately affected.
You can investigate along these lines to confirm it, and see whether any further information turns up.
Using HAProxy for load balancing, the connection counts appear to be balanced and the QPS is also fairly even. However, there are still cases where the load and memory of a random node gradually increase, causing that node's QPS to slowly drop to an abnormally low level.
When the issue occurred, PD was deployed together with TiKV and separately from tidb-server. By the time the screenshot was taken, PD had been moved to be co-located with tidb-server, because the tidb-server machines had ample idle resources.
So originally PD and TiKV were deployed together? After encountering issues, you moved PD to the TiDB server? The original topology indeed had problems. Let’s see if the issue still occurs with the current setup.
There is still a phenomenon where the memory of a particular tidb-server gradually increases and latency rises with it, while the connection counts and QPS across the nodes stay basically balanced.
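For what it's worth, when the memory on one node starts climbing, the in-flight sessions can be inspected per instance to see what is holding the memory. A minimal sketch, assuming pymysql and the same placeholder connection settings as above:

```python
# Sketch: show active sessions ordered by memory usage, with the tidb-server instance they run on.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=4000, user="root", password="", charset="utf8mb4")
try:
    with conn.cursor() as cur:
        cur.execute("""
            SELECT INSTANCE, ID, TIME, MEM, LEFT(INFO, 80) AS QUERY_HEAD
            FROM INFORMATION_SCHEMA.CLUSTER_PROCESSLIST
            WHERE COMMAND != 'Sleep'
            ORDER BY MEM DESC
            LIMIT 20
        """)
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()
```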
Judging from the TiDB uptime, which keeps increasing, it does not seem to have crashed. Could the high (abnormal) load be causing the data fetch to fail?
Check the historical monitoring curves to see the load and usage conditions.
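If the dashboard itself might be failing to fetch data, the uptime can be confirmed straight from Prometheus. A rough sketch; the Prometheus address and the job label are assumptions based on a default TiUP deployment:

```python
# Sketch: compute each tidb-server's uptime from Prometheus, bypassing the dashboard.
# PROM is an assumption -- use your monitoring host and port.
import requests

PROM = "http://127.0.0.1:9090"
resp = requests.get(
    f"{PROM}/api/v1/query",
    params={"query": 'time() - process_start_time_seconds{job="tidb"}'},
    timeout=10,
)
resp.raise_for_status()
for item in resp.json()["data"]["result"]:
    # value is [timestamp, "<seconds>"]; print uptime in hours per instance
    print(item["metric"].get("instance"), f'{float(item["value"][1]) / 3600:.1f} h')
```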
The memory of one tidb-server gradually increasing while the other tidb-servers stay flat is usually caused by a large SQL statement slowly reading a large amount of data.
In that case, check the slow query log and tidb.log, search for the expensive_query keyword during the problem period, and then analyze those SQL statements.
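For reference, a minimal sketch of both checks; the connection settings, the log path, and the date in the time window are placeholders (the 8:00-9:42 window comes from the problem description):

```python
# Sketch: (1) pull slow queries from the incident window ordered by peak memory,
#         (2) grep tidb.log for expensive_query entries.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=4000, user="root", password="", charset="utf8mb4")
try:
    with conn.cursor() as cur:
        cur.execute("""
            SELECT INSTANCE, Time, Query_time, Mem_max, LEFT(Query, 80) AS QUERY_HEAD
            FROM INFORMATION_SCHEMA.CLUSTER_SLOW_QUERY
            WHERE Time BETWEEN '2023-08-01 08:00:00' AND '2023-08-01 09:42:00'  -- placeholder date
            ORDER BY Mem_max DESC
            LIMIT 20
        """)
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()

# The log path is an assumption -- adjust to your deployment.
with open("/tidb-deploy/tidb-4000/log/tidb.log", encoding="utf-8", errors="replace") as f:
    for line in f:
        if "expensive_query" in line:
            print(line.strip())
```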