TiDB Cluster Experiences Sudden "Crash-Like" Phenomenon, All Requests Have Extremely High Latency

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIDB集群出现类似突然“宕机”的现象,全部请求超高延迟

| username: TiDBer_cCxPj52F

【TiDB Usage Environment】Production Environment
【TiDB Version】7.1.0
【Reproduction Path】
Occasional, unable to reproduce stably

【Encountered Problem: Phenomenon and Impact】
The cluster suddenly experienced extremely high latency between 8:00 and 9:42, making it almost unusable.



Before the crash, one of the tidb-server instances had significantly higher memory usage than other nodes

Observed another PD anomaly:

【Resource Configuration】
【Attachments: Screenshots/Logs/Monitoring】

【Recovery Method】
Recovered after restarting all tidb-server instances

| username: xfworld | Original post link

Please provide detailed cluster configuration information and the specific error context…

| username: heiwandou | Original post link

Are there any logs or error messages?

| username: Miracle | Original post link

Could you please provide the overall monitoring?

| username: 像风一样的男子 | Original post link

Please post a screenshot from the Dashboard, focusing on Top SQL, to see which SQL statements are consuming the most resources.
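
If Top SQL isn't enabled, something like the following against `INFORMATION_SCHEMA.CLUSTER_STATEMENTS_SUMMARY_HISTORY` gives a rough equivalent (a sketch; the date is a placeholder for the actual incident day):

```sql
-- Rank statement digests by total latency during the incident window.
-- Replace the placeholder date with the actual incident day.
SELECT instance, exec_count, sum_latency, avg_mem, max_mem,
       LEFT(digest_text, 120) AS stmt
FROM information_schema.cluster_statements_summary_history
WHERE summary_begin_time >= '2023-08-01 08:00:00'
  AND summary_end_time   <= '2023-08-01 10:00:00'
ORDER BY sum_latency DESC
LIMIT 10;
```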

| username: Fly-bird | Original post link

It is probably a resource utilization issue.

| username: 有猫万事足 | Original post link

This… there’s nothing here, you’d need to be a fortune teller to solve the problem. :joy:

| username: TiDBer_cCxPj52F | Original post link

The images didn't come through earlier; please take another look.

| username: 芮芮是产品 | Original post link

Deploy PD and TiDB separately; it looks like a TiDB OOM is affecting PD.

| username: 像风一样的男子 | Original post link

Check the TiDB logs, did it restart due to OOM?
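
For example, restarts can be confirmed from SQL (a sketch; `CLUSTER_LOG` requires a time range, and the date is a placeholder):

```sql
-- Each "Welcome to TiDB" banner in tidb.log marks a process start,
-- so any hit inside the window means that instance restarted.
SELECT time, instance, message
FROM information_schema.cluster_log
WHERE type = 'tidb'
  AND time BETWEEN '2023-08-01 07:00:00' AND '2023-08-01 10:00:00'
  AND message LIKE '%Welcome to TiDB%';
```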

| username: TiDBer_cCxPj52F | Original post link

TiDB did not encounter an OOM (Out of Memory) issue. The system has 512GB of memory, and there are no memory restrictions on TiDB.
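
For completeness, the effective limits can be verified from SQL (note that recent versions default `tidb_server_memory_limit` to 80% of system memory):

```sql
-- Check whether any instance-level or per-query memory limit is in effect.
SHOW VARIABLES LIKE 'tidb_server_memory_limit';
SHOW VARIABLES LIKE 'tidb_mem_quota_query';   -- per-query quota
SHOW VARIABLES LIKE 'tidb_mem_oom_action';    -- CANCEL or LOG
```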

| username: TiDBer_cCxPj52F | Original post link

Originally, PD was co-located with TiKV. After this issue occurred, it was migrated to run alongside tidb-server. In any case, the CPU, memory, and disk capacity are more than sufficient.

| username: tidb菜鸟一只 | Original post link

Could TiDB not handle the load?

Doesn't this show that one node keeps going offline and coming back up? Check whether most of the load is concentrated on that node…

| username: Jellybean | Original post link

The first image in the top left corner shows that a TiDB node is frequently restarting. Focus on checking the situation of that machine.

The initial speculation is that PD and TiDB are mixed on the same machine, and this TiDB is squeezing the machine's resources, which in turn affects the PD service on that host. Coincidentally, that PD was the original PD leader, so the whole cluster was ultimately affected.

You can follow this line of thought to investigate and confirm the issue, and see if there is any further information.

| username: TiDBer_cCxPj52F | Original post link

Using HAProxy for load balancing, the connection counts appear balanced and the QPS is also fairly even. Even so, there are still cases where the load and memory of some random node gradually climb, causing that node's QPS to slowly drop to an abnormally low level.
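
When one node starts climbing like that, the sessions holding memory on it can be listed directly (a sketch; the instance address is a placeholder):

```sql
-- Top memory-holding sessions on the affected tidb-server.
-- Replace the placeholder address with the instance whose memory is growing.
SELECT instance, id, user, db, time, mem, LEFT(info, 120) AS stmt
FROM information_schema.cluster_processlist
WHERE instance = '10.0.0.1:4000'
ORDER BY mem DESC
LIMIT 10;
```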

| username: TiDBer_cCxPj52F | Original post link

When the issue occurred, PD was deployed together with TiKV, separate from tidb-server. By the time the screenshots were taken, PD had been moved onto the tidb-server machines, since those machines had ample idle resources.

| username: tidb菜鸟一只 | Original post link

So originally PD and TiKV were deployed together? After encountering issues, you moved PD to the TiDB server? The original topology indeed had problems. Let’s see if the issue still occurs with the current setup.

| username: TiDBer_cCxPj52F | Original post link

The memory of one particular tidb-server still gradually increases, and its latency gradually rises with it. Connection counts and QPS remain basically balanced across the nodes.




| username: TiDBer_cCxPj52F | Original post link

Looking at the TiDB uptime, it doesn't seem to have crashed: the uptime keeps increasing. Is it possible that the abnormally high load is causing the monitoring retrieval to fail?
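
Uptime can also be cross-checked from SQL rather than from monitoring (a sketch):

```sql
-- START_TIME and UPTIME are reported by each instance itself, so they
-- stay accurate even if the Prometheus scrape is failing under load.
SELECT type, instance, start_time, uptime
FROM information_schema.cluster_info
WHERE type = 'tidb';
```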

| username: Jellybean | Original post link

Check the historical monitoring curves to see the load and usage over time.
One tidb-server's memory growing steadily while the others stay flat is usually caused by a large SQL query slowly reading a huge amount of data.
At that point, check the slow query log and tidb.log for the problem period, search for the `expensive_query` keyword, and then analyze the SQL statements you find.
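
For example, something like this pulls the slow queries from the problem window ranked by peak memory (a sketch; the date is a placeholder for the actual incident day):

```sql
-- Slow queries during the incident window, largest memory consumers first.
-- Replace the placeholder date with the actual incident day.
SELECT time, instance, query_time, mem_max, LEFT(query, 120) AS stmt
FROM information_schema.cluster_slow_query
WHERE time BETWEEN '2023-08-01 08:00:00' AND '2023-08-01 09:42:00'
ORDER BY mem_max DESC
LIMIT 20;
```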