TiDB shows a large increase in latency

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb出现大量的时延

| username: 苏半生Su

[TiDB Usage Environment] Production Environment
[TiDB Version] 5.4.0
[Reproduction Path] Operations performed that led to the issue
[Encountered Issue: Phenomenon and Impact]
Monitoring shows no fluctuation in OPS, and no new slow SQL queries. However, latency has increased significantly.
[Attachments: Screenshots/Logs/Monitoring]


| username: xfworld | Original post link

Check the CPU, IO, and network bandwidth of nodes with high latency.

| username: 苏半生Su | Original post link

Everything is normal with these items.

| username: liyuepeng123 | Original post link

Check the slow SQL analysis on the dashboard.

| username: 苏半生Su | Original post link

The slow queries are all routine SQL. I suspect one TiKV node has a problem: the commit log duration metric shows an anomaly on a single TiKV.

| username: terry0219 | Original post link

A high commit log duration is likely caused by heavier Raft log writes. Check whether there is a large volume of insert or update operations and whether there are read or write hotspots. Also check TiKV-Details > Thread CPU in Grafana to see whether the load on each thread pool is normal.
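To make the "one node stands out" check concrete, here is a minimal sketch that compares a per-store metric against the median across stores. The store names, the sample values, and the `find_outlier_stores` helper are all hypothetical; in practice the numbers would be read from the Grafana panels mentioned in this thread.

```python
from statistics import median


def find_outlier_stores(durations_ms, factor=3.0):
    """Return the stores whose value exceeds `factor` times the median of all stores.

    `durations_ms` maps a store name to one metric value, for example the
    p99 commit log duration in milliseconds. The threshold factor is an
    arbitrary illustrative choice, not a TiKV recommendation.
    """
    if not durations_ms:
        return []
    baseline = median(durations_ms.values())
    return sorted(
        store
        for store, value in durations_ms.items()
        if baseline > 0 and value > factor * baseline
    )


# Hypothetical p99 commit log durations (ms), one value per TiKV store.
sample = {"tikv-1": 4.1, "tikv-2": 3.8, "tikv-3": 41.5, "tikv-4": 4.4}
print(find_outlier_stores(sample))
```

The same comparison can be applied to other per-store values, such as thread pool CPU, to see whether a single TiKV deviates from its peers.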

| username: 苏半生Su | Original post link

[Image in the original post; not available in this translation.]

| username: terry0219 | Original post link

Look at TiKV-Details > Cluster > CPU in Grafana to see whether CPU utilization is unbalanced across the TiKV nodes.

| username: 苏半生Su | Original post link

CPU is normal. I have taken the problematic TiKV node offline, and latency is no longer as high. Nothing changed in operations or in the business workload, so I still don’t know the cause.

| username: 路在何chu | Original post link

Could that TiKV node have a hardware problem?

| username: 小龙虾爱大龙虾 | Original post link

First, look at the slow queries.

| username: 苏半生Su | Original post link

The load is about the same, just like before.

| username: 苏半生Su | Original post link

Thread CPU is normal.

| username: Jellybean | Original post link

An increase in latency usually means slow queries are occurring. Analyze the cluster’s slow query log in TiDB Dashboard.

| username: 江湖故人 | Original post link

Have you tested the IO performance of the node you took offline? Abnormal IO could be what made the Raft commit log slow.
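As a rough illustration of such an IO test, the sketch below times small synced writes. Raft log appends are typically flushed to disk before being acknowledged, so slow `fsync` calls on one node would show up as slow commit log durations. This is only a quick probe under stated assumptions, not a substitute for a dedicated disk benchmark: the `measure_fsync_latency` helper, the write count, and the block size are illustrative choices, and the directory should be one on the suspect disk of the node that is already offline.

```python
import os
import tempfile
import time


def measure_fsync_latency(directory, writes=50, size=4096):
    """Append `writes` blocks of `size` bytes, fsync after each one,
    and return the per-write latencies in milliseconds."""
    payload = b"\0" * size
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(writes):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # force the block to stable storage
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.remove(path)
    return latencies


# Assumption: the system temp directory stands in for a directory on the
# disk under test; point this at the suspect disk when running for real.
samples = sorted(measure_fsync_latency(tempfile.gettempdir()))
print(f"median={samples[len(samples) // 2]:.3f} ms  max={samples[-1]:.3f} ms")
```

Running the same probe on a healthy TiKV host gives a baseline to compare against.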

| username: tidb狂热爱好者 | Original post link

It’s usually a hotspot. Check the Key Visualizer heatmap in TiDB Dashboard.

| username: andone | Original post link

Combine this with step-by-step troubleshooting.