Urgent Help Needed: Sudden Performance Degradation in TiDB

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 紧急求助:tidb性能急剧下降

| username: porpoiselxj

[TiDB Usage Environment] Production Environment
[TiDB Version] v6.1.1
[Reproduction Path] Upgraded nearly a year ago, no recent adjustments made
[Encountered Problem: Symptoms and Impact]
Over the past two days, TiKV gRPC duration has suddenly increased, causing cluster-wide throughput to drop sharply. It feels like a pile of SQL queries are all queued at the front, not executing at all.


| username: h5n1 | Original post link

Check the network monitoring from blackbox_exporter and node_exporter.

| username: porpoiselxj | Original post link

There is no latency detected on the network, and communication between cluster nodes is through 10 Gigabit fiber optics. It doesn’t seem to be a network issue.

| username: h5n1 | Original post link

Have you looked into slow SQL? Also check disk IO and CPU.
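As a quick spot check on the CPU side (a minimal Linux sketch; this is roughly what the Grafana CPU panels are built from, and `iostat -x` covers the disk side), overall utilization can be estimated from two `/proc/stat` samples:

```shell
# Estimate overall CPU utilization from two /proc/stat samples (Linux only).
# Fields after "cpu": user nice system idle iowait irq softirq ...
read -r _ u1 n1 s1 i1 w1 _ < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 w2 _ < /proc/stat
busy=$(( (u2 - u1) + (n2 - n1) + (s2 - s1) ))
total=$(( busy + (i2 - i1) + (w2 - w1) ))
[ "$total" -gt 0 ] || total=1   # guard against a zero-tick interval
echo "cpu_busy_pct=$(( 100 * busy / total ))"
```

For slow SQL itself, TiDB also exposes the `information_schema.slow_query` table and the dashboard's slow-query page, which are easier to filter than log files.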

| username: porpoiselxj | Original post link

TiKV uses SSD, and the IO is usually quite high, with no significant changes recently. The durations for append log/commit log/apply log, etc., are also performing well without noticeable increases. CPU utilization is even lower, averaging below 20%.

We have always paid close attention to slow queries, and the execution plans for the currently blocked SQL queries seem to be fine.

| username: 裤衩儿飞上天 | Original post link

Post the monitoring screenshots; letting the data speak is more convincing.

| username: porpoiselxj | Original post link

Which monitoring charts are needed? It would be best to list the detailed panel paths as well. Much appreciated.

| username: 裤衩儿飞上天 | Original post link

The monitoring panels mentioned in your post and by the experts above.

| username: porpoiselxj | Original post link

[monitoring screenshots]

| username: h5n1 | Original post link

The network latency is already in seconds.

| username: porpoiselxj | Original post link

This only happens occasionally. Look at the avg data below; it's only a little over 300 µs.

| username: h5n1 | Original post link

Even an occasional spike like this looks abnormal; normally even the high values are only a few milliseconds.

| username: porpoiselxj | Original post link

Okay, I’ll check with IT to see if there are any network issues.
Are there any other optimization directions?

| username: TiDBer_jYQINSnf | Original post link

Check `pd-ctl store`.
Yesterday I ran into a cluster with two machines down; Regions had lost multiple replicas and couldn't serve properly. After bringing one machine back up, it returned to normal.
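A sketch of that check (the PD address and the sample JSON below are made up; on a real cluster you would run `pd-ctl` or `tiup ctl` against your own PD endpoint):

```shell
# On a real cluster (PD address is an example):
#   tiup ctl:v6.1.1 pd -u http://127.0.0.1:2379 store > stores.json
# Here we use a made-up sample of that JSON to show the filtering step.
cat > stores.json <<'EOF'
{"stores":[
 {"store":{"id":1,"address":"10.0.0.1:20160","state_name":"Up"}},
 {"store":{"id":4,"address":"10.0.0.2:20160","state_name":"Disconnected"}}
]}
EOF
# Flag any store whose state_name is not Up (Down/Disconnected/Offline).
grep -o '"state_name":"[^"]*"' stores.json | grep -v '"Up"'
```

Any store that is not `Up` is worth investigating before looking elsewhere.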

| username: porpoiselxj | Original post link

The instance seems to be running normally, and there haven’t been any crashes or restarts recently.

| username: 像风一样的男子 | Original post link

Take a look at the SQL statement analysis over the recent period to see which SQL statements are time-consuming and frequently called.

| username: porpoiselxj | Original post link

Kneeling in gratitude to the experts. After inspection, IT found a faulty optical module on one TiKV node that was causing heavy packet loss. After replacing the hardware, performance immediately recovered. Thank you very much.

| username: zhanggame1 | Original post link

This error looks a lot like packet loss, which is actually quite easy to check.
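One easy check (a minimal Linux sketch; interface names vary, and a faulty optical module will usually also show up in the `ethtool -S` error counters for that NIC) is to read the per-interface drop counters that node_exporter itself scrapes:

```shell
# Print per-interface RX/TX drop counters from /proc/net/dev (Linux only).
# After the two header lines: column 1 is the interface,
# column 5 is RX drops, column 13 is TX drops.
awk 'NR > 2 { gsub(":", "", $1); printf "%-10s rx_drop=%s tx_drop=%s\n", $1, $5, $13 }' /proc/net/dev
```

Steadily growing drop counts on one TiKV host, while the others stay flat, points straight at that host's link.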

| username: cassblanca | Original post link

After all this time, it turns out the network is to blame.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.