The cluster suddenly responds very slowly, but recovers after a restart. How to troubleshoot?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 集群突然响应特别慢,重启后恢复,如何排查?

| username: Kongdom

[TiDB Usage Environment] Production Environment
[TiDB Version] v6.5.3
[Reproduction Path] None
[Encountered Problem: Problem Phenomenon and Impact]
During normal use, the cluster suddenly became very slow to respond, and both the dashboard and Grafana could not be opened. After restarting the cluster, the cluster response returned to normal.

The following error messages were found in the logs:


Related monitoring screenshots obtained after the restart:





Cluster configuration: TiDB + PD mixed-deployed on 3 nodes, TiKV on 3 nodes.

Log files:
122.tidb.log (31.3 KB)
123.tidb.log (12.8 MB)
124.tidb.log (22.9 KB)
125.tikv.log (966 bytes)
126.tikv.log (3.1 MB)
127.tikv.log (72.1 MB)

| username: lemonade010 | Original post link

Check for expensive queries in tidb.log. The difference from the slow query log is timing: the slow query log is written only after a statement finishes executing, whereas the expensive query log captures a statement while it is still running. As soon as a statement crosses a resource threshold (execution time or memory usage) mid-execution, TiDB immediately writes its details to the log.
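One way to hunt for such entries is to grep the TiDB logs for the `expensive_query` marker. A sketch below runs against a fabricated sample line (the field layout follows TiDB's expensive-query log format, but the exact fields may vary by version):

```shell
# Fabricated sample of an expensive-query log line for illustration only;
# real entries appear in tidb.log at WARN level.
cat > /tmp/sample_tidb.log <<'EOF'
[2023/08/01 10:00:00.000 +08:00] [WARN] [expensivequery.go:118] [expensive_query] [cost_time=120.5s] [conn_id=42] [sql="SELECT ..."]
EOF

# Extract any expensive-query entries together with their cost_time field.
grep -o 'expensive_query.*cost_time=[^]]*' /tmp/sample_tidb.log
```

Running the same `grep` over the real tidb.log on each node shows which statements tripped the threshold and when.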

| username: Kongdom | Original post link

The TiDB log level is set to error; I searched the TiDB logs on all three nodes and found no expensive queries.

| username: lemonade010 | Original post link

Does the GC task execute normally?

| username: Kongdom | Original post link

How do I check whether it's normal? Are there any indicators to look at?

It looks normal here.
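For reference, one way to verify that GC is running is to query TiDB's GC bookkeeping in the `mysql.tidb` table. This is a sketch; the host, port, and user are placeholders for your own endpoint:

```shell
# Check GC status: tikv_gc_last_run_time should be recent, and
# tikv_gc_safe_point should keep advancing over time. A safe point
# that stops moving suggests GC is blocked (e.g. by a long transaction).
mysql -h 127.0.0.1 -P 4000 -u root -e "
  SELECT VARIABLE_NAME, VARIABLE_VALUE
  FROM mysql.tidb
  WHERE VARIABLE_NAME IN ('tikv_gc_last_run_time', 'tikv_gc_safe_point');"
```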

| username: xfworld | Original post link

Your server lost its connection. What was it doing before it went offline…

| username: Kongdom | Original post link

I asked on-site; there were no abnormal operations, only normal business traffic.

| username: xfworld | Original post link

Is it a bug…

Are there any logs from earlier that we can refer to?

| username: Kongdom | Original post link

No; the log level of all three components is set to error. The TiKV logs mainly show connection timeouts and Raft errors.

| username: xfworld | Original post link

During this time period, did this machine crash?

| username: Kongdom | Original post link

It might be the firewall. After we turned off the firewall this morning, the errors stopped.

| username: TiDBer_H5NdJb5Q | Original post link

Are there a large number of write operations? Are regions being scheduled?

| username: Kongdom | Original post link

There aren't many writes.

| username: 小龙虾爱大龙虾 | Original post link

I'm seeing many network-related errors.

| username: Kongdom | Original post link

Yes, after disabling the firewall on 125, there are no more timeout errors.

| username: TiDBer_QYr0vohO | Original post link

That means the network issue is caused by the firewall policy.

| username: Kongdom | Original post link

Still following up and observing…

| username: xfworld | Original post link

What were you thinking, enabling the firewall in a production environment… and without even allowlisting the cluster…

| username: Kongdom | Original post link

:thinking: Not sure if that's the issue here; normally there should be a whitelist in place.
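Rather than disabling the firewall outright, the cluster's ports can be allowlisted. A firewalld sketch using TiDB's default ports (this assumes firewalld is the firewall in use; adjust the ports to your actual topology):

```shell
# Allowlist TiDB cluster default ports instead of stopping the firewall.
sudo firewall-cmd --permanent --add-port=4000/tcp       # TiDB SQL port
sudo firewall-cmd --permanent --add-port=10080/tcp      # TiDB status port
sudo firewall-cmd --permanent --add-port=2379-2380/tcp  # PD client/peer ports
sudo firewall-cmd --permanent --add-port=20160/tcp      # TiKV service port
sudo firewall-cmd --permanent --add-port=20180/tcp      # TiKV status port
sudo firewall-cmd --reload
```

With these rules in place, inter-node Raft traffic and the dashboard/Grafana endpoints stay reachable even with the firewall enabled.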

| username: Kongdom | Original post link

It is basically confirmed that the firewall caused it. Closing the thread.