The cluster got stuck twice, resulting in two batches of slow queries

translator_bot · June 23, 2024, 12:27am

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 集群卡住两次，出现两批次慢查询

| username: magongyong

[TiDB Usage Environment] Production Environment
[TiDB Version] v5.4.3
[Encountered Problem]
Two batches of slow queries occurred today at 14:04:50 and 14:13, especially the one at 14:04. However, the actual business volume was not high, with QPS less than 30,000. The relevant monitoring is as follows:
Dashboard Slow Query

QPS Monitoring

CPU, memory, IO, and other monitoring are relatively normal, but the following monitoring indicators are abnormal. The official documentation does not provide the meaning of these monitoring indicators, so I do not understand them well.
tikv-details - scheduler-key_mvcc interface

scheduler-txn_heart_neat

[Reproduction Path] Operations performed that led to the problem
Normal business operations

[Problem Phenomenon and Impact]
Many slow queries occurred, and the cluster got stuck twice. It is suspected to be a network issue, but I want to know how to confirm it is a network issue from the monitoring.

[Attachments]

Please provide the version information of each component, such as cdc/tikv, which can be obtained by executing cdc version/tikv-server --version.

translator_bot · June 23, 2024, 12:27am

| username: xfworld | Original post link

Check if there are any anomalies in the status and number of regions.
There are quite a few MVCC keys, check if there have been a large number of delete operations recently.