Troubleshooting Business Stalls Caused by a Sudden Latency Spike Lasting About 1 Minute

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 延迟突然升高1分钟左右导致业务卡顿问题排查

| username: TiDBer_Y2d2kiJh

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] 2 TiDB, 3 PD, 3 TiKV, 2 HA
[Reproduction Path] At around 19:39 yesterday, latency spiked for about 1 minute, causing the business to stall. The Overview dashboard showed that the PD monitoring data appeared to have restarted, while IO was normal at that time. The logs contained no error messages. How can this issue be located and troubleshot? Could a PD leader change cause this situation, as shown in the figure:
[Encountered Problem: Phenomenon and Impact]
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]

| username: songxuecheng | Original post link

It looks like there is an issue with the TiKV node. Did you stop or decommission the node?

| username: TiDBer_Y2d2kiJh | Original post link

No TiKV nodes were stopped or taken offline at that time, but the previously offlined TiKV nodes are still displayed in the monitoring.
[Image: Abnormal stores]

| username: 大飞哥online | Original post link

Check the logs of the down TiKV node and the TiKV monitoring for anomalies.

| username: songxuecheng | Original post link

Check the store using pd-ctl.
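
One way to make this check repeatable is to filter the JSON that the pd-ctl store command prints. The following is a minimal Python sketch, not an official tool. It assumes the output has the commonly seen shape: a top-level "stores" list whose entries contain store.id, store.address, and store.state_name. That shape may differ between versions. The helper name abnormal_stores and the sample data are made up for illustration.

```python
import json


def abnormal_stores(pd_store_json):
    """Return every store whose state_name is anything other than "Up".

    Assumes each entry looks like {"store": {"id", "address", "state_name"}}.
    """
    data = json.loads(pd_store_json)
    result = []
    for entry in data.get("stores", []):
        store = entry.get("store", {})
        state = store.get("state_name", "Unknown")
        if state != "Up":
            result.append(
                {
                    "id": store.get("id"),
                    "address": store.get("address"),
                    "state": state,
                }
            )
    return result


# Hypothetical sample data; the addresses and IDs are invented.
SAMPLE = """
{
  "count": 3,
  "stores": [
    {"store": {"id": 1, "address": "10.0.0.1:20160", "state_name": "Up"}},
    {"store": {"id": 4, "address": "10.0.0.2:20160", "state_name": "Up"}},
    {"store": {"id": 5, "address": "10.0.0.3:20160", "state_name": "Down"}}
  ]
}
"""

if __name__ == "__main__":
    for s in abnormal_stores(SAMPLE):
        print(s)
```

To use it on a real cluster, save the output of the pd-ctl store command to a file and pass the file contents to abnormal_stores instead of SAMPLE.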

| username: TiDBer_Y2d2kiJh | Original post link

The pd-ctl store output still contains information about the offline TiKV nodes.

| username: songxuecheng | Original post link

What is the status?

| username: TiDBer_Y2d2kiJh | Original post link

Down, the same as the status shown here:
[Image: Abnormal stores]

| username: songxuecheng | Original post link

Please send a screenshot of the pd-ctl store output.

| username: TiDBer_Y2d2kiJh | Original post link

[Screenshot attached in the original post; not available here]

| username: kkpeter | Original post link

Switching the PD leader might restore it.

| username: TiDBer_Y2d2kiJh | Original post link

This TiKV node has already been scaled in and does not need to be restored. The question now is why this node's information suddenly reappeared in the monitoring. It might be because I didn't clean up its information properly before.

| username: kkpeter | Original post link

We have encountered this issue before. It occasionally occurs when the PD leader switches to an old node, but switching back resolves it. It seems to be a caching problem.
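
For reference, the leader check and switch mentioned above can be scripted. The following is a minimal Python sketch that only builds the pd-ctl command lines ("member leader show" and "member leader transfer") and does not execute anything. The helper name pd_ctl_leader_cmd, the PD address, and the member name "pd-2" are placeholders for illustration, and the binary may need to be invoked differently depending on how pd-ctl is installed.

```python
def pd_ctl_leader_cmd(pd_url, action, target=None, binary="pd-ctl"):
    """Build an argv list for a pd-ctl "member leader" subcommand.

    action is "show" (print the current leader) or "transfer"
    (move leadership to the PD member named by target).
    """
    if action == "show":
        tail = ["show"]
    elif action == "transfer":
        if not target:
            raise ValueError("transfer requires a target PD member name")
        tail = ["transfer", target]
    else:
        raise ValueError("unsupported action: " + repr(action))
    return [binary, "-u", pd_url, "member", "leader"] + tail


if __name__ == "__main__":
    # Placeholder address and member name; replace with real values.
    print(" ".join(pd_ctl_leader_cmd("http://127.0.0.1:2379", "show")))
    print(" ".join(pd_ctl_leader_cmd("http://127.0.0.1:2379", "transfer", "pd-2")))
```

The returned list can be passed to subprocess.run once the address and member name have been verified against the actual cluster.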