Investigation of Business Stalling Issues Caused by Sudden Latency Increase of About 1 Minute

translator_bot · June 21, 2024, 10:08pm

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 延迟突然升高1分钟左右导致业务卡顿问题排查

| username: TiDBer_Y2d2kiJh

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] 2 TiDB, 3 PD, 3 TiKV, 2 HA
[Reproduction Path] Around 19:39 yesterday, latency rose sharply for about 1 minute, causing the business to stall. On the Overview dashboard, PD appeared to have restarted its monitoring, while IO was normal at the time, and the logs contained no errors. How can this issue be located and troubleshot? Could a PD leader switch cause this situation, as shown in the figure:
[Encountered Problem: Phenomenon and Impact]
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]

translator_bot · June 21, 2024, 10:08pm

| username: songxuecheng | Original post link

It looks like there is an issue with the TiKV node. Did you stop or decommission the node?

translator_bot · June 21, 2024, 10:08pm

| username: TiDBer_Y2d2kiJh | Original post link

No TiKV nodes were stopped or taken offline at that time, but information about a previously decommissioned TiKV node is still displayed in the monitoring.
[Screenshot: Abnormal stores]

translator_bot · June 21, 2024, 10:08pm

| username: 大飞哥online | Original post link

Check the logs of the down TiKV node for any anomalies, and also check the TiKV monitoring for any anomalies.

translator_bot · June 21, 2024, 10:08pm

| username: songxuecheng | Original post link

Check the store using pd-ctl.
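
A store listing is typically obtained with something like `pd-ctl -u http://<pd-address>:2379 store` (or through `tiup ctl:<cluster-version> pd`), which prints JSON. As a minimal sketch, assuming that output has been saved as text, the snippet below filters out every store whose `state_name` is not `Up`. The sample JSON, store IDs, and addresses are made up for illustration and are not taken from this cluster; the function name `abnormal_stores` is likewise hypothetical.

```python
import json

# Illustrative sample shaped like `pd-ctl store` JSON output.
# IDs and addresses are invented, not from the cluster in this thread.
SAMPLE_STORE_OUTPUT = """
{
  "count": 3,
  "stores": [
    {"store": {"id": 1, "address": "10.0.0.1:20160", "state_name": "Up"}},
    {"store": {"id": 4, "address": "10.0.0.2:20160", "state_name": "Up"}},
    {"store": {"id": 5, "address": "10.0.0.3:20160", "state_name": "Down"}}
  ]
}
"""


def abnormal_stores(raw_json):
    """Return (id, address, state_name) for every store that is not Up."""
    data = json.loads(raw_json)
    result = []
    for entry in data.get("stores", []):
        store = entry.get("store", {})
        if store.get("state_name") != "Up":
            result.append(
                (store.get("id"), store.get("address"), store.get("state_name"))
            )
    return result


if __name__ == "__main__":
    for store_id, address, state in abnormal_stores(SAMPLE_STORE_OUTPUT):
        print(store_id, address, state)
```

Any store reported here that belongs to a node already removed from the cluster is a candidate for the stale entry showing up in the monitoring.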

translator_bot · June 21, 2024, 10:08pm

| username: TiDBer_Y2d2kiJh | Original post link

The store output still contains information about the offline TiKV node.

translator_bot · June 21, 2024, 10:08pm

| username: songxuecheng | Original post link

What is the status?

translator_bot · June 21, 2024, 10:08pm

| username: TiDBer_Y2d2kiJh | Original post link

Down, the same as the status shown here:
[Screenshot: Abnormal stores]

translator_bot · June 21, 2024, 10:08pm

| username: songxuecheng | Original post link

Please send a screenshot of the pd-ctl store output.

translator_bot · June 21, 2024, 10:08pm

| username: TiDBer_Y2d2kiJh | Original post link

[Screenshot of the pd-ctl store output; image not available]

translator_bot · June 21, 2024, 10:08pm

| username: kkpeter | Original post link

Switching the PD leader might restore it.

translator_bot · June 21, 2024, 10:08pm

| username: TiDBer_Y2d2kiJh | Original post link

This TiKV node has already been scaled in and does not need to be restored. The question now is why this node's information suddenly reappeared in the monitoring. It might be because I didn't clean up its information properly at the time.

translator_bot · June 21, 2024, 10:08pm

| username: kkpeter | Original post link

We have encountered this issue before. It occasionally occurs when the PD leader switches to an old node, but switching back resolves it. It seems to be a caching problem.
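
For reference, checking and moving the PD leader is usually done with the `member leader show` and `member leader transfer <member-name>` subcommands of pd-ctl. The sketch below only builds the argument lists for those two calls, so nothing is executed against a cluster. The PD address and the member name `pd-1` are placeholders, and the helper `pd_ctl_command` is a hypothetical name introduced here for illustration.

```python
def pd_ctl_command(pd_url, *args):
    """Build an argv list for pd-ctl; pass it to subprocess.run to execute it."""
    return ["pd-ctl", "-u", pd_url, *args]


# Placeholder values: replace with a real PD endpoint and member name.
PD_URL = "http://127.0.0.1:2379"
TARGET_MEMBER = "pd-1"

# Show which PD member is currently the leader.
show_leader = pd_ctl_command(PD_URL, "member", "leader", "show")

# Move leadership to the chosen member.
transfer_leader = pd_ctl_command(
    PD_URL, "member", "leader", "transfer", TARGET_MEMBER
)

if __name__ == "__main__":
    print(" ".join(show_leader))
    print(" ".join(transfer_leader))
```

Comparing the leader before and after the latency spike would help confirm whether a leader switch coincided with the stale store reappearing.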