Monitoring and Alert Issues

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 监控报警问题

| username: 健康的腰间盘

[TiDB Usage Environment] Production Environment
[TiDB Version] 6.5
[Reproduction Path] Modified some user permissions, but it doesn’t seem related
[Encountered Problem: Phenomenon and Impact] Queries ran for two to three minutes without returning results, and the business side reported errors. No error messages have been found so far. This was discovered within the past week, and the abnormal monitoring data matches the time of the issue (11:20).

| username: zhaokede | Original post link

Is everything back to normal after the recovery operation?

| username: 健康的腰间盘 | Original post link

No recovery operation was performed; it recovered on its own after two minutes.

| username: 健康的腰间盘 | Original post link

This is the most reasonable explanation I have found so far, but I don’t know how to prove it :sweat_smile:

| username: tidb菜鸟一只 | Original post link

There should be many log entries similar to “epoch not match” in the TiKV and TiDB logs around the corresponding time. This is generally caused by high load on the cluster: region information changes rapidly, so requests keep being sent with outdated region information.
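One rough way to check this is to count how many log lines mention "epoch not match" in each minute and see whether the count spikes around 11:20. The sketch below is only an illustration: the function name `count_per_minute` is made up here, and it assumes each log line starts with a bracketed timestamp such as `[2024/05/20 11:20:03.123 +08:00]`, so adjust the pattern if your logs look different.

```python
import re
import sys
from collections import Counter

# Assumption: each log line begins with "[YYYY/MM/DD HH:MM:SS.mmm +ZZ:ZZ]".
# Only the part up to the minute is captured, so hits are grouped per minute.
TS_MINUTE = re.compile(r"^\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2})")


def count_per_minute(lines, needle="epoch not match"):
    """Count lines containing `needle`, grouped by the minute in their timestamp."""
    counts = Counter()
    for line in lines:
        if needle not in line:
            continue
        match = TS_MINUTE.match(line)
        if match:  # lines without a recognizable timestamp are skipped
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python count_epoch.py <path-to-log-file>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log_file:
        for minute, hits in sorted(count_per_minute(log_file).items()):
            print(minute, hits)
```

If the per-minute counts jump at the same time the queries hung and drop again when they recovered, that would support the explanation; if they stay flat, the cause is probably elsewhere.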

| username: 健康的腰间盘 | Original post link

It seems so. The high load state lasted for about 40 minutes, but TiDB resumed normal queries after being stuck for two or three minutes.

| username: 鱼跃龙门 | Original post link

The cluster was too busy, so requests backed off a lot.

| username: 小于同学 | Original post link

The load has increased recently.