Monitoring and Alert Issues

translator_bot · June 20, 2024, 3:30pm

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 监控报警问题

[TiDB Usage Environment] Production Environment
[TiDB Version] 6.5
[Reproduction Path] Modified some user permissions, but it doesn’t seem related
[Encountered Problem: Phenomenon and Impact] Queries ran for two to three minutes without returning results, and the business side reported errors. No error messages have been captured so far. The problem was noticed within the past week, and the abnormal monitoring data matches the time of the issue (11:20).

translator_bot · June 20, 2024, 3:30pm

| username: zhaokede | Original post link

Is everything normal after the restore operation?

translator_bot · June 20, 2024, 3:30pm

| username: 健康的腰间盘 | Original post link

No restore operation was performed; it recovered on its own after about two minutes.

translator_bot · June 20, 2024, 3:30pm

| username: 健康的腰间盘 | Original post link

This is the most reasonable explanation I have found so far, but I don’t know how to prove it.

translator_bot · June 20, 2024, 3:30pm

| username: tidb菜鸟一只 | Original post link

There should be many log entries similar to “epoch not match” in the TiKV and TiDB logs at the corresponding times. This is generally caused by high load on the cluster: region information changes rapidly, so requests keep being sent with outdated region information.
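
One way to check this against the 11:20 incident window is to count matching log lines per minute. The following is only a rough sketch: the function name `count_matches_per_minute` is made up for illustration, it assumes each log line starts with a timestamp like `[2024/06/20 11:20:01.123 +08:00]`, and it assumes the message appears as “epoch not match”, “epoch_not_match”, or “EpochNotMatch”. The exact wording in your version may differ, so adjust the pattern to what you actually see in the logs.

```python
import re
import sys
from collections import Counter

# Assumption: lines begin with a timestamp such as
# "[2024/06/20 11:20:01.123 +08:00]". Capture it down to the minute.
TS_RE = re.compile(r"^\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}):\d{2}")

# Assumption: the message may be spelled with spaces, underscores,
# or in CamelCase, so match all three case-insensitively.
NEEDLE_RE = re.compile(r"epoch[ _]?not[ _]?match", re.IGNORECASE)


def count_matches_per_minute(lines):
    """Return a Counter mapping 'YYYY/MM/DD HH:MM' to matching-line count."""
    counts = Counter()
    for line in lines:
        if not NEEDLE_RE.search(line):
            continue
        m = TS_RE.match(line)
        if m:  # skip lines without the assumed timestamp prefix
            counts[m.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Usage: python count_epoch.py <log file> [<log file> ...]
    for path in sys.argv[1:]:
        with open(path, errors="replace") as f:
            for minute, n in sorted(count_matches_per_minute(f).items()):
                print(path, minute, n)
```

If the per-minute counts spike around 11:20 and fall off when queries recovered, that would support the explanation; if they stay flat, the cause is probably elsewhere.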

translator_bot · June 20, 2024, 3:30pm

| username: 健康的腰间盘 | Original post link

It seems so. The high load state lasted for about 40 minutes, but TiDB resumed normal queries after being stuck for two or three minutes.

translator_bot · June 20, 2024, 3:30pm

| username: 鱼跃龙门 | Original post link

The cluster was too busy, so requests backed off more.

translator_bot · June 20, 2024, 3:30pm

| username: 小于同学 | Original post link

The load has increased recently.