[TiDB Usage Environment] Production Environment
[TiDB Version] v4.0.10
[Encountered Issue: Phenomenon and Impact] TiKV node error, reason: not_leader

In our production environment we run a TiDB cluster (v4.0.10) with 5 TiKV nodes. At noon on March 5th we suddenly received an alert with the reason not_leader.
The monitoring shows that the leader count on two TiKV nodes dropped sharply and recovered after about 2 minutes. During that time the business pods restarted.
There are some log entries on the PD nodes.

You can take a look at this article; it might be helpful.

Your version is a bit outdated, I suggest upgrading. Check if there are any hotspots during that time period on the UI panel.

A not_leader error occurs when a request reaches TiKV using region cache information, but the region's leader has already moved to another node. The request is then retried against the new leader, which is normal behavior. Judging from the TiKV leader monitoring, the two TiKV nodes may have dropped leaders for some reason, such as slow responses. To confirm, check leader drop and the other panels under TiKV detail → errors. Also check the network monitoring to confirm that the network on those two TiKV nodes is normal.
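To make the retry behavior described above concrete, here is a minimal Python sketch. It is a toy model, not the actual TiDB client code: `FakeCluster`, `RegionCache`, `NotLeader`, and `send_with_retry` are hypothetical names invented for illustration, and the sketch assumes the error carries a hint about the new leader.

```python
from dataclasses import dataclass, field


class NotLeader(Exception):
    """Hypothetical error: the contacted store is not the region's leader."""

    def __init__(self, new_leader=None):
        super().__init__("not_leader")
        self.new_leader = new_leader  # leader hint; assumed to be present here


@dataclass
class FakeCluster:
    """Toy stand-in for the TiKV stores: region id -> actual leader store id."""
    leaders: dict

    def send(self, store_id, region_id, request):
        actual = self.leaders[region_id]
        if store_id != actual:
            # The leader has moved; reject and tell the client where it went.
            raise NotLeader(new_leader=actual)
        return f"store {store_id} handled {request}"


@dataclass
class RegionCache:
    """Toy client-side cache: region id -> store the client believes is leader."""
    leaders: dict = field(default_factory=dict)


def send_with_retry(cluster, cache, region_id, request, max_retries=3):
    """Send using the cached leader; on not_leader, update the cache and retry."""
    store_id = cache.leaders[region_id]
    for _ in range(max_retries + 1):
        try:
            return cluster.send(store_id, region_id, request)
        except NotLeader as err:
            if err.new_leader is None:
                raise  # no hint: a real client would have to look the leader up
            cache.leaders[region_id] = err.new_leader  # refresh the stale entry
            store_id = err.new_leader
    raise RuntimeError("retries exhausted while chasing the region leader")


# The cache still points at store 2, but the leader has moved to store 4.
cluster = FakeCluster(leaders={1: 4})
cache = RegionCache(leaders={1: 2})
result = send_with_retry(cluster, cache, region_id=1, request="get k")
print(result)
```

In this model a single stale cache entry costs one extra round trip. When many leaders move at once, as in the leader drop reported here, many requests hit this retry path together, which would be consistent with a burst of not_leader alerts.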

Within one minute, there were 337 scheduling operations…

Check this node…

I strongly suspect a hotspot issue.

I checked the historical monitoring for that period, and there were indeed many leader drops.

