The Specific Process of TiKV Leader Drop

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv leader drop 具体过程

| username: 大鱼海棠

【TiDB Usage Environment】Production, Testing, Research
【TiDB Version】
【Encountered Problem】
Today I saw the description of leader drop in the official TiKV documentation and did not fully understand it. What exactly happens during a TiKV leader drop, and under what circumstances does it occur?
The documentation only says it is caused by the raftstore being busy, which feels a bit abstract.
【Reproduction Path】What operations were performed to encounter the problem
【Problem Phenomenon and Impact】

【Attachments】

Please provide the version information of each component, such as cdc/tikv, which can be obtained by executing cdc version/tikv-server --version.

| username: xfworld | Original post link

If the leader loses network connectivity or its heartbeats are lost, a re-election will occur.

More generally, if heartbeats cannot be sent or received for any reason, a new leader will be elected to replace the old one.
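
For reference, the timing that decides when a follower stops waiting for the leader and starts a new election is controlled by the raftstore configuration. A minimal sketch, assuming the commonly documented `[raftstore]` parameters (the defaults shown are illustrative; verify them against your TiKV version's documentation):

```toml
# tikv.toml -- sketch of the raftstore timing parameters that govern
# heartbeats and leader elections (defaults are version-dependent).
[raftstore]
# Base interval of the raftstore tick that drives heartbeats and election timeouts.
raft-base-tick-interval = "1s"
# The leader sends a heartbeat every raft-heartbeat-ticks * raft-base-tick-interval.
raft-heartbeat-ticks = 2
# A follower starts an election if it hears nothing from the leader for roughly
# raft-election-timeout-ticks * raft-base-tick-interval.
raft-election-timeout-ticks = 10
```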

| username: 大鱼海棠 | Original post link

Does the “leader drop” metric track the number of leader re-elections for a Region?

| username: xfworld | Original post link

Roughly, yes, it can be understood that way.
In addition, troubleshoot hotspot issues first, and then investigate at both the PD scheduling level and the TiKV level.


PD

  1. If the TiKV pressure is very low, consider whether PD scheduling is too frequent. Check the Operator Create panel on the PD page to see the types and quantities of operators that PD generates (a sketch of the related scheduling limits follows below).
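
As a rough sketch of where those scheduling knobs live, the limits below are the commonly documented `[schedule]` parameters in the PD configuration; the values are version-dependent defaults and are only illustrative:

```toml
# pd.toml -- illustrative scheduling limits that cap how many operators
# PD can run concurrently (values are version-dependent defaults).
[schedule]
# Maximum number of concurrent leader-transfer operators.
leader-schedule-limit = 4
# Maximum number of concurrent Region (peer-moving) operators.
region-schedule-limit = 2048
# Maximum number of concurrent hot-Region scheduling operators.
hot-region-schedule-limit = 4
# Maximum number of concurrent replica-repair operators.
replica-schedule-limit = 64
```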

TiKV

  • Rule description: This error is usually caused by the Raftstore thread being stuck, which indicates that this TiKV instance is under heavy pressure.
  • Handling methods (a configuration sketch follows after this list):
    1. Observe the Raft Propose monitoring to see whether the alerting TiKV node is significantly higher than the other TiKV nodes. If so, there is a hotspot on this TiKV, and you need to check whether hotspot scheduling is working properly.
    2. Observe the Raft IO monitoring to see whether the latency has increased. Very high latency suggests a disk bottleneck. One way to alleviate this, though it is not very safe, is to set sync-log to false.
    3. Observe the Raft Process monitoring to see whether the tick duration is very high. If so, add raft-base-tick-interval = "2s" under the [raftstore] configuration.
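
Putting steps 2 and 3 together, the relevant TiKV settings look roughly like this. This is a sketch only: sync-log = false trades durability for write latency, and the option has been deprecated in newer TiKV versions, so check your version's documentation before applying either change.

```toml
# tikv.toml -- illustrative [raftstore] adjustments for the cases above.
[raftstore]
# Step 2: skip fsync on every Raft write to relieve a disk bottleneck.
# Unsafe: writes since the last sync can be lost on power failure.
# Deprecated or ignored in newer TiKV versions.
sync-log = false
# Step 3: lengthen the base tick to reduce tick-processing pressure
# when the tick duration is consistently high (the default is "1s").
raft-base-tick-interval = "2s"
```
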
| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.