Leader Scheduling Issue

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: Leader调度问题

| username: zzsh

【TiDB Usage Environment】Production Environment / Testing / PoC
【TiDB Version】V5.0.3

scheduler add evict-leader-scheduler

What is the scope of this command?

Under what circumstances does it take effect, and under what circumstances does it not?
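
For reference, a minimal sketch of adding and inspecting this scheduler with pd-ctl, assuming tiup is available (the PD endpoint is a placeholder; adjust to your cluster):

  # evict all Region leaders from store 1
  tiup ctl:v5.0.3 pd -u http://127.0.0.1:2379 scheduler add evict-leader-scheduler 1
  # confirm the scheduler is registered
  tiup ctl:v5.0.3 pd -u http://127.0.0.1:2379 scheduler show
  # leader_count on store 1 should then drop toward 0
  tiup ctl:v5.0.3 pd -u http://127.0.0.1:2379 store 1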

| username: zzsh | Original post link

The current issue is:

I used this command to migrate all Region leaders off store 1.

However, when the current PD leader node became abnormal and the PD leader switched automatically, this setting seemed to stop taking effect.

Is this setting held only in memory rather than persisted?
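
One way to narrow this down after the PD leader switches is to check whether the scheduler is still registered at all; a sketch under the same assumptions as above:

  # if evict-leader-scheduler is still listed, the setting was persisted and
  # the problem lies in scheduling; if it is gone, it was lost on the switch
  tiup ctl:v5.0.3 pd -u http://127.0.0.1:2379 scheduler show
  # show which store(s) it is configured to evict (available since v4.0)
  tiup ctl:v5.0.3 pd -u http://127.0.0.1:2379 scheduler config evict-leader-scheduler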

| username: 小龙虾爱大龙虾 | Original post link

Please provide the PD-related monitoring panel for review.

| username: DBRE | Original post link

Try taking the abnormal PD leader node offline first.

| username: xfworld | Original post link

If PD is abnormal, it needs to be reset.

PD is the scheduling center, and after switching PD, the scheduling instructions may become invalid.

It is recommended to observe whether there are changes in the leader information of all TiKV nodes. If there are changes, just wait.

How quickly leaders are evicted depends on the cluster's configuration and performance, and there are other parameters that can speed up the eviction (see the sketch below).
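
One such parameter is the leader scheduling limit; a hedged example (the value 8 is illustrative only, and the PD endpoint is a placeholder):

  # raise the number of concurrent leader-transfer operations PD allows
  tiup ctl:v5.0.3 pd -u http://127.0.0.1:2379 config set leader-schedule-limit 8
  # verify the new value
  tiup ctl:v5.0.3 pd -u http://127.0.0.1:2379 config show | grep leader-schedule-limit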

| username: zhaokede | Original post link

It takes effect immediately.

| username: zzsh | Original post link

If PD switches its leader, this will no longer be effective and needs to be set again, right?

| username: Jack-li | Original post link

Effective Situations:

  • When a node in the cluster is overloaded and needs to reduce its responsibility as a Raft leader.
  • Before performing maintenance on a node (such as upgrading or restarting) to avoid service interruption risks.
  • To adjust the data distribution in the cluster, improving overall read/write performance and stability.

Ineffective Situations:

  • The target node (the node from which the leader needs to be evicted) does not exist or is offline.
  • There are no other nodes in the cluster that can receive additional leaders, i.e., all other nodes are also in a high-load state.
  • The PD component of the TiDB cluster is not running properly and cannot handle scheduling requests.
  • Specific label rules or constraints have been manually set, preventing the scheduler from finding suitable nodes to migrate the leaders to (each of these cases can be checked with the commands sketched below).
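
A few pd-ctl checks covering the cases above (store ID 1 and the PD endpoint are placeholders):

  # is the target store up, and how many leaders does it still hold?
  tiup ctl:v5.0.3 pd -u http://127.0.0.1:2379 store 1
  # is PD healthy, and which member is the leader?
  tiup ctl:v5.0.3 pd -u http://127.0.0.1:2379 member
  # are there label or placement constraints that could block leader migration?
  tiup ctl:v5.0.3 pd -u http://127.0.0.1:2379 config show all
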
| username: 小于同学 | Original post link

It takes effect immediately.

| username: yytest | Original post link

It should take effect immediately.

| username: zzsh | Original post link

Thank you very much.

| username: TiDBer_QYr0vohO | Original post link

It takes effect immediately.

| username: zhh_912 | Original post link

Check the PD information.

| username: YuchongXU | Original post link

Check the monitoring and related logs.

| username: yytest | Original post link

Has the issue been resolved? Could you provide the logs?

| username: zzsh | Original post link

I would like to ask: if I apply this command to a certain store, under what circumstances will the setting become invalid? For example, will a PD leader switch invalidate it? My cluster has no other custom configuration and remains at the defaults.

| username: Jasper | Original post link

I tested this with version 7.5.1 and couldn’t reproduce your scenario. You can try manually transferring the PD leader and then checking the current scheduling policy with scheduler config evict-leader-scheduler, as sketched below.
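
A sketch of that test (the member name pd-2 and the PD endpoint are placeholders):

  # transfer the PD leader to another member
  tiup ctl:v7.5.1 pd -u http://127.0.0.1:2379 member leader transfer pd-2
  # the eviction config should still list the original store afterwards
  tiup ctl:v7.5.1 pd -u http://127.0.0.1:2379 scheduler config evict-leader-scheduler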

| username: yytest | Original post link

Effective Scenarios

The evict-leader-scheduler scheduler will be effective in the following scenarios:

  1. Load Balancing: When the number of leaders on certain TiKV nodes is significantly higher than on other nodes, you can add the evict-leader-scheduler scheduler to balance the leader distribution, thereby achieving load balancing.
  2. Maintenance Operations: Before performing hardware upgrades, software updates, or other maintenance operations, you can migrate leaders away from the TiKV nodes that are about to undergo maintenance to ensure service continuity and data availability.
  3. Failover: When performance issues are detected on a TiKV node or a failure is imminent, you can preemptively evict the leaders from that node to quickly switch to healthy nodes.

Ineffective Scenarios

The evict-leader-scheduler scheduler may not be effective or necessary in the following scenarios:

  1. Healthy Nodes: If all TiKV nodes are healthy and the leader distribution is already relatively balanced, there is usually no need to add the evict-leader-scheduler scheduler.
  2. Resource Constraints: If cluster resources are limited, frequently migrating leaders may increase the burden on network and computing resources. In such cases, the use of evict-leader-scheduler should be cautious.
  3. Configuration Errors: If the command parameters for adding the scheduler are set incorrectly, such as specifying an incorrect TiKV node ID, the scheduler may not be able to correctly execute the eviction operation.

When using the evict-leader-scheduler, you should decide whether to add and how to configure the scheduler based on the actual cluster status and business needs. It is also recommended to regularly monitor the cluster’s operating conditions and adjust the scheduling strategy based on the monitoring results to ensure efficient and stable cluster operation.
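
As a usage note, once the maintenance or rebalancing goal is reached, the eviction can be lifted; a sketch using the v4.0+ syntax (store ID and PD endpoint are placeholders):

  # stop evicting store 1 only (keeps the scheduler if it covers other stores)
  tiup ctl:v5.0.3 pd -u http://127.0.0.1:2379 scheduler config evict-leader-scheduler delete-store 1
  # or remove the scheduler entirely
  tiup ctl:v5.0.3 pd -u http://127.0.0.1:2379 scheduler remove evict-leader-scheduler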

| username: TiDBer_q2eTrp5h | Original post link

I’ve learned something from this issue. I ran into it before, and it was resolved by rebuilding everything.

| username: Kongdom | Original post link

My understanding is that it should be persistent; it would be unreasonable to have to evict the store’s leaders again after every switch. The situation encountered here is likely the PD leader anomaly preventing the setting from being persisted.