TiDB Schedule Operator Timeout

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB schedule operator timeout

| username: ojsl

[TiDB Usage Environment] Production Environment
[TiDB Version] 5.4.1
[Encountered Issue] Most of the operators created by the scheduler (merge-region, replace-rule-offline-peer) time out. Some nodes in the cluster are being taken offline, but the region count on one of the nodes has not decreased.
[Reproduction Path]
[Issue Phenomenon and Impact] Scheduling for the offline node cannot be completed. I want to know which parameter controls the operator timeout, so that I can adjust it and finish scheduling this node off.

[Attachment]
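For context, the timed-out operators and the current scheduling configuration can be inspected with pd-ctl; a minimal sketch, assuming pd-ctl is run through tiup (the PD address is a placeholder):

```
# List the operators PD is currently running, with their status
tiup ctl:v5.4.1 pd -u http://<pd-address>:2379 operator show

# Show the current scheduling configuration (schedule limits, merge settings, etc.)
tiup ctl:v5.4.1 pd -u http://<pd-address>:2379 config show
```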

| username: xfworld | Original post link

Are the TiKV nodes too busy, causing them to be unresponsive?

| username: ojsl | Original post link

[The original reply contained only an image, which is not included in the translation.]

| username: ojsl | Original post link

These metrics are stable most of the time, and TiKV still has some spare CPU.

| username: xfworld | Original post link

Not only the CPU, but also the disk, memory, and network need to be considered.

| username: ojsl | Original post link

TiKV memory usage is relatively stable. The region count of the TiKV instance on this node has not decreased.

| username: xfworld | Original post link

There are many reasons why the offline status remains unchanged, and it is quite complex.

For example:

  1. The new node is relatively busy and unable to respond, causing region migration scheduling issues.
  2. Distribution issues of region leaders between new and old nodes.
  3. PD’s scheduling strategy is not aggressive enough.

You can refer to these parameters for adjustments (a rough pd-ctl sketch is included at the end of this reply):

This requires extra attention:
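For illustration only (example values, not recommendations from this thread), the kind of pd-ctl adjustments being referred to might look like this:

```
# Raise the concurrency of the main scheduling limits (example values)
config set leader-schedule-limit 8        # concurrent leader-transfer operators
config set region-schedule-limit 2048     # concurrent region-scheduling operators
config set replica-schedule-limit 64      # concurrent replica (add/remove peer) operators
config set max-pending-peer-count 64      # pending peers allowed per store

# Raise the per-store scheduling rate
store limit all 30
```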

| username: ojsl | Original post link

We have tried the configuration in that document, but the current issue is that most operators (merge-region, replace-rule-offline-peer) time out no matter how high the limits are raised.

| username: xfworld | Original post link

Then check whether the offline nodes still hold any region leaders. If not, you can try forcing them offline.
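A quick way to check is the leader_count of the offline store in pd-ctl; a sketch, with the store ID and PD address as placeholders:

```
# leader_count and region_count of the store appear in its "status" section
tiup ctl:v5.4.1 pd -u http://<pd-address>:2379 store <store_id>
```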

| username: ojsl | Original post link

Another offline instance still has a fairly large number of leaders, while the instance with more regions no longer has any leaders.

| username: ojsl | Original post link

When this instance first went offline, its leader count barely changed until we restarted it; after that, the leader count started to decrease. In addition, a few other instances have a small number of leaders that are no longer changing.

| username: ojsl | Original post link

Does “forced offline” refer to --force?

| username: xfworld | Original post link

Yes, it’s “force.”

tiup cluster scale-in --force forcibly scales the node in (not recommended except in special circumstances).
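For reference, the full command is along these lines (cluster name and node address are placeholders):

```
# Remove the node from the topology without waiting for region migration to finish;
# only use this when the remaining replicas of its regions are known to be healthy.
tiup cluster scale-in <cluster-name> --node <tikv-ip>:20160 --force
```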

| username: ojsl | Original post link

However, there are still many empty regions in the cluster that cannot be merged, so forcing this node offline will not solve my problem.
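For reference, empty-region merging is governed by PD's merge settings; a hedged sketch of loosening them in pd-ctl (values are examples only):

```
config set max-merge-region-size 20       # regions smaller than this (MiB) are merge candidates
config set max-merge-region-keys 200000   # regions with fewer keys than this are merge candidates
config set merge-schedule-limit 8         # concurrent merge operators
config set enable-cross-table-merge true  # allow merging regions that span different tables
```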

| username: xfworld | Original post link

What’s going on…

| username: ojsl | Original post link

We replaced the TiKV nodes by scaling out first and then scaling in. To speed up scheduling, we increased some of the limit parameters; the values shown are from after the limits were raised.

| username: ojsl | Original post link

offline-peer refers to a region peer located on a store that is being taken offline.

| username: Lucien-卢西恩 | Original post link

Hello~ This might require checking the placement rule and label configuration, to see whether an incorrect label configuration is preventing regions from being scheduled across the different domains at the same label level. For example, with a Rack-level configuration of Rack1, Rack2, and Rack3: if the offline TiKV is labeled Rack2 and there is no other TiKV instance in Rack2 that can hold its regions, then the regions on the offline TiKV cannot be scheduled away.
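A rough sketch of the checks being suggested, with the PD address as a placeholder:

```
# Replica count and location-labels (e.g. rack/host) used for placement
tiup ctl:v5.4.1 pd -u http://<pd-address>:2379 config show replication

# Each store's labels are listed in its output
tiup ctl:v5.4.1 pd -u http://<pd-address>:2379 store

# Placement rules, if they are enabled
tiup ctl:v5.4.1 pd -u http://<pd-address>:2379 config placement-rules show
```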

| username: ojsl | Original post link

This has been checked; the current configuration puts everything under a single rack.

| username: ojsl | Original post link

With the help of @jingyu ma, the store limit has been adjusted, and the remove-orphan-peer operators are now reasonably fast. However, replace-rule-offline-peer is only slightly faster than before and is still very slow; in particular, the region count on some stores is dropping especially slowly.
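For context, the per-store limits mentioned here can be viewed and tuned per operation type in pd-ctl; an illustrative sketch (the rate and store ID are placeholders):

```
store limit                              # show the current add-peer/remove-peer limits per store
store limit all 60 add-peer              # example: raise the add-peer rate on all stores
store limit <store_id> 60 remove-peer    # example: raise the remove-peer rate on one store
```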