How to Solve the Operator Timeout Issue During the Region-Balance Process

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: region-balance过程中operator timeout问题如何解决

| username: ryans

【TiDB Usage Environment】Production Environment
【TiDB Version】
5.1.2
【Encountered Problem: Problem Phenomenon and Impact】
Error: [operator_controller.go:566] ["operator timeout"]

A large number of operator timeouts appear during the region-balance process.

Disk I/O, network, and CPU have been checked, with no obvious anomalies.

The batch of abnormal regions was checked and they all contain varchar(1024) fields, similar to TiDB schedule operator timeout - #20, from ojsl - TiDB Q&A community. However, the data on my side is not allowed to be deleted, so I cannot recover by deleting this batch of data. Currently, I have reduced the following parameters (see the pd-ctl sketch after this list), but the timeout phenomenon is still frequent:

  • replica-schedule-limit 32
  • region-schedule-limit 512
  • max-snapshot-count 256
  • max-pending-peer-count 256
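
For reference, this kind of change can be made online with pd-ctl, roughly like this (the PD address is a placeholder):

# Adjust PD scheduling limits online with pd-ctl of the matching cluster version
tiup ctl:v5.1.2 pd -u http://<pd-address>:2379 config set replica-schedule-limit 32
tiup ctl:v5.1.2 pd -u http://<pd-address>:2379 config set region-schedule-limit 512
tiup ctl:v5.1.2 pd -u http://<pd-address>:2379 config set max-snapshot-count 256
tiup ctl:v5.1.2 pd -u http://<pd-address>:2379 config set max-pending-peer-count 256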

What I want to know is: can the timeout duration be increased, and what is actually causing these timeouts?

【Key Logs】
PD logs

[2022/12/13 18:03:20.507 +07:00] [INFO] [operator_controller.go:566] ["operator timeout"] [region-id=34716892] [takes=10m0.583050683s] [operator="\"balance-region {mv peer: store [4] to [8]} (kind:region, region:34716892(1784,5556), createAt:2022-12-13 17:53:19.92478222 +0700 WIB m=+433284.308074027, startAt:2022-12-13 17:53:19.92485945 +0700 WIB m=+433284.308151282, currentStep:0, steps:[add learner peer 106156659 on store 8, use joint consensus, promote learner peer 106156659 on store 8 to voter, demote voter peer 34716894 on store 4 to learner, leave joint state, promote learner peer 106156659 on store 8 to voter, demote voter peer 34716894 on store 4 to learner, remove peer on store 4]) timeout\""]

Store 4 TiKV logs

[2022/12/13 17:53:19.689 +07:00] [INFO] [apply.rs:1370] ["execute admin command"] [command="cmd_type: ChangePeer change_peer { change_type: AddLearnerNode peer { id: 106156657 store_id: 8 role: Learner } }"] [index=22] [term=17] [peer_id=38182467] [region_id=38182463]
[2022/12/13 17:53:19.689 +07:00] [INFO] [apply.rs:1736] ["exec ConfChange"] [epoch="conf_ver: 13520 version: 2102"] [type=AddLearner] [peer_id=38182467] [region_id=38182463]
[2022/12/13 17:53:19.689 +07:00] [INFO] [apply.rs:1876] ["add learner successfully"] [region="id: 38182463 start_key: 7480000000000004FFC25F728000000099FFCDDD990000000000FA end_key: 7480000000000004FFC25F72800000009AFF242E650000000000FA region_epoch { conf_ver: 13520 version: 2102 } peers { id: 38182464 store_id: 32051340 role: Learner } peers { id: 38182465 store_id: 6 } peers { id: 38182466 store_id: 5 } peers { id: 38182467 store_id: 4 }"] [peer="id: 106156657 store_id: 8 role: Learner"] [peer_id=38182467] [region_id=38182463]
[2022/12/13 17:53:19.689 +07:00] [INFO] [raft.rs:2605] ["switched to configuration"] [config="Configuration { voters: Configuration { incoming: Configuration { voters: {38182465, 38182466, 38182467} }, outgoing: Configuration { voters: {} } }, learners: {38182464, 106156657}, learners_next: {}, auto_leave: false }"] [raft_id=38182467] [region_id=38182463]

Store 8 TiKV logs

[2022/12/13 17:53:19.690 +07:00] [INFO] [peer.rs:250] ["replicate peer"] [peer_id=106156657] [region_id=38182463]
[2022/12/13 17:53:19.690 +07:00] [INFO] [raft.rs:2605] ["switched to configuration"] [config="Configuration { voters: Configuration { incoming: Configuration { voters: {} }, outgoing: Configuration { voters: {} } }, learners: {}, learners_next: {}, auto_leave: false }"] [raft_id=106156657] [region_id=38182463]
[2022/12/13 17:53:19.690 +07:00] [INFO] [raft.rs:1088] ["became follower at term 0"] [term=0] [raft_id=106156657] [region_id=38182463]
[2022/12/13 17:53:19.690 +07:00] [INFO] [raft.rs:389] [newRaft] [peers="Configuration { incoming: Configuration { voters: {} }, outgoing: Configuration { voters: {} } }"] ["last term"=0] ["last index"=0] [applied=0] [commit=0] [term=0] [raft_id=106156657] [region_id=38182463]
[2022/12/13 17:53:19.690 +07:00] [INFO] [raw_node.rs:315] ["RawNode created with id 106156657."] [id=106156657] [raft_id=106156657] [region_id=38182463]
[2022/12/13 17:53:19.690 +07:00] [INFO] [raft.rs:1332] ["received a message with higher term from 38182466"] ["msg type"=MsgHeartbeat] [message_term=17] [term=0] [from=38182466] [raft_id=106156657] [region_id=38182463]
[2022/12/13 17:53:19.690 +07:00] [INFO] [raft.rs:1088] ["became follower at term 17"] [term=17] [raft_id=106156657] [region_id=38182463]
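
Lines like these can be pulled from each store's log by grepping for the region id, roughly like this (log paths are placeholders):

# On store 4 and store 8, filter the raftstore/apply lines for the region being replicated
grep "region_id=38182463" /path/to/tikv-deploy/log/tikv.log
# On the PD leader, follow a balance operator by its region id
grep "34716892" /path/to/pd-deploy/log/pd.log
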
| username: ryans | Original post link

I found that the times in the two images below match quite well.

| username: Min_Chen | Original post link

Hello. It looks like the parameters are set too high and exceed the cluster's processing capacity, resulting in a large number of down peers. I suggest you bring the four parameters you mentioned back to their default values first, and then gradually adjust them while observing the cluster's performance.
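
A rough pd-ctl sketch, assuming the v5.1 defaults (please confirm the exact defaults for your version with config show before changing anything):

# Check the current scheduling configuration first
tiup ctl:v5.1.2 pd -u http://<pd-address>:2379 config show
# Bring the four limits back toward the assumed v5.1 defaults
tiup ctl:v5.1.2 pd -u http://<pd-address>:2379 config set replica-schedule-limit 64
tiup ctl:v5.1.2 pd -u http://<pd-address>:2379 config set region-schedule-limit 2048
tiup ctl:v5.1.2 pd -u http://<pd-address>:2379 config set max-snapshot-count 3
tiup ctl:v5.1.2 pd -u http://<pd-address>:2379 config set max-pending-peer-count 16
# List regions that currently have down peers to see how far behind the cluster is
tiup ctl:v5.1.2 pd -u http://<pd-address>:2379 region check down-peer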

| username: ryans | Original post link

Hello, I initially saw some improvement by adjusting the parameters, but after a while, the timeout issue reappeared. Therefore, I suspect that certain regions have problems and are continuously retrying, which slows down the overall progress.

So, what I actually want to know is how to check the progress (logs) of a specific operator and what exactly it is doing at specific times. It seems that simply adjusting parameters won’t completely solve this problem.

| username: Min_Chen | Original post link

Hello,
After adjusting the parameters down to smaller or default values, observe the TiKV monitoring and logs to check scheduling and replication performance, and see whether the issue can be identified.
You can also upload a copy of the Clinic diagnostic data for us to review together.
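
To watch what a single operator is doing, a sketch along these lines may help (the region id is taken from your PD log; the address, log path, and cluster name are placeholders):

# List the operators PD is currently running or has queued
tiup ctl:v5.1.2 pd -u http://<pd-address>:2379 operator show
# Check the status of the operator on one specific region
tiup ctl:v5.1.2 pd -u http://<pd-address>:2379 operator check 34716892
# Inspect that region's peers, leader, and any pending/down peers
tiup ctl:v5.1.2 pd -u http://<pd-address>:2379 region 34716892
# Correlate with the PD leader log and the source/target TiKV logs for the same region id
grep "34716892" /path/to/pd.log
# Collect Clinic diagnostic data to share (assuming the diag component is installed)
tiup diag collect <cluster-name>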

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.