TiDB Schedule Operator Timeout, Cluster Does Not Balance Region

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB schedule operator timeout,集群不做balace region

| username: robert233

[TiDB Usage Environment] Production Environment
[TiDB Version] v5.4.2
[Reproduction Path] None
[Encountered Problem: Phenomenon and Impact] The monitoring shows that schedule operators keep timing out. After attempting a scale-out, leaders are not being scheduled and regions are not being balanced, so business operations keep timing out. (A quick pd-ctl check for this is sketched after the logs below.)
[Partial Monitoring and Logs]

  • Monitoring

  • Logs
2023-05-22 11:36:02 (GMT+8) TiKV 10.200.132.98:20163 [<unknown>] ["New connected subchannel at 0x7f1ce50757f0 for subchannel 0x7f30c3e22540"]
2023-05-22 11:36:02 (GMT+8) TiKV 10.200.132.98:20163 [<unknown>] ["New connected subchannel at 0x7f217d76be00 for subchannel 0x7f30c3e22fc0"]
2023-05-22 11:36:02 (GMT+8) TiKV 10.200.132.98:20163 [<unknown>] ["New connected subchannel at 0x7f217dc405c0 for subchannel 0x7f30c3e23340"]
2023-05-22 11:36:02 (GMT+8) TiKV 10.200.132.98:20163 [advance.rs:295] ["check leader failed"] [to_store=420634] [error="\"[rpc failed] RpcFailure: 12-UNIMPLEMENTED\""]
2023-05-22 11:36:02 (GMT+8) TiKV 10.200.132.98:20163 [advance.rs:295] ["check leader failed"] [to_store=1580508] [error="\"[rpc failed] RpcFailure: 12-UNIMPLEMENTED\""]
2023-05-22 11:36:03 (GMT+8) TiKV 10.200.132.98:20163 [<unknown>] ["New connected subchannel at 0x7f21188a14d0 for subchannel 0x7f30c3e22540"]
2023-05-22 11:36:03 (GMT+8) TiKV 10.200.132.98:20163 [<unknown>] ["New connected subchannel at 0x7f216e40c4e0 for subchannel 0x7f30c3e22fc0"]
2023-05-22 11:36:03 (GMT+8) TiKV 10.200.132.98:20163 [advance.rs:295] ["check leader failed"] [to_store=420634] [error="\"[rpc failed] RpcFailure: 12-UNIMPLEMENTED\""]
2023-05-22 11:36:03 (GMT+8) TiKV 10.200.132.98:20163 [advance.rs:295] ["check leader failed"] [to_store=420633] [error="\"[rpc failed] RpcFailure: 12-UNIMPLEMENTED\""]
2023-05-22 11:36:03 (GMT+8) TiKV 10.200.132.98:20163 [advance.rs:295] ["check leader failed"] [to_store=1580508] [error="\"[rpc failed] RpcFailure: 12-UNIMPLEMENTED\""]
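
A quick way to confirm whether PD is still generating operators and whether they ever complete is to query PD directly with pd-ctl. This is only a minimal sketch, assuming tiup is available on the control machine and <pd-address> is one of the PD endpoints (a placeholder, not taken from this cluster):

  # List pending operators; a backlog of operators that never finish matches the timeout symptom
  tiup ctl:v5.4.2 pd -u http://<pd-address>:2379 operator show

  # Confirm the balance schedulers (balance-leader-scheduler, balance-region-scheduler) are still enabled
  tiup ctl:v5.4.2 pd -u http://<pd-address>:2379 scheduler show

  # Review the scheduling limits that cap how many operators PD issues at a time
  tiup ctl:v5.4.2 pd -u http://<pd-address>:2379 config show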
| username: xfworld | Original post link

Is the cluster functioning normally despite such errors?

| username: robert233 | Original post link

TiDB duration monitoring (screenshot)

| username: robert233 | Original post link

From the monitoring, it appears that neither leaders nor regions are being balanced. Is a PD leader switch necessary?
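
One way to see whether leaders and regions are actually drifting apart is to compare the per-store counters reported by PD. A minimal sketch, again assuming <pd-address> is a PD endpoint (placeholder):

  # Each store entry includes leader_count, region_count, leader_score and region_score;
  # large, persistent gaps between TiKV stores suggest balancing has stopped
  tiup ctl:v5.4.2 pd -u http://<pd-address>:2379 store

  # Show which PD member currently holds the leader role
  tiup ctl:v5.4.2 pd -u http://<pd-address>:2379 member leader show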

| username: robert233 | Original post link

From the TiKV logs, there are a large number of check leader failed errors, and the target stores are all TiFlash nodes (a pd-ctl check to confirm this is sketched after the logs):

[2023/05/22 16:16:33.795 +08:00] [INFO] [advance.rs:295] ["check leader failed"] [to_store=1580508] [error="\"[rpc failed] RpcFailure: 12-UNIMPLEMENTED\""]
[2023/05/22 16:16:34.800 +08:00] [INFO] [advance.rs:295] ["check leader failed"] [to_store=420634] [error="\"[rpc failed] RpcFailure: 12-UNIMPLEMENTED\""]
[2023/05/22 16:16:34.801 +08:00] [INFO] [advance.rs:295] ["check leader failed"] [to_store=420633] [error="\"[rpc failed] RpcFailure: 12-UNIMPLEMENTED\""]
[2023/05/22 16:16:34.802 +08:00] [INFO] [advance.rs:295] ["check leader failed"] [to_store=1580508] [error="\"[rpc failed] RpcFailure: 12-UNIMPLEMENTED\""]
[2023/05/22 16:16:35.803 +08:00] [INFO] [advance.rs:295] ["check leader failed"] [to_store=420634] [error="\"[rpc failed] RpcFailure: 12-UNIMPLEMENTED\""]
[2023/05/22 16:16:35.804 +08:00] [INFO] [advance.rs:295] ["check leader failed"] [to_store=420633] [error="\"[rpc failed] RpcFailure: 12-UNIMPLEMENTED\""]
[2023/05/22 16:16:35.805 +08:00] [INFO] [advance.rs:295] ["check leader failed"] [to_store=1580508] [error="\"[rpc failed] RpcFailure: 12-UNIMPLEMENTED\""]
[2023/05/22 16:16:36.800 +08:00] [INFO] [advance.rs:295] ["check leader failed"] [to_store=420634] [error="\"[rpc failed] RpcFailure: 12-UNIMPLEMENTED\""]
[2023/05/22 16:16:36.801 +08:00] [INFO] [advance.rs:295] ["check leader failed"] [to_store=420633] [error="\"[rpc failed] RpcFailure: 12-UNIMPLEMENTED\""]
[2023/05/22 16:16:36.802 +08:00] [INFO] [advance.rs:295] ["check leader failed"] [to_store=1580508] [error="\"[rpc failed] RpcFailure: 12-UNIMPLEMENTED\""]
[2023/05/22 16:16:37.811 +08:00] [INFO] [advance.rs:295] ["check leader failed"] [to_store=420634] [error="\"[rpc failed] RpcFailure: 12-UNIMPLEMENTED\""]
[2023/05/22 16:16:37.812 +08:00] [INFO] [advance.rs:295] ["check leader failed"] [to_store=420633] [error="\"[rpc failed] RpcFailure: 12-UNIMPLEMENTED\""]
[2023/05/22 16:16:37.813 +08:00] [INFO] [advance.rs:295] ["check leader failed"] [to_store=1580508] [error="\"[rpc failed] RpcFailure: 12-UNIMPLEMENTED\""]
[2023/05/22 16:16:38.803 +08:00] [INFO] [advance.rs:295] ["check leader failed"] [to_store=420634] [error="\"[rpc failed] RpcFailure: 12-UNIMPLEMENTED\""]
[2023/05/22 16:16:38.805 +08:00] [INFO] [advance.rs:295] ["check leader failed"] [to_store=420633] [error="\"[rpc failed] RpcFailure: 12-UNIMPLEMENTED\""]
[2023/05/22 16:16:38.806 +08:00] [INFO] [advance.rs:295] ["check leader failed"] [to_store=1580508] [error="\"[rpc failed] RpcFailure: 12-UNIMPLEMENTED\""]
[2023/05/22 16:16:39.832 +08:00] [INFO] [advance.rs:295] ["check leader failed"] [to_store=420634] [error="\"[rpc failed] RpcFailure: 12-UNIMPLEMENTED\""]
[2023/05/22 16:16:39.833 +08:00] [INFO] [advance.rs:295] ["check leader failed"] [to_store=420633] [error="\"[rpc failed] RpcFailure: 12-UNIMPLEMENTED\""]
[2023/05/22 16:16:39.834 +08:00] [INFO] [advance.rs:295] ["check leader failed"] [to_store=1580508] [error="\"[rpc failed] RpcFailure: 12-UNIMPLEMENTED\""]
[2023/05/22 16:16:40.843 +08:00] [INFO] [advance.rs:295] ["check leader failed"] [to_store=420634] [error="\"[rpc failed] RpcFailure: 12-UNIMPLEMENTED\""]
[2023/05/22 16:16:40.845 +08:00] [INFO] [advance.rs:295] ["check leader failed"] [to_store=420633] [error="\"[rpc failed] RpcFailure: 12-UNIMPLEMENTED\""]
[2023/05/22 16:16:40.846 +08:00] [INFO] [advance.rs:295] ["check leader failed"] [to_store=1580508] [error="\"[rpc failed] RpcFailure: 12-UNIMPLEMENTED\""]
[2023/05/22 16:16:41.842 +08:00] [INFO] [advance.rs:295] ["check leader failed"] [to_store=420634] [error="\"[rpc failed] RpcFailure: 12-UNIMPLEMENTED\""]
[2023/05/22 16:16:41.844 +08:00] [INFO] [advance.rs:295] ["check leader failed"] [to_store=420633] [error="\"[rpc failed] RpcFailure: 12-UNIMPLEMENTED\""]
[2023/05/22 16:16:41.845 +08:00] [INFO] [advance.rs:295] ["check leader failed"] [to_store=1580508] [error="\"[rpc failed] RpcFailure: 12-UNIMPLEMENTED\""]
[2023/05/22 16:16:42.837 +08:00] [INFO] [advance.rs:295] ["check leader failed"] [to_store=420634] [error="\"[rpc failed] RpcFailure: 12-UNIMPLEMENTED\""]
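
To confirm that the to_store IDs in these messages (420633, 420634 and 1580508) really are TiFlash stores, they can be looked up in PD. A minimal sketch, assuming <pd-address> is a PD endpoint (placeholder); TiFlash stores normally carry the label engine=tiflash:

  # Inspect a single store by ID; the labels field shows engine=tiflash for TiFlash nodes
  tiup ctl:v5.4.2 pd -u http://<pd-address>:2379 store 420633
  tiup ctl:v5.4.2 pd -u http://<pd-address>:2379 store 420634
  tiup ctl:v5.4.2 pd -u http://<pd-address>:2379 store 1580508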
| username: Billmay表妹 | Original post link

What operation was performed that caused the problem?

| username: robert233 | Original post link

I have checked the logs on all nodes, and one node's log reported a different error, get snapshot failed (a pd-ctl look at the region's current epoch is sketched after the log):

[2023/05/22 19:29:29.641 +08:00] [INFO] [scheduler.rs:517] ["get snapshot failed"] [err="Error(Request(message: \"EpochNotMatch current epoch of region 233969 is conf_ver: 209 version: 19544, but you sent conf_ver: 209 version: 19315\" epoch_not_match { current_regions { id: 233969 start_key: 7480000000000000FF495F698000000000FF0000030380000188FF431D172C03EC0000FF04144F527B000000FC end_key: 7480000000000000FF495F698000000000FF0000040125303025FF30302530FF302530FF3025303025FF3030FF253030253030FF25FF30302530302530FFFF3025303025303025FFFF30302530302530FF30FF253030253030FF2530FF3025303025FF303025FF30302530FF30253030FF253030FF2530302530FF3025FF303025303025FF30FF30253030253030FFFF2530302530302530FFFF30253030253030FF25FF303025303025FF3030FF2530302530FF302530FF30253030FF25303025FF303025FF3030253030FF2530FF302530302530FF30FF25303000000000FBFF038C000000DCA922FF5400000000000000F8 region_epoch { conf_ver: 209 version: 19544 } peers { id: 920709 store_id: 918725 } peers { id: 1233844 store_id: 2 } peers { id: 1452699 store_id: 1441108 } } current_regions { id: 2204582 start_key: 7480000000000000FF495F698000000000FF0000030380000188FF43099EE403B00000FF041440B0F8000000FC end_key: 7480000000000000FF495F698000000000FF0000030380000188FF431D172C03EC0000FF04144F527B000000FC region_epoch { conf_ver: 209 version: 19544 } peers { id: 2204583 store_id: 918725 } peers { id: 2204584 store_id: 2 } peers { id: 2204585 store_id: 1441108 } } current_regions { id: 2203237 start_key: 7480000000000000FF495F698000000000FF0000030380000188FF42F61AF103D40000FF041432227B000000FC end_key: 7480000000000000FF495F698000000000FF0000030380000188FF43099EE403B00000FF041440B0F8000000FC region_epoch { conf_ver: 209 version: 19543 } peers { id: 2203238 store_id: 918725 } peers { id: 2203239 store_id: 2 } peers { id: 2203240 store_id: 1441108 } } current_regions { id: 2201777 start_key: 7480000000000000FF495F698000000000FF0000030380000188FF42E3AE7403CC0000FF041423C243000000FC end_key: 7480000000000000FF495F698000000000FF0000030380000188FF42F61AF103D40000FF041432227B000000FC region_epoch { conf_ver: 209 version: 19542 } peers { id: 2201778 store_id: 918725 } peers { id: 2201779 store_id: 2 } peers { id: 2201780 store_id: 1441108 } } current_regions { id: 2200253 start_key: 7480000000000000FF495F698000000000FF0000030380000188FF42CF694503A00000FF041414D5F4000000FC end_key: 7480000000000000FF495F698000000000FF0000030380000188FF42E3AE7403CC0000FF041423C243000000FC region_epoch { conf_ver: 209 version: 19541 } peers { id: 2200254 store_id: 918725 } peers { id: 2200255 store_id: 2 } peers { id: 2200256 store_id: 1441108 } } current_regions { id: 2198838 start_key: 7480000000000000FF495F698000000000FF0000030380000188FF42B94BE703C40000FF041405DF65000000FC end_key: 7480000000000000FF495F698000000000FF0000030380000188FF42CF694503A00000FF041414D5F4000000FC region_epoch { conf_ver: 209 version: 19540 } peers { id: 2198839 store_id: 918725 } peers { id: 2198840 store_id: 2 } peers { id: 2198841 store_id: 1441108 } } current_regions { id: 2197255 start_key: 7480000000000000FF495F698000000000FF0000030380000188FF42A1B2AF03900000FF0413F75E9C000000FC end_key: 7480000000000000FF495F698000000000FF0000030380000188FF42B94BE703C40000FF041405DF65000000FC region_epoch { conf_ver: 209 version: 19539 } peers { id: 2197256 store_id: 918725 } peers { id: 2197257 store_id: 2 } peers { id: 2197258 store_id: 1441108 } } current_regions { id: 2195432 start_key: 7480000000000000FF495F698000000000FF0000030380000188FF42885F55038C0000FF0413E8906D000000FC end_key: 
7480000000000000FF495F698000000000FF0000030380000188FF42A1B2AF03900000FF0413F75E9C000000FC region_epoch { conf_ver: 209 version: 19538 } peers { id: 2195433 store_id: 918725 } peers { id: 2195434 store_id: 2 } peers { id: 2195435 store_id: 1441108 } } current_regions { id: 2193606 start_key: 7480000000000000FF495F698000000000FF0000030380000188FF426E7D0503D80000FF0413DA3915000000FC end_key: 7480000000000000FF495F698000000000FF0000030380000188FF42885F55038C0000FF0413E8906D000000FC region_epoch { conf_ver: 209 version: 19537 } peers { id: 2193607 store_id: 918725 } peers { id: 2193608 store_id: 2 } peers { id: 2193609 store_id: 1441108 } } current_regions { id: 2191754 start_key: 7480000000000000FF495F698000000000FF0000030380000188FF4253FCC203B00000FF0413CBDE69000000FC end_key: 7480000000000000FF495F698000000000FF0000030380000188FF426E7D0503D80000FF0413DA3915000000FC region_epoch { conf_ver: 209 version: 19536 } peers { id: 2191755 store_id: 918725 } peers { id: 2191756 store_id: 2 } peers { id: 2191757 store_id: 1441108 } } current_regions { id: 2189965 start_key: 7480000000000000FF495F698000000000FF0000030380000188FF4239A1FE03D40000FF0413BCE6AC000000FC end_key: 7480000000000000FF495F698000000000FF0000030380000188FF4253FCC203B00000FF0413CBDE69000000FC region_epoch { conf_ver: 209 version: 19535 } peers { id: 2189966 store_id: 918725 } peers { id: 2189967 store_id: 2 } peers { id: 2189968 store_id: 1441108 } } }))"] [cid=5999975400]
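
EpochNotMatch usually means a request carried a stale region version after splits or configuration changes; it is normally retried automatically once the region cache is refreshed, so it looks more like a symptom than a root cause here. To see the current epoch, peers and leader of the region in the message, it can be queried from PD. A minimal sketch, assuming <pd-address> is a PD endpoint (placeholder):

  # Show region 233969, including region_epoch (conf_ver / version), its peers and the current leader
  tiup ctl:v5.4.2 pd -u http://<pd-address>:2379 region 233969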
| username: robert233 | Original post link

Identified the cause:
TiKV was not reporting region heartbeats to PD. Version v5.4.2 has a known bug: Master: two tikv don't report region heartbeat after inject fault to pd leader · Issue #12934 · tikv/tikv · GitHub

  1. First tried restarting PD; after the PD leader was re-elected, the scheduler started working again.
  2. Then upgraded the cluster to v5.4.3; no anomalies have been observed so far. (The commands for both steps are sketched below.)
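
For reference, both steps can be done with tiup and pd-ctl. A minimal sketch, assuming the cluster is named <cluster-name> and <pd-address> is a PD endpoint (both placeholders); resigning the PD leader is the lighter-weight alternative to a full PD restart:

  # Option A: ask the current PD leader to step down so that another member is elected
  tiup ctl:v5.4.2 pd -u http://<pd-address>:2379 member leader resign

  # Option B: restart only the PD components of the cluster
  tiup cluster restart <cluster-name> -R pd

  # Then upgrade to the release that contains the fix
  tiup cluster upgrade <cluster-name> v5.4.3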
| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.