PD Leader Error

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: pd leader报错

| username: 超7成网友

As the title says, the PD leader node encountered an anomaly, which triggered a leader switch.

The logs from the PD leader node at the time of the anomaly are as follows:

[2024/06/03 09:51:22.727 +08:00] [WARN] [raft.go:363] [“leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk”] [to=cd8ec02f78c9a682] [heartbeat-interval=500ms] [expected-duration=1s] [exceeded-duration=63.080465ms]
[2024/06/01 09:51:22.733 +08:00] [WARN] [raft.go:363] [“leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk”] [to=8344550797e5521f] [heartbeat-interval=500ms] [expected-duration=1s] [exceeded-duration=69.866043ms]
[2024/06/01 09:51:45.824 +08:00] [WARN] [raft.go:363] [“leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk”] [to=cd8ec02f78c9a682] [heartbeat-interval=500ms] [expected-duration=1s] [exceeded-duration=161.18652ms]
[2024/06/01 09:51:45.829 +08:00] [WARN] [raft.go:363] [“leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk”] [to=8344550797e5521f] [heartbeat-interval=500ms] [expected-duration=1s] [exceeded-duration=166.628628ms]
[2024/06/01 09:42:09.772 +08:00] [WARN] [raft.go:363] [“leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk”] [to=cd8ec02f78c9a682] [heartbeat-interval=500ms] [expected-duration=1s] [exceeded-duration=108.628828ms]
[2024/06/01 09:42:09.780 +08:00] [WARN] [raft.go:363] [“leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk”] [to=8344550797e5521f] [heartbeat-interval=500ms] [expected-duration=1s] [exceeded-duration=117.112641ms]
[2024/06/01 09:52:26.796 +08:00] [WARN] [raft.go:363] [“leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk”] [to=cd8ec02f78c9a682] [heartbeat-interval=500ms] [expected-duration=1s] [exceeded-duration=16.07931ms]
[2024/06/01 09:52:26.806 +08:00] [WARN] [raft.go:363] [“leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk”] [to=8344550797e5521f] [heartbeat-interval=500ms] [expected-duration=1s] [exceeded-duration=25.979491ms]
[2024/06/01 09:52:54.888 +08:00] [ERROR] [server.go:229] [“region syncer send data meet error”] [error=“rpc error: code = Canceled desc = context canceled”]
[2024/06/01 09:52:54.906 +08:00] [ERROR] [server.go:229] [“region syncer send data meet error”] [error=“rpc error: code = Canceled desc = context canceled”]
[2024/06/01 09:52:54.907 +08:00] [INFO] [server.go:238] [“region syncer delete the stream”] [stream=pd_kv03]
[2024/06/01 09:52:54.907 +08:00] [INFO] [server.go:238] [“region syncer delete the stream”] [stream=pd_kv05]
[2024/06/01 09:43:39.161 +08:00] [WARN] [node.go:408] [“e4b9b3f5dd8bcee4 (leader true) A tick missed to fire. Node blocks too long!”]

| username: TiDBer_ZxWlj6A1 | Original post link

It looks like your PD leader's disk is slow: the heartbeat interval is 500 ms, yet heartbeats are still going out late, so region data cannot be synchronized in time.
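
As a quick sanity check, you can measure how often and by how much the leader overran its heartbeats from the `exceeded-duration` field in the warnings above. The sketch below is only illustrative; the log path is an assumption, so adjust it to your deployment:

```python
import re
import sys

# Assumed log location; point this at your PD leader's pd.log.
LOG_PATH = sys.argv[1] if len(sys.argv) > 1 else "pd.log"

# Matches the etcd/raft warning shown above, e.g. [exceeded-duration=63.080465ms]
pattern = re.compile(r"failed to send out heartbeat.*exceeded-duration=([\d.]+)(ms|s)\]")

overruns_ms = []
with open(LOG_PATH, errors="replace") as f:
    for line in f:
        m = pattern.search(line)
        if m:
            value, unit = float(m.group(1)), m.group(2)
            overruns_ms.append(value if unit == "ms" else value * 1000)

if overruns_ms:
    print(f"heartbeat overruns: {len(overruns_ms)}")
    print(f"max exceeded-duration: {max(overruns_ms):.1f} ms")
    print(f"avg exceeded-duration: {sum(overruns_ms) / len(overruns_ms):.1f} ms")
else:
    print("no heartbeat overruns found")
```

Frequent or growing overruns would support the slow-disk theory rather than a one-off hiccup.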

| username: 这里介绍不了我 | Original post link

The load on this machine is high, right? Also, can this cluster version be upgraded?

| username: songxuecheng | Original post link

Hybrid deployment?

| username: 超7成网友 | Original post link

It's a mixed deployment; pd-server and tikv-server run on the same machine.

| username: songxuecheng | Original post link

If it's a resource issue, check the machine's CPU, I/O, and so on directly, and identify which specific process is consuming them.
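
For example (a rough sketch, assuming Python and the third-party psutil package are available on the node, which is an assumption rather than part of the original advice), you can list the top processes by CPU and by cumulative disk I/O to see whether pd-server, tikv-server, or something else is eating the resources:

```python
import time

import psutil  # third-party: pip install psutil

# Prime per-process CPU counters, then sample over a short window.
procs = list(psutil.process_iter(["pid", "name"]))
for p in procs:
    try:
        p.cpu_percent(None)
    except psutil.Error:
        pass
time.sleep(2)

rows = []
for p in procs:
    try:
        cpu = p.cpu_percent(None)  # % CPU over the 2 s window
        io = p.io_counters()       # cumulative bytes since process start (Linux)
        rows.append((cpu, io.read_bytes + io.write_bytes, p.pid, p.info["name"]))
    except psutil.Error:
        continue

print(f"{'CPU%':>6} {'IO bytes':>14} {'PID':>7} NAME")
for cpu, io_bytes, pid, name in sorted(rows, reverse=True)[:10]:
    print(f"{cpu:6.1f} {io_bytes:14d} {pid:7d} {name}")
```

The same information can of course be read off with top/iostat/pidstat; the point is just to tie the disk pressure to a concrete process.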

| username: Billmay表妹 | Original post link

Upgrade it~
You can take advantage of the current upgrade event:

| username: 濱崎悟空 | Original post link

Such an old version~~

| username: 像风一样的男子 | Original post link

I suggest upgrading; the version is too old.

| username: Jack-li | Original post link

Upgrade it; the community is also running that upgrade event right now.

| username: tony5413 | Original post link

The version is quite old.

| username: 友利奈绪 | Original post link

I suggest upgrading to see if it helps.

| username: TiDBer_QYr0vohO | Original post link

3.0.9

| username: 超7成网友 | Original post link

After PD elects a new leader, the new leader's log keeps printing the messages below, and the cluster still cannot be accessed normally. Why does this happen? If I restart the new leader and transfer the leadership back to the original node, the cluster recovers quickly.

[2024/06/04 19:38:52.576 +08:00] [WARN] [cluster_info.go:92] [“region is stale”] [error="region is stale: region id:890419759 start_key:"t\200\000\000\000\000\020\317\377\316_r\200\000\000\000\000\377\003\333\241\000\000\000\000\000\372" end_key:"t\200\000\000\000\000\020\317\377\316_r\200\000\000\000\000\377\003\333\360\000\000\000\000\000\372" region_epoch:<conf_ver:1625 version:366449 > peers:<id:890419760 store_id:1 > peers:<id:890419761 store_id:4 > peers:<id:890419762 store_id:8 > origin id:890395107 start_key:"t\200\000\000\000\000\020\317\377\316_r\200\000\000\000\000\377\003\333\241\000\000\000\000\000\372" end_key:"t\200\000\000\000\000\020\317\377\316_r\200\000\000\000\000\377\003\333\360\000\000\000\000\000\372" region_epoch:<conf_ver:1625 version:366477 > peers:<id:890395108 store_id:1 > peers:<id:890395109 store_id:4 > peers:<id:890395110 store_id:8 > "] [origin=]
[2024/06/04 19:38:52.576 +08:00] [WARN] [cluster_info.go:92] [“region is stale”] [error="region is stale: region id:890419776 start_key:"t\200\000\000\000\000\0234\377#_r\200\000\000\000\001\377{\252\003\000\000\000\000\000\372" end_key:"t\200\000\000\000\000\0234\377#_r\200\000\000\000\001\377\202\235h\000\000\000\000\000\372" region_epoch:<conf_ver:3998 version:399473 > peers:<id:890419777 store_id:8 > peers:<id:890419778 store_id:4 > peers:<id:890972994 store_id:10 > origin id:890311137 start_key:"t\200\000\000\000\000\0234\377#_r\200\000\000\000\001\377{\252\003\000\000\000\000\000\372" end_key:"t\200\000\000\000\000\0234\377#_r\200\000\000\000\001\377\202\235h\000\000\000\000\000\372" region_epoch:<conf_ver:3992 version:399487 > peers:<id:890311138 store_id:8 > peers:<id:890311139 store_id:4 > peers:<id:890311140 store_id:10 > "] [origin=]
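
For context, the "region is stale" warning comes from PD comparing region epochs: a reported region is rejected when its region_epoch (version / conf_ver) is older than the epoch PD already holds for the same key range. The sketch below is a simplified illustration of that comparison, not PD's actual code; the sample values are copied from the first log line above:

```python
from dataclasses import dataclass

@dataclass
class RegionEpoch:
    conf_ver: int  # bumped on peer (membership) changes
    version: int   # bumped on region splits/merges

def is_stale(incoming: RegionEpoch, origin: RegionEpoch) -> bool:
    """Simplified form of the check behind the "region is stale" warning:
    the incoming report loses if either epoch counter is behind what is
    already recorded for the same key range."""
    return (incoming.version < origin.version
            or incoming.conf_ver < origin.conf_ver)

# Values taken from the first warning above:
incoming = RegionEpoch(conf_ver=1625, version=366449)  # region id 890419759
origin = RegionEpoch(conf_ver=1625, version=366477)    # origin id 890395107
print(is_stale(incoming, origin))  # True -> "region is stale" is logged
```

In the first line above, the reported region carries version 366449 while PD already holds 366477 for that key range, so the report is rejected as stale.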

| username: WalterWj | Original post link

Is TiKV reporting its information to PD normally? How about checking the network first? Can TiKV connect to the PD port?
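
For a basic reachability check (a minimal sketch; the host names below are placeholders taken from the stream names in the earlier log, and 2379 is only PD's default client port, so substitute your real topology):

```python
import socket

# Placeholder endpoints; replace with your real PD hosts and client port.
PD_ENDPOINTS = [("pd_kv03", 2379), ("pd_kv05", 2379)]

for host, port in PD_ENDPOINTS:
    try:
        # Plain TCP connect test from the TiKV host to the PD client port.
        with socket.create_connection((host, port), timeout=3):
            print(f"{host}:{port} reachable")
    except OSError as exc:
        print(f"{host}:{port} NOT reachable: {exc}")
```

A failed connect here would point at firewall or DNS problems between TiKV and the new PD leader rather than at PD itself.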

| username: Kongdom | Original post link

It's possible that, because of the mixed deployment, the new PD node can't respond in time, or that there are communication problems between the nodes.