Abnormal PD Leader Switch in Production

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 线上pd异常切换

| username: xxxxxxxx

Version: TiDB 4.0.13
Operation: None
Issue: PD abnormal leader switch

The old leader reported the following errors (there were also some PD:etcd:ErrEtcdKVPut errors):
[2022/08/25 17:32:21.656 +08:00] [ERROR] [tso.go:302] ["invalid timestamp"] [timestamp={}]
[2022/08/25 17:32:21.665 +08:00] [ERROR] [tso.go:302] ["invalid timestamp"] [timestamp={}]
[2022/08/25 17:32:21.673 +08:00] [ERROR] [tso.go:302] ["invalid timestamp"] [timestamp={}]

The new leader's logs are as follows:

[2022/08/25 17:32:43.058 +08:00] [INFO] [raft.go:923] ["6cf576fb46c07868 is starting a new election at term 2"]
[2022/08/25 17:32:43.063 +08:00] [INFO] [raft.go:729] ["6cf576fb46c07868 became pre-candidate at term 2"]
[2022/08/25 17:32:43.063 +08:00] [INFO] [raft.go:824] ["6cf576fb46c07868 received MsgPreVoteResp from 6cf576fb46c07868 at term 2"]
[2022/08/25 17:32:43.063 +08:00] [INFO] [raft.go:811] ["6cf576fb46c07868 [logterm: 2, index: 34283891] sent MsgPreVote request to a5bb61ba213ecdc0 at term 2"]
[2022/08/25 17:32:43.063 +08:00] [INFO] [raft.go:811] ["6cf576fb46c07868 [logterm: 2, index: 34283891] sent MsgPreVote request to b6aed577daad8738 at term 2"]
[2022/08/25 17:32:43.063 +08:00] [INFO] [node.go:331] ["raft.node: 6cf576fb46c07868 lost leader b6aed577daad8738 at term 2"]
[2022/08/25 17:32:43.066 +08:00] [INFO] [raft.go:824] ["6cf576fb46c07868 received MsgPreVoteResp from a5bb61ba213ecdc0 at term 2"]
[2022/08/25 17:32:43.066 +08:00] [INFO] [raft.go:1302] ["6cf576fb46c07868 has received 2 MsgPreVoteResp votes and 0 vote rejections"]
[2022/08/25 17:32:43.067 +08:00] [INFO] [raft.go:713] ["6cf576fb46c07868 became candidate at term 3"]
[2022/08/25 17:32:43.067 +08:00] [INFO] [raft.go:824] ["6cf576fb46c07868 received MsgVoteResp from 6cf576fb46c07868 at term 3"]
[2022/08/25 17:32:43.067 +08:00] [INFO] [raft.go:811] ["6cf576fb46c07868 [logterm: 2, index: 34283891] sent MsgVote request to a5bb61ba213ecdc0 at term 3"]
[2022/08/25 17:32:43.067 +08:00] [INFO] [raft.go:811] ["6cf576fb46c07868 [logterm: 2, index: 34283891] sent MsgVote request to b6aed577daad8738 at term 3"]
[2022/08/25 17:32:43.067 +08:00] [INFO] [raft.go:824] ["6cf576fb46c07868 received MsgVoteResp from a5bb61ba213ecdc0 at term 3"]
[2022/08/25 17:32:50.469 +08:00] [ERROR] [client.go:172] ["region sync with leader meet error"] [error="[PD:grpc:ErrGRPCRecv]rpc error: code = Canceled desc = context canceled"]
[2022/08/25 17:32:51.470 +08:00] [INFO] [server.go:1099] ["leader changed, try to campaign leader"]
[2022/08/25 17:32:51.470 +08:00] [INFO] [server.go:1115] ["start to campaign leader"] [campaign-leader-name=pd-192.168.1.2-13115]
[2022/08/25 17:32:51.472 +08:00] [INFO] [server.go:1134] ["campaign leader ok"] [campaign-leader-name=pd-192.168.1.2-13115]
[2022/08/25 17:32:51.485 +08:00] [INFO] [server.go:158] ["establish sync region stream"] [requested-server=pd-192.168.1.1-13115] [url=http://192.168.1.1:13115]
[2022/08/25 17:32:51.485 +08:00] [INFO] [server.go:176] ["requested server has already in sync with server"] [requested-server=pd-192.168.1.1-13115] [server=pd-192.168.1.2-13115] [last-index=1089874144]
[2022/08/25 17:32:51.492 +08:00] [INFO] [tso.go:298] ["sync hasn't completed yet, wait for a while"]
[2022/08/25 17:32:51.495 +08:00] [INFO] [tso.go:298] ["sync hasn't completed yet, wait for a while"]
[2022/08/25 17:32:51.514 +08:00] [INFO] [tso.go:298] ["sync hasn't completed yet, wait for a while"]

It looks like the timestamp update failed. I would like to understand the root cause: what are the possible reasons for a timestamp update failure?
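To put the two log excerpts on one timeline, a small parser can help. The following is a minimal sketch, assuming the usual `[time] [LEVEL] [file:line] ["message"]` layout seen in the lines above; the helper names `parse_line` and `seconds_between` are made up for illustration and are not part of any PD tooling. Note that the excerpts come from two different nodes, so any gap computed across them is only as accurate as the clocks on those machines.

```python
import re
from datetime import datetime

# Leading fields of a PD log line: [time] [LEVEL] [file:line] ["message"].
# Straight and typographic double quotes are both accepted around the message.
LINE_RE = re.compile(
    r'^\[(?P<ts>[^\]]+)\] \[(?P<level>[A-Z]+)\] \[(?P<src>[^\]]+)\] '
    r'\[["\u201c](?P<msg>.*?)["\u201d]\]'
)

def parse_line(line):
    """Return (timestamp, level, source, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y/%m/%d %H:%M:%S.%f %z")
    return ts, m.group("level"), m.group("src"), m.group("msg")

def seconds_between(lines, first_msg, second_msg):
    """Seconds from the latest line containing first_msg to the next line containing second_msg."""
    start = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, _, _, msg = parsed
        if first_msg in msg:
            start = ts
        elif second_msg in msg and start is not None:
            return (ts - start).total_seconds()
    return None

# Three lines copied from the logs in this thread.
SAMPLE = [
    '[2022/08/25 17:32:21.673 +08:00] [ERROR] [tso.go:302] ["invalid timestamp"] [timestamp={}]',
    '[2022/08/25 17:32:43.067 +08:00] [INFO] [raft.go:713] ["6cf576fb46c07868 became candidate at term 3"]',
    '[2022/08/25 17:32:51.472 +08:00] [INFO] [server.go:1134] ["campaign leader ok"] [campaign-leader-name=pd-192.168.1.2-13115]',
]

if __name__ == "__main__":
    gap = seconds_between(SAMPLE, "invalid timestamp", "campaign leader ok")
    print(f"gap between last TSO error and new leader: {gap:.3f}s")
```

Running the same function over the complete log files from all three PD members, rather than these excerpts, would give a better picture of how long TSO requests were failing.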

| username: xfworld

After the PD leader fails, a re-election is initiated and a leader is elected for the new term. The TSO is then reset, to ensure that timestamps stay monotonic and are never duplicated.

Other members will also synchronize this data, resulting in a brief period of unavailability.

Once the leader election and synchronization process is complete, the service can officially resume.
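The reset-and-resync behaviour described above can be illustrated with a toy model. This is a simplified sketch, not PD's actual code: the class `TimestampOracle`, its method names, the dict standing in for etcd, and the two constants are all assumptions made for illustration. The idea it shows is that a leader persists an upper bound ahead of the timestamps it hands out, a new leader starts above that bound, and a member whose in-memory timestamp has been reset refuses requests until it has synced again.

```python
class TimestampOracle:
    """Toy model of a leader-only timestamp allocator (illustrative, not PD's code)."""

    SAVE_INTERVAL_MS = 3000  # assumed: how far ahead the persisted upper bound is pushed
    GUARD_MS = 1             # assumed: minimum step past the last persisted value

    def __init__(self, store):
        self.store = store    # a dict standing in for etcd
        self.physical = None  # None means "reset / not initialized"
        self.logical = 0

    def sync(self, now_ms):
        """Run by a newly elected leader before it serves any timestamp."""
        saved = self.store.get("timestamp", 0)
        # Never start below what a previous leader may already have handed out.
        start = max(now_ms, saved + self.GUARD_MS)
        # Persist the new upper bound first, then publish the in-memory value.
        self.store["timestamp"] = start + self.SAVE_INTERVAL_MS
        self.physical, self.logical = start, 0

    def reset(self):
        """Run when leadership is lost; requests fail until sync() runs again."""
        self.physical = None

    def get_ts(self):
        if self.physical is None:
            raise RuntimeError("invalid timestamp: in-memory timestamp is not initialized")
        self.logical += 1
        return (self.physical, self.logical)


if __name__ == "__main__":
    etcd = {}
    old_leader = TimestampOracle(etcd)
    old_leader.sync(now_ms=1000)
    before = old_leader.get_ts()

    old_leader.reset()                # leadership lost
    new_leader = TimestampOracle(etcd)
    new_leader.sync(now_ms=1500)      # wall clock is still below the saved bound
    after = new_leader.get_ts()

    print(before, after, after > before)
```

In this model, the window between `reset()` on the old leader and `sync()` on the new one is the brief period of unavailability: every `get_ts()` call in that window fails, which is roughly what the "invalid timestamp" and "sync hasn't completed yet" messages in the logs above suggest.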

| username: forever

Have you checked whether there were any network issues at that time, or any problems with the operating system time?

| username: xxxxxxxx

I checked the network and found no issues. The operating system time at that moment can no longer be checked after the fact.

| username: Raymond

Under normal circumstances, what is the term length of a PD leader?

| username: xfworld

I found this image elsewhere; take a look for reference.

| username: Raymond

I don’t quite understand. Does it mean that if it exceeds 3 seconds, it will trigger the PD leader election?

| username: alfred

So does this mean that this is a normal error log, and that a re-election is triggered if the original leader times out for 3 seconds (the default)?
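The lease idea behind that 3-second figure can be sketched as follows. This is a toy model under stated assumptions, not PD's implementation: the class `LeaderLease` and its methods are hypothetical, and the clock is passed in as a plain number so the behaviour is easy to follow. The leader must renew its lease before the TTL runs out; if a renewal arrives too late (for example because a write to etcd was slow or failed), the lease is gone and other members are free to campaign.

```python
class LeaderLease:
    """Toy model of a leader lease with a fixed TTL (illustrative, not PD's code)."""

    def __init__(self, ttl_seconds=3.0):
        # 3 seconds is the default mentioned in this thread.
        self.ttl = ttl_seconds
        self.expires_at = None

    def grant(self, now):
        """Start the lease when the member wins the campaign."""
        self.expires_at = now + self.ttl

    def keep_alive(self, now):
        """Renew the lease; returns False if it had already expired."""
        if self.is_expired(now):
            return False
        self.expires_at = now + self.ttl
        return True

    def is_expired(self, now):
        return self.expires_at is None or now >= self.expires_at


if __name__ == "__main__":
    lease = LeaderLease()
    lease.grant(now=0.0)
    print(lease.keep_alive(now=1.0))   # renewed in time
    print(lease.keep_alive(now=2.0))   # renewed in time, lease now runs to 5.0
    print(lease.keep_alive(now=5.5))   # too late: the lease has expired
```

Under this model, a single stalled renewal longer than the TTL is enough to lose leadership, even if the network and the process are otherwise healthy.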