PD crashed for unknown reasons

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: pd不知明原因挂掉

| username: TiDBer_DVnkdzKV

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] 6.2
[Encountered Problem] PD crashed for unknown reasons
[Reproduction Path] No operations
[Problem Phenomenon and Impact]
Partial server connections interrupted

[Attachments]
tidb_stderr.log
{“level”:“warn”,“ts”:“2022-09-28T16:23:02.963+0800”,“logger”:“etcd-client”,“caller”:“v3@v3.5.2/retry_interceptor.go:62”,“msg”:“retrying of unary invoker failed”,“target”:“etcd-endpoints://0xc000d0e000/172.21.0.14:2379”,“attempt”:0,“error”:“rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout”}

{“level”:“warn”,“ts”:“2022-09-28T16:23:04.963+0800”,“logger”:“etcd-client”,“caller”:“v3@v3.5.2/retry_interceptor.go:62”,“msg”:“retrying of unary invoker failed”,“target”:“etcd-endpoints://0xc000d0e000/172.21.0.14:2379”,“attempt”:1,“error”:“rpc error: code = DeadlineExceeded desc = context deadline exceeded”}

{“level”:“warn”,“ts”:“2022-09-28T16:23:10.964+0800”,“logger”:“etcd-client”,“caller”:“v3@v3.5.2/retry_interceptor.go:62”,“msg”:“retrying of unary invoker failed”,“target”:“etcd-endpoints://0xc000d0e000/172.21.0.14:2379”,“attempt”:0,“error”:“rpc error: code = DeadlineExceeded desc = context deadline exceeded”}

{“level”:“warn”,“ts”:“2022-09-28T16:23:16.964+0800”,“logger”:“etcd-client”,“caller”:“v3@v3.5.2/retry_interceptor.go:62”,“msg”:“retrying of unary invoker failed”,“target”:“etcd-endpoints://0xc000d0e000/172.21.0.14:2379”,“attempt”:0,“error”:“rpc error: code = DeadlineExceeded desc = context deadline exceeded”}
pd_stderr.log
[2022/09/28 16:21:58.000 +08:00] [WARN] [retry_interceptor.go:62] [“retrying of unary invoker failed”] [target=endpoint://client-a0b85822-3db7-41ab-a2f9-123e14c288e8/172.21.0.14:2379] [attempt=0] [error=“rpc error: code = DeadlineExceeded desc = context deadline exceeded”]

[2022/09/28 16:23:06.755 +08:00] [WARN] [retry_interceptor.go:62] [“retrying of unary invoker failed”] [target=endpoint://client-a0b85822-3db7-41ab-a2f9-123e14c288e8/172.21.0.14:2379] [attempt=0] [error=“rpc error: code = DeadlineExceeded desc = context deadline exceeded”]

[2022/09/28 16:23:06.759 +08:00] [WARN] [retry_interceptor.go:62] [“retrying of unary invoker failed”] [target=endpoint://client-a0b85822-3db7-41ab-a2f9-123e14c288e8/172.21.0.14:2379] [attempt=0] [error=“rpc error: code = DeadlineExceeded desc = context deadline exceeded”]

[2022/09/28 16:21:50.048 +08:00] [WARN] [retry_interceptor.go:62] [“retrying of unary invoker failed”] [target=endpoint://client-a0b85822-3db7-41ab-a2f9-123e14c288e8/172.21.0.14:2379] [attempt=0] [error=“rpc error: code = DeadlineExceeded desc = context deadline exceeded”]

[2022/09/28 16:23:06.992 +08:00] [WARN] [retry_interceptor.go:62] [“retrying of unary invoker failed”] [target=endpoint://client-a0b85822-3db7-41ab-a2f9-123e14c288e8/172.21.0.14:2379] [attempt=0] [error=“rpc error: code = DeadlineExceeded desc = context deadline exceeded”]

[2022/09/28 16:23:10.065 +08:00] [WARN] [retry_interceptor.go:62] [“retrying of unary invoker failed”] [target=endpoint://client-a0b85822-3db7-41ab-a2f9-123e14c288e8/172.21.0.14:2379] [attempt=0] [error=“rpc error: code = Unavailable desc = etcdserver: leader changed”]

[2022/09/28 16:23:17.028 +08:00] [WARN] [retry_interceptor.go:62] [“retrying of unary invoker failed”] [target=endpoint://client-a0b85822-3db7-41ab-a2f9-123e14c288e8/172.21.0.14:2379] [attempt=1] [error=“rpc error: code = DeadlineExceeded desc = context deadline exceeded”]

[2022/09/28 16:23:20.182 +08:00] [WARN] [retry_interceptor.go:62] [“retrying of unary invoker failed”] [target=endpoint://client-a0b85822-3db7-41ab-a2f9-123e14c288e8/172.21.0.14:2379] [attempt=0] [error=“rpc error: code = DeadlineExceeded desc = context deadline exceeded”]

Don’t know what caused it.

| username: songxuecheng | Original post link

Please provide the PD logs. It looks like a network issue; check the network first.

| username: TiDBer_DVnkdzKV | Original post link

[2022/09/28 16:23:10.081 +08:00] [WARN] [server.go:1102] [“failed to revoke lease”] [lease-id=5e5482d8c84e1d91] [error=“lease not found”]

[2022/09/28 16:23:10.081 +08:00] [WARN] [server.go:1102] [“failed to revoke lease”] [lease-id=0a9282d8c82d36af] [error=“lease not found”]

[2022/09/28 16:23:10.081 +08:00] [WARN] [util.go:121] [“failed to apply request”] [took=44.354µs] [request=“header:<ID:9985187191419172187 > lease_revoke:id:5e5482d8c8512356”] [response=size:31] [error=“lease not found”]

[2022/09/28 16:23:10.081 +08:00] [WARN] [server.go:1102] [“failed to revoke lease”] [lease-id=5e5482d8c84a26c5] [error=“lease not found”]

[2022/09/28 16:23:10.081 +08:00] [WARN] [server.go:1102] [“failed to revoke lease”] [lease-id=5e5482d8c8512356] [error=“lease not found”]

[2022/09/28 16:23:10.081 +08:00] [WARN] [util.go:121] [“failed to apply request”] [took=39.925µs] [request=“header:<ID:9985187191419172188 > lease_revoke:id:5e5482d8c84a26c2”] [response=size:31] [error=“lease not found”]

[2022/09/28 16:23:10.081 +08:00] [WARN] [util.go:121] [“failed to apply request”] [took=31.62µs] [request=“header:<ID:9985187191419172189 > lease_revoke:id:1dd582d8c84f797b”] [response=size:31] [error=“lease not found”]

[2022/09/28 16:23:10.081 +08:00] [WARN] [util.go:121] [“failed to apply request”] [took=16.881µs] [request=“header:<ID:9985187191419172190 > lease_revoke:id:1dd582d8c84f7971”] [response=size:31] [error=“lease not found”]

[2022/09/28 16:23:10.081 +08:00] [WARN] [server.go:1102] [“failed to revoke lease”] [lease-id=5e5482d8c84a26c2] [error=“lease not found”]

[2022/09/28 16:23:10.081 +08:00] [WARN] [server.go:1102] [“failed to revoke lease”] [lease-id=1dd582d8c84f797b] [error=“lease not found”]

[2022/09/28 16:23:10.081 +08:00] [WARN] [util.go:121] [“failed to apply request”] [took=2.885µs] [request=“header:<ID:9985187191419172191 > put:<key:"/topology/tidb/172.21.0.4:4000/ttl" value_size:19 lease:6797201605192868870 >”] [response=] [error=“lease not found”]

[2022/09/28 16:23:10.081 +08:00] [WARN] [util.go:121] [“failed to apply request”] [took=27.812µs] [request=“header:<ID:11373140301580791609 > lease_revoke:id:1dd582d8c835e4bf”] [response=size:31] [error=“lease not found”]

[2022/09/28 16:23:10.081 +08:00] [WARN] [server.go:1102] [“failed to revoke lease”] [lease-id=1dd582d8c84f7971] [error=“lease not found”]

[2022/09/28 16:23:10.081 +08:00] [WARN] [util.go:121] [“failed to apply request”] [took=24.056µs] [request=“header:<ID:9985187191419172192 > lease_revoke:id:0a9282d8c8316d5b”] [response=size:31] [error=“lease not found”]

[2022/09/28 16:23:10.081 +08:00] [WARN] [server.go:1102] [“failed to revoke lease”] [lease-id=0a9282d8c8316d5b] [error=“lease not found”]

[2022/09/28 16:23:10.082 +08:00] [WARN] [util.go:121] [“failed to apply request”] [took=26.972µs] [request=“header:<ID:9985187191419172193 > lease_revoke:id:0a9282d8c8316d5d”] [response=size:31] [error=“lease not found”]

[2022/09/28 16:23:10.082 +08:00] [WARN] [server.go:1102] [“failed to revoke lease”] [lease-id=0a9282d8c8316d5d] [error=“lease not found”]

[2022/09/28 16:23:10.082 +08:00] [WARN] [util.go:121] [“failed to apply request”] [took=23.364µs] [request=“header:<ID:9985187191419172194 > lease_revoke:id:5e5482d8c84e1d89”] [response=size:31] [error=“lease not found”]

[2022/09/28 16:23:10.082 +08:00] [WARN] [server.go:1102] [“failed to revoke lease”] [lease-id=5e5482d8c84e1d89] [error=“lease not found”]

[2022/09/28 16:23:10.082 +08:00] [WARN] [util.go:121] [“failed to apply request”] [took=15.159µs] [request=“header:<ID:9985187191419172198 > lease_revoke:id:5e5482d8c83c6406”] [response=size:31] [error=“lease not found”]

[2022/09/28 16:23:10.082 +08:00] [WARN] [util.go:121] [“failed to apply request”] [took=29.035µs] [request=“header:<ID:9985187191419172199 > lease_revoke:id:5e5482d8c84a26be”] [response=size:31] [error=“lease not found”]

[2022/09/28 16:23:10.083 +08:00] [WARN] [server.go:1102] [“failed to revoke lease”] [lease-id=5e5482d8c83c6406] [error=“lease not found”]

[2022/09/28 16:23:10.083 +08:00] [WARN] [util.go:121] [“failed to apply request”] [took=38.502µs] [request=“header:<ID:9985187191419172200 > lease_revoke:id:5e5482d8c853f552”] [response=size:31] [error=“lease not found”]

[2022/09/28 16:23:10.083 +08:00] [WARN] [server.go:1102] [“failed to revoke lease”] [lease-id=5e5482d8c84a26be] [error=“lease not found”]

[2022/09/28 16:23:10.083 +08:00] [WARN] [util.go:121] [“failed to apply request”] [took=21.59µs] [request=“header:<ID:9985187191419172201 > lease_revoke:id:5e5482d8c853f560”] [response=size:31] [error=“lease not found”]

[2022/09/28 16:23:10.083 +08:00] [WARN] [server.go:1102] [“failed to revoke lease”] [lease-id=5e5482d8c853f552] [error=“lease not found”]

[2022/09/28 16:23:10.083 +08:00] [WARN] [util.go:121] [“failed to apply request”] [took=36.589µs] [request=“header:<ID:9985187191419172202 > lease_revoke:id:5e5482d8c853f555”] [response=size:31] [error=“lease not found”]

[2022/09/28 16:23:10.083 +08:00] [WARN] [server.go:1102] [“failed to revoke lease”] [lease-id=5e5482d8c853f560] [error=“lease not found”]

[2022/09/28 16:23:10.083 +08:00] [WARN] [util.go:121] [“failed to apply request”] [took=23.495µs] [request=“header:<ID:9985187191419172203 > lease_revoke:id:1dd582d8c8524e92”] [response=size:31] [error=“lease not found”]

[2022/09/28 16:23:10.083 +08:00] [WARN] [server.go:1102] [“failed to revoke lease”] [lease-id=1dd582d8c8524e92] [error=“lease not found”]

[2022/09/28 16:23:10.083 +08:00] [WARN] [server.go:1102] [“failed to revoke lease”] [lease-id=5e5482d8c853f555] [error=“lease not found”]

[2022/09/28 16:23:10.115 +08:00] [WARN] [util.go:121] [“failed to apply request”] [took=5.34µs] [request=“header:<ID:9985187191419172206 > put:<key:"/topology/tidb/172.21.0.4:4000/ttl" value_size:19 lease:6797201605192868870 >”] [response=] [error=“lease not found”]

[2022/09/28 16:23:10.148 +08:00] [WARN] [util.go:121] [“failed to apply request”] [took=3.547µs] [request=“header:<ID:11373140301580791611 > put:<key:"/topology/tidb/172.21.0.4:4000/ttl" value_size:19 lease:6797201605192868870 >”] [response=] [error=“lease not found”]

[2022/09/28 16:23:10.180 +08:00] [INFO] [cluster.go:389] [“metrics are reset”]

[2022/09/28 16:23:10.180 +08:00] [INFO] [cluster.go:391] [“metrics collection job has been stopped”]

[2022/09/28 16:23:10.181 +08:00] [INFO] [cluster.go:470] [“raftcluster is stopped”]

[2022/09/28 16:23:10.181 +08:00] [INFO] [tso.go:405] [“reset the timestamp in memory”]

[2022/09/28 16:23:11.544 +08:00] [WARN] [probing_status.go:70] [“prober detected unhealthy status”] [round-tripper-name=ROUND_TRIPPER_RAFT_MESSAGE] [remote-peer-id=ae7e6304d9ba5e54] [rtt=686.041µs] [error=“dial tcp 172.21.0.8:2380: i/o timeout”]

[2022/09/28 16:23:11.561 +08:00] [WARN] [util.go:121] [“failed to apply request”] [took=60.674µs] [request=“header:<ID:11373140301580791614 > txn:<compare:<target:CREATE key:"/tidb/telemetry/owner/5e5482d8c83919f7" create_revision:0 > success:<request_put:<key:"/tidb/telemetry/owner/5e5482d8c83919f7" value_size:36 lease:6797201605192653303 >> failure:<>>”] [response=] [error=“lease not found”]

[2022/09/28 16:23:11.561 +08:00] [WARN] [util.go:121] [“failed to apply request”] [took=40.626µs] [request=“header:<ID:9985187191419172209 > txn:<compare:<target:CREATE key:"/tidb/bindinfo/owner/1dd582d8c835e4bd" create_revision:0 > success:<request_put:<key:"/tidb/bindinfo/owner/1dd582d8c835e4bd" value_size:36 lease:2149768264722801853 >> failure:<request_range:<key:"/tidb/bindinfo/owner/1dd582d8c835e4bd" > >>”] [response=] [error=“lease not found”]

[2022/09/28 16:23:11.564 +08:00] [WARN] [util.go:121] [“failed to apply request”] [took=50.647µs] [request=“header:<ID:11373140301580791615 > lease_revoke:id:5e5482d8c83919f7”] [response=size:31] [error=“lease not found”]

[2022/09/28 16:23:11.564 +08:00] [WARN] [util.go:121] [“failed to apply request”] [took=44.553µs] [request=“header:<ID:11373140301580791616 > lease_revoke:id:1dd582d8c835e4bd”] [response=size:31] [error=“lease not found”]

[2022/09/28 16:23:11.591 +08:00] [WARN] [util.go:121] [“failed to apply request”] [took=4.137µs] [request=“header:<ID:9985187191419172215 > put:<key:"/topology/tidb/172.21.0.4:4000/ttl" value_size:19 lease:6797201605192868870 >”] [response=] [error=“lease not found”]

[2022/09/28 16:23:13.622 +08:00] [WARN] [util.go:121] [“failed to apply request”] [took=4.459µs] [request=“header:<ID:11373140301580791626 > put:<key:"/topology/tidb/172.21.0.5:4000/ttl" value_size:19 lease:6797201605193770686 >”] [response=] [error=“lease not found”]

[2022/09/28 16:23:14.444 +08:00] [INFO] [stream.go:250] [“set message encoder”] [from=b0dbc9ae56478a92] [to=b0dbc9ae56478a92] [stream-type=“stream Message”]

[2022/09/28 16:23:14.444 +08:00] [INFO] [peer_status.go:51] [“peer became active”] [peer-id=ae7e6304d9ba5e54]

[2022/09/28 16:23:14.444 +08:00] [WARN] [stream.go:277] [“established TCP streaming connection with remote peer”] [stream-writer-type=“stream Message”] [local-member-id=b0dbc9ae56478a92] [remote-peer-id=ae7e6304d9ba5e54]

[2022/09/28 16:23:14.500 +08:00] [INFO] [stream.go:425] [“established TCP streaming connection with remote peer”] [stream-reader-type=“stream MsgApp v2”] [local-member-id=b0dbc9ae56478a92] [remote-peer-id=ae7e6304d9ba5e54]

[2022/09/28 16:23:14.532 +08:00] [INFO] [stream.go:250] [“set message encoder”] [from=b0dbc9ae56478a92] [to=b0dbc9ae56478a92] [stream-type=“stream MsgApp v2”]

[2022/09/28 16:23:14.532 +08:00] [WARN] [stream.go:277] [“established TCP streaming connection with remote peer”] [stream-writer-type=“stream MsgApp v2”] [local-member-id=b0dbc9ae56478a92] [remote-peer-id=ae7e6304d9ba5e54]

[2022/09/28 16:23:14.539 +08:00] [INFO] [raft.go:978] [“b0dbc9ae56478a92 [logterm: 7, index: 47782817, vote: 1089c682c7149dd5] rejected MsgPreVote from ae7e6304d9ba5e54 [logterm: 6, index: 47782708] at term 7”]

[2022

| username: TiDBer_DVnkdzKV | Original post link

Here are some of the logs from that time.

| username: TiDBer_DVnkdzKV | Original post link

All three PDs have the same configuration and are from the same cloud service provider. The other PD services do not have this issue.

| username: xfworld | Original post link

Maybe the network stability will improve… :joy:

| username: 特雷西-迈克-格雷迪 | Original post link

It should be an issue with the main selection.

| username: TiDBer_DVnkdzKV | Original post link

It appears every day, is there any solution?

| username: xfworld | Original post link

You should definitely consult your cloud service provider about the cloud service you purchased. Why is the network so unstable…

| username: TiDBer_DVnkdzKV | Original post link

Are there any quantifiable metrics for this thing? You need evidence when dealing with the service provider.

| username: xfworld | Original post link

Does the operator not provide you with monitoring information?

Then internally, you can only check if there are any network interruptions. You can try using Prometheus…