The questioner is a colleague, and I will supplement the current observed timeline.
- 2023/01/14 00:09:54
pd follower 61 and 63 reported alarms and logs, continuously for more than ten times, once per second
[2023/01/14 00:09:54.890 +08:00] [WARN] [manager.go:102] ["failed to reload persist options"]
[2023/01/14 00:09:55.890 +08:00] [WARN] [manager.go:102] ["failed to reload persist options"]
[2023/01/14 00:09:56.890 +08:00] [WARN] [manager.go:102] ["failed to reload persist options"]
[2023/01/14 00:09:57.890 +08:00] [WARN] [manager.go:102] ["failed to reload persist options"]
[2023/01/14 00:09:58.890 +08:00] [WARN] [manager.go:102] ["failed to reload persist options"]
[2023/01/14 00:09:59.890 +08:00] [WARN] [manager.go:102] ["failed to reload persist options"]
[2023/01/14 00:10:00.890 +08:00] [WARN] [manager.go:102] ["failed to reload persist options"]
...
- 2023/01/14 00:14:00
pd leader monitoring, grpc requests wait and tso exceed 10 seconds, and the tso rate suddenly increased to 500qps (previously it was 0)
Partial logs are as follows:
[2023/01/14 00:11:50.857 +08:00] [INFO] [grpc_service.go:1353] ["update service GC safe point"] [service-id=ticdc] [expire-at=1673712710] [safepoint=438731094848176136]
[2023/01/14 00:12:50.860 +08:00] [INFO] [grpc_service.go:1353] ["update service GC safe point"] [service-id=ticdc] [expire-at=1673712770] [safepoint=438731110589661194]
[2023/01/14 00:13:51.010 +08:00] [INFO] [grpc_service.go:1353] ["update service GC safe point"] [service-id=ticdc] [expire-at=1673712830] [safepoint=438731125492547608]
[2023/01/14 00:13:54.535 +08:00] [INFO] [periodic.go:135] ["starting auto periodic compaction"] [revision=39749914] [compact-period=1h0m0s]
[2023/01/14 00:13:54.536 +08:00] [INFO] [index.go:189] ["compact tree index"] [revision=39749914]
[2023/01/14 00:13:54.537 +08:00] [INFO] [periodic.go:146] ["completed auto periodic compaction"] [revision=39749914] [compact-period=1h0m0s] [took=1h0m0.004980515s]
[2023/01/14 00:13:55.392 +08:00] [INFO] [kvstore_compaction.go:55] ["finished scheduled compaction"] [compact-revision=39749914] [took=855.063741ms]
[2023/01/14 00:14:51.209 +08:00] [INFO] [grpc_service.go:1353] ["update service GC safe point"] [service-id=ticdc] [expire-at=1673712891] [safepoint=438731142073417748]
[2023/01/14 00:14:55.959 +08:00] [INFO] [grpc_service.go:1353] ["update service GC safe point"] [service-id=gc_worker] [expire-at=9223372036854775807] [safepoint=4387084949
02493184]
[2023/01/14 00:16:38.913 +08:00] [INFO] [grpc_service.go:1287] ["updated gc safe point"] [safe-point=438708494902493184]
[2023/01/14 00:20:22.006 +08:00] [INFO] [operator_controller.go:440] ["add operator"] [region-id=46845377] [operator="\"transfer-hot-read-leader {transfer leader: store 5 t
o 4} (kind:hot-region,leader, region:46845377(20817,41987), createAt:2023-01-14 00:20:22.006599095 +0800 CST m=+2034391.895374284, startAt:0001-01-01 00:00:00 +0000 UTC, cu
rrentStep:0, steps:[transfer leader from store 5 to store 4])\""] [additional-info=]
[2023/01/14 00:20:22.006 +08:00] [INFO] [operator_controller.go:638] ["send schedule command"] [region-id=46845377] [step="transfer leader from store 5 to store 4"] [source
=create]
[2023/01/14 00:20:22.008 +08:00] [INFO] [region.go:549] ["leader changed"] [region-id=46845377] [from=5] [to=4]
[2023/01/14 00:20:22.008 +08:00] [INFO] [operator_controller.go:555] ["operator finish"] [region-id=46845377] [takes=1.586244ms] [operator="\"transfer-hot-read-leader {tran
sfer leader: store 5 to 4} (kind:hot-region,leader, region:46845377(20817,41987), createAt:2023-01-14 00:20:22.006599095 +0800 CST m=+2034391.895374284, startAt:2023-01-14
00:20:22.006833891 +0800 CST m=+2034391.895609077, currentStep:1, steps:[transfer leader from store 5 to store 4]) finished\""] [additional-info=]
[2023/01/14 00:21:46.234 +08:00] [INFO] [operator_controller.go:440] ["add operator"] [region-id=46975492] [operator="\"transfer-hot-write-leader {transfer leader: store 4
to 7} (kind:hot-region,leader, region:46975492(15735,3806), createAt:2023-01-14 00:21:46.234616792 +0800 CST m=+2034476.123391980, startAt:0001-01-01 00:00:00 +0000 UTC, cu
rrentStep:0, steps:[transfer leader from store 4 to store 7])\""] [additional-info=]
[2023/01/14 00:21:46.234 +08:00] [INFO] [operator_controller.go:638] ["send schedule command"] [region-id=46975492] [step="transfer leader from store 4 to store 7"] [source
=create]
[2023/01/14 00:21:46.237 +08:00] [INFO] [region.go:549] ["leader changed"] [region-id=46975492] [from=4] [to=7]
[2023/01/14 00:21:46.237 +08:00] [INFO] [operator_controller.go:555] ["operator finish"] [region-id=46975492] [takes=2.933989ms] [operator="\"transfer-hot-write-leader {tra
nsfer leader: store 4 to 7} (kind:hot-region,leader, region:46975492(15735,3806), createAt:2023-01-14 00:21:46.234616792 +0800 CST m=+2034476.123391980, startAt:2023-01-14
00:21:46.234828062 +0800 CST m=+2034476.123603264, currentStep:1, steps:[transfer leader from store 4 to store 7]) finished\""] [additional-info=]
[2023/01/14 00:24:55.935 +08:00] [INFO] [grpc_service.go:1353] ["update service GC safe point"] [service-id=gc_worker] [expire-at=9223372036854775807] [safepoint=4387086521
89155328]
[2023/01/14 00:26:38.864 +08:00] [INFO] [grpc_service.go:1287] ["updated gc safe point"] [safe-point=438708652189155328]
[2023/01/14 00:34:55.959 +08:00] [INFO] [grpc_service.go:1353] ["update service GC safe point"] [service-id=gc_worker] [expire-at=9223372036854775807] [safepoint=4387088094
75293184]
[2023/01/14 00:36:38.860 +08:00] [INFO] [grpc_service.go:1287] ["updated gc safe point"] [safe-point=438708809475293184]
[2023/01/14 00:39:22.613 +08:00] [WARN] [allocator_manager.go:608] ["failed to update allocator's timestamp"] [dc-location=global] [error="[PD:etcd:ErrEtcdTxnConflict]etcd
transaction failed, conflicted and rolled back"]
[2023/01/14 00:39:22.613 +08:00] [INFO] [tso.go:406] ["reset the timestamp in memory"]
[2023/01/14 00:39:22.614 +08:00] [INFO] [lease.go:153] ["stop lease keep alive worker"] [purpose="pd leader election"]
[2023/01/14 00:39:22.643 +08:00] [INFO] [server.go:1319] ["no longer a leader because lease has expired, pd leader will step down"]
[2023/01/14 00:39:22.644 +08:00] [INFO] [coordinator.go:110] ["patrol regions has been stopped"]
[2023/01/14 00:39:22.644 +08:00] [INFO] [coordinator.go:799] ["scheduler has been stopped"] [scheduler-name=balance-hot-region-scheduler] [error="context canceled"]
[2023/01/14 00:39:22.644 +08:00] [INFO] [coordinator.go:285] ["drive push operator has been stopped"]
[2023/01/14 00:39:22.644 +08:00] [INFO] [cluster.go:331] ["metrics are reset"]
[2023/01/14 00:39:22.644 +08:00] [INFO] [server.go:119] ["region syncer has been stopped"]
[2023/01/14 00:39:22.644 +08:00] [INFO] [coordinator.go:799] ["scheduler has been stopped"] [scheduler-name=balance-leader-scheduler] [error="context canceled"]
[2023/01/14 00:39:22.644 +08:00] [INFO] [cluster.go:333] ["background jobs has been stopped"]
[2023/01/14 00:39:22.644 +08:00] [INFO] [coordinator.go:220] ["check suspect key ranges has been stopped"]
[2023/01/14 00:39:22.644 +08:00] [INFO] [cluster.go:353] ["statistics background jobs has been stopped"]
[2023/01/14 00:39:22.644 +08:00] [INFO] [cluster.go:375] ["coordinator is stopping"]
[2023/01/14 00:39:22.644 +08:00] [INFO] [coordinator.go:799] ["scheduler has been stopped"] [scheduler-name=balance-region-scheduler] [error="context canceled"]
[2023/01/14 00:39:22.645 +08:00] [INFO] [cluster.go:371] ["coordinator has been stopped"]
[2023/01/14 00:39:22.645 +08:00] [INFO] [cluster.go:404] ["raftcluster is stopped"]
[2023/01/14 00:39:22.645 +08:00] [INFO] [tso.go:406] ["reset the timestamp in memory"]
[2023/01/14 00:39:22.646 +08:00] [ERROR] [member.go:167] ["getting pd leader meets error"] [error="[PD:proto:ErrProtoUnmarshal]proto: illegal wireType 7"]
[2023/01/14 00:39:22.847 +08:00] [ERROR] [member.go:167] ["getting pd leader meets error"] [error="[PD:proto:ErrProtoUnmarshal]proto: illegal wireType 7"]
[2023/01/14 00:39:23.048 +08:00] [ERROR] [member.go:167] ["getting pd leader meets error"] [error="[PD:proto:ErrProtoUnmarshal]proto: illegal wireType 7"]
[2023/01/14 00:39:23.249 +08:00] [ERROR] [member.go:167] ["getting pd leader meets error"] [error="[PD:proto:ErrProtoUnmarshal]proto: illegal wireType 7"]
[2023/01/14 00:39:23.449 +08:00] [ERROR] [member.go:167] ["getting pd leader meets error"] [error="[PD:proto:ErrProtoUnmarshal]proto: illegal wireType 7"]
- 2023/01/14 00:15:25
cdc reported an error (synchronizing data to the backup tidb), as shown in the figure below