PD Server Role Switch Causes max_execution_time Parameter to Fail

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: pd server角色切换导致max_execution_time参数失效

| username: Hacker_Yv76YjBL

[TiDB Usage Environment] Production Environment
[TiDB Version] V5.3.3
[Encountered Problem: Problem Phenomenon and Impact]
The cluster is set with max_execution_time=3000. At 4:35 AM, the underlying network AVS component underwent a lossy upgrade, causing second-level network jitter, which led to a PD Server role switch. After the role switch, the cluster experienced a large number of slow queries exceeding 3 seconds, suspecting the max_execute_time parameter’s effectiveness.
Questions:
If the PD Server causes the parameter to become ineffective, when will this parameter take effect?
[Resource Configuration]
A large number of slow queries occurred after the PD Server role switch.


Through slow query analysis, it was found that the time spent obtaining timestamps on the PD increased.

Parameter settings

Before the PD Server switch, there were no slow queries exceeding 3 seconds. If they existed, they would be killed by the max_execute_time parameter.

[Attachments: Screenshots/Logs/Monitoring]
PD Server logs:
[2024/03/14 04:35:08.926 +08:00] [WARN] [grpc_service.go:1282] [“trying to update gc safe point”] [old-safe-point=448358726736543772] [new-safe-point=0]
[2024/03/14 04:35:12.678 +08:00] [WARN] [util.go:144] [“apply request took too long”] [took=999.920525ms] [expected-duration=100ms] [prefix=“read-only range “] [request=“key:"/tidb/server/info" range_end:"/tidb/server/infp" “] [response=] [error=“context deadline exceeded”]
[2024/03/14 04:35:12.678 +08:00] [INFO] [trace.go:145] [“trace[1056177004] range”] [detail=”{range_begin:/tidb/server/info; range_end:/tidb/server/infp; }”] [duration=1.000001586s] [start=2024/03/14 04:35:11.678 +08:00] [end=2024/03/14 04:35:12.678 +08:00] [steps=”["trace[1056177004] ‘agreement among raft nodes before linearized reading’ (duration: 999.919526ms)"]”]
[2024/03/14 04:35:13.005 +08:00] [WARN] [raft.go:1011] [“c533035a5e8fb2a1 stepped down to follower since quorum is not active”]
[2024/03/14 04:35:13.005 +08:00] [INFO] [raft.go:700] [“c533035a5e8fb2a1 became follower at term 10”]
[2024/03/14 04:35:13.005 +08:00] [INFO] [node.go:331] [“raft.node: c533035a5e8fb2a1 lost leader c533035a5e8fb2a1 at term 10”]
[2024/03/14 04:35:13.038 +08:00] [INFO] [server.go:1316] [“etcd leader changed, resigns pd leadership”] [old-pd-leader-name=pd-10.252.122.33-2379]
[2024/03/14 04:35:13.038 +08:00] [INFO] [lease.go:153] [“stop lease keep alive worker”] [purpose=“pd leader election”]
[2024/03/14 04:35:13.040 +08:00] [INFO] [raft.go:1365] [“c533035a5e8fb2a1 no leader at term 10; dropping index reading msg”]
[2024/03/14 04:35:13.135 +08:00] [INFO] [raft.go:1365] [“c533035a5e8fb2a1 no leader at term 10; dropping index reading msg”]
[2024/03/14 04:35:13.168 +08:00] [ERROR] [middleware.go:104] [“redirect but server is not leader”] [from=pd-10.252.122.32-2379] [server=pd-10.252.122.33-2379] [error=“[PD:apiutil:ErrRedirect]redirect failed”]
[2024/03/14 04:35:13.239 +08:00] [ERROR] [middleware.go:104] [“redirect but server is not leader”] [from=pd-10.252.122.32-2379] [server=pd-10.252.122.33-2379] [error=“[PD:apiutil:ErrRedirect]redirect failed”]
[2024/03/14 04:35:13.240 +08:00] [ERROR] [middleware.go:104] [“redirect but server is not leader”] [from=pd-10.252.122.33-2379] [server=pd-10.252.122.33-2379] [error=“[PD:apiutil:ErrRedirect]redirect failed”]
[2024/03/14 04:35:13.241 +08:00] [ERROR] [middleware.go:104] [“redirect but server is not leader”] [from=pd-10.252.122.31-2379] [server=pd-10.252.122.33-2379] [error=“[PD:apiutil:ErrRedirect]redirect failed”]
[2024/03/14 04:35:13.270 +08:00] [ERROR] [middleware.go:104] [“redirect but server is not leader”] [from=pd-10.252.122.32-2379] [server=pd-10.252.122.33-2379] [error=“[PD:apiutil:ErrRedirect]redirect failed”]
[2024/03/14 04:35:13.371 +08:00] [ERROR] [middleware.go:104] [“redirect but server is not leader”] [from=pd-10.252.122.32-2379] [server=pd-10.252.122.33-2379] [error=“[PD:apiutil:ErrRedirect]redirect failed”]
[2024/03/14 04:35:13.473 +08:00] [ERROR] [middleware.go:104] [“redirect but server is not leader”] [from=pd-10.252.122.32-2379] [server=pd-10.252.122.33-2379] [error=“[PD:apiutil:ErrRedirect]redirect failed”]
[2024/03/14 04:35:13.574 +08:00] [ERROR] [middleware.go:104] [“redirect but server is not leader”] [from=pd-10.252.122.32-2379] [server=pd-10.252.122.33-2379] [error=“[PD:apiutil:ErrRedirect]redirect failed”]
[2024/03/14 04:35:13.574 +08:00] [ERROR] [middleware.go:104] [“redirect but server is not leader”] [from=pd-10.252.122.32-2379] [server=pd-10.252.122.33-2379] [error=“[PD:apiutil:ErrRedirect]redirect failed”]
[2024/03/14 04:35:14.039 +08:00] [INFO] [cluster.go:373] [“coordinator is stopping”]
[2024/03/14 04:35:14.039 +08:00] [INFO] [cluster.go:329] [“metrics are reset”]
[2024/03/14 04:35:14.039 +08:00] [INFO] [coordinator.go:799] [“scheduler has been stopped”] [scheduler-name=balance-hot-region-scheduler] [error=“context canceled”]
[2024/03/14 04:35:14.039 +08:00] [INFO] [server.go:111] [“region syncer has been stopped”]
[2024/03/14 04:35:14.039 +08:00] [INFO] [coordinator.go:799] [“scheduler has been stopped”] [scheduler-name=balance-leader-scheduler] [error=“context canceled”]
[2024/03/14 04:35:14.039 +08:00] [INFO] [cluster.go:331] [“background jobs has been stopped”]
[2024/03/14 04:35:14.039 +08:00] [INFO] [coordinator.go:220] [“check suspect key ranges has been stopped”]
[2024/03/14 04:35:14.039 +08:00] [INFO] [coordinator.go:799] [“scheduler has been stopped”] [scheduler-name=balance-region-scheduler] [error=“context canceled”]
[2024/03/14 04:35:14.039 +08:00] [INFO] [cluster.go:351] [“statistics background jobs has been stopped”]
[2024/03/14 04:35:14.039 +08:00] [INFO] [coordinator.go:285] [“drive push operator has been stopped”]
[2024/03/14 04:35:14.039 +08:00] [INFO] [coordinator.go:110] [“patrol regions has been stopped”]
[2024/03/14 04:35:14.039 +08:00] [INFO] [cluster.go:369] [“coordinator has been stopped”]
[2024/03/14 04:35:14.039 +08:00] [INFO] [cluster.go:402] [“raftcluster is stopped”]
[2024/03/14 04:35:14.039 +08:00] [INFO] [tso.go:406] [“reset the timestamp in memory”]
[2024/03/14 04:35:14.039 +08:00] [ERROR] [member.go:160] [“no etcd leader, check pd leader later”] [error=“[PD:member:ErrEtcdLeaderNotFound]etcd leader not found”]
[2024/03/14 04:35:14.240 +08:00] [ERROR] [member.go:160] [“no etcd leader, check pd leader later”] [error=“[PD:member:ErrEtcdLeaderNotFound]etcd leader not found”]
[2024/03/14 04:35:14.311 +08:00] [WARN] [util.go:144] [“apply request took too long”] [took=999.954906ms] [expected-duration=100ms] [prefix=“read-only range “] [request=“key:"/tidb/server/info" range_end:"/tidb/server/infp" “] [response=] [error=“context deadline exceeded”]
[2024/03/14 04:35:14.312 +08:00] [INFO] [trace.go:145] [“trace[438333249] range”] [detail=”{range_begin:/tidb/server/info; range_end:/tidb/server/infp; }”] [duration=1.000021585s] [start=2024/03/14 04:35:13.311 +08:00] [end=2024/03/14 04:35:14.312 +08:00] [steps=”["trace[438333249] ‘agreement among raft nodes before linearized reading’ (duration: 999.953899ms)"]”]
[2024/03/14 04:35:14.327 +08:00] [WARN] [util.go:144] [“apply request took too long”] [took=1.000090807s] [expected-duration=100ms] [prefix=“read-only range “] [request=“key:"/tidb/server/info" range_end:"/tidb/server/infp" “] [response=] [error=“context deadline exceeded”]
[2024/03/14 04:35:14.328 +08:00] [INFO] [trace.go:145] [“trace[917517654] range”] [detail=”{range_begin:/tidb/server/info; range_end:/tidb/server/infp; }”] [duration=1.000146307s] [start=2024/03/14 04:35:13.327 +08:00] [end=2024/03/14 04:35:14.328 +08:00] [steps=”["trace[917517654] ‘agreement among raft nodes before linearized reading’ (duration: 1.000089774s)"]”]
[2024/03/14 04:35:14.440 +08:00] [ERROR] [member.go:160] [“no etcd leader, check pd leader later”] [error=“[PD:member:ErrEtcdLeaderNotFound]etcd leader not found”]
[2024/03/14 04:35:14.595 +08:00] [ERROR] [middleware.go:104] [“redirect but server is not leader”] [from=pd-10.252.122.32-2379] [server=pd-10.252.122.33-2379] [error=“[PD:apiutil:ErrRedirect]redirect failed”]
[2024/03/14 04:35:14.640 +08:00] [ERROR] [member.go:160] [“no etcd leader, check pd leader later”] [error=“[PD:member:ErrEtcdLeaderNotFound]etcd leader not found”]
[2024/03/14 04:35:14.696 +08:00] [ERROR] [middleware.go:104] [“redirect but server is not leader”] [from=pd-10.252.122.32-2379] [server=pd-10.252.122.33-2379] [error=“[PD:apiutil:ErrRedirect]redirect failed”]
[2024/03/14 04:35:14.797 +08:00] [ERROR] [middleware.go:104] [“redirect but server is not leader”] [from=pd-10.252.122.32-2379] [server=pd-10.252.122.33-2379] [error=“[PD:apiutil:ErrRedirect]redirect failed”]

| username: WalterWj | Original post link

I have changed it to a bug report. Please improve the description of the phenomenon.

| username: tidb菜鸟一只 | Original post link

I think you need to first check if there are any network issues between your current other nodes and PD.

| username: TiDBer_rvITcue9 | Original post link

The modification failed, right? Why did it still redirect the leader?

| username: TiDBer_5cwU0ltE | Original post link

Could it be due to frequent switching…

| username: oceanzhang | Original post link

Looking at the logs, PD takes more than a second, but in reality, it might take more than just a second for TiDB to correctly obtain the TSO.

| username: WalterWj | Original post link

You can follow this ISSUE: `max_execution_time` doesn't contain get tso time · Issue #51893 · pingcap/tidb · GitHub

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.