kill -9 of pd-server leader causes 18-second IO interruption

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: kill -9 pd-server leader IO 中断 18s

| username: TiDBer_Lm1H3bCW

[TiKV Usage Environment]
The cluster consists of 3 physical nodes, each node running a mixed deployment of one TiKV service and one PD service. Fault testing is conducted using go-ycsb.

[TiKV Version]
v6.5.1

[Reproduction Path]
With all cluster services healthy, run go-ycsb against the cluster. After a period of time, kill -9 the pd-server leader; this results in an IO interruption of around 18 seconds.
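The fault test above can be sketched as a few shell helpers. The go-ycsb invocation, PD address, cluster version, and workload file are placeholders for this environment, not details taken from the post:

```shell
#!/usr/bin/env bash
# Sketch of the fault test; addresses and workload parameters are assumptions.

PD_ADDR="${PD_ADDR:-127.0.0.1:2379}"   # any live PD member (placeholder)

# Drive load through TiDB with go-ycsb over the MySQL protocol.
run_workload() {
  ./go-ycsb run mysql -P workloads/workloada \
    -p mysql.host=127.0.0.1 -p mysql.port=4000 \
    -p operationcount=1000000
}

# Ask pd-ctl which member is the current PD leader.
show_pd_leader() {
  tiup ctl:v6.5.1 pd -u "http://${PD_ADDR}" member leader show
}

# Run on the leader node: kill the process without letting it
# resign leadership gracefully.
kill_pd_leader() {
  kill -9 "$(pidof pd-server)"
}
```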


[Attachments: Screenshots/Logs/Monitoring]
Observing pd.log, a re-election starts 3 seconds after killing the PD leader.

The leader election is completed within 1 second.

However, it then takes another 9 seconds for the leadership to be updated after the election completes. We think this duration is too long but are unsure how to tune it. Please advise.
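For reference, the phases visible in pd.log account for about 13 of the 18 seconds; the post does not show where the remaining ~5 seconds go. A quick sum:

```shell
# Back-of-envelope view of the pd.log timeline (seconds).
detect=3     # kill -9 until re-election starts
elect=1      # election completes
update=9     # leadership update after the election
total=$((detect + elect + update))
echo "accounted for in pd.log: ${total}s of the ~18s interruption"
```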

| username: Billmay表妹 | Original post link

From the information you provided: after the PD leader is killed, re-election starts after 3 seconds and completes within 1 second, but it then takes another 9 seconds for the leadership to be updated.

This may be because the PD leader's lease time is too long. The lease is set through the lease parameter in the PD configuration file; the default is 3 seconds. If the lease in your cluster has been increased, then after the PD leader is killed and a new leader is elected, the leadership cannot be updated until the old lease expires.

You can try setting the lease to a shorter value, such as 3 seconds, to reduce the leadership update time. For example, in the PD configuration file:

```toml
[raft]
...
# PD leader lease time (in seconds)
lease = 3
...
```

After modifying the configuration file, you need to restart the PD process for the changes to take effect.
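Assuming the cluster is managed with tiup (the post does not say how it was deployed), the edit-and-restart step could look like the sketch below; the cluster name is a placeholder:

```shell
# Edit the topology (set lease under the PD server config in $EDITOR),
# then roll only the PD nodes so the change takes effect.
apply_pd_lease() {
  local cluster="${1:-test}"            # placeholder cluster name
  tiup cluster edit-config "$cluster"   # opens topology for editing
  tiup cluster reload "$cluster" -R pd  # restart PD components only
}
```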

Additionally, if the PD leader in your cluster is frequently killed, it is recommended to check whether the network environment and hardware resources of the cluster meet the requirements of the TiDB cluster, and whether there are other issues causing the PD leader to be frequently killed.

| username: TiDBer_Lm1H3bCW | Original post link

The current PD configuration has the lease set to 3s:

More detailed configuration information can be found in the attachment:
pd-config (6.6 KB)

If I try to set raft.lease, it reports an invalid configuration. Is it true that only the top-level lease parameter is available in v6.5.1?

After confirming that the current lease for all nodes is 3, I restarted all PD servers and retested by killing the PD leader with kill -9. The phenomenon is still an 18s IO interruption. Are there any other parameters that need to be adjusted?
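One way to double-check the lease value each PD member is actually running with is PD's HTTP config endpoint; the member addresses below are placeholders:

```shell
# Query the effective config from every PD member and extract the lease.
check_lease() {
  local addr
  for addr in 10.0.0.1:2379 10.0.0.2:2379 10.0.0.3:2379; do
    echo -n "${addr}: "
    curl -s "http://${addr}/pd/api/v1/config" | grep -o '"lease": *[0-9]*'
  done
}
```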

| username: h5n1 | Original post link

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.