Why can't TiDB v4.0.14 provide service when the PD leader node's data disk is full, and why doesn't the leader automatically switch?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB v4.0.14 — the PD leader node cannot provide service when its data disk is full; why didn't the leader switch automatically?

| username: kkpeter

[TiDB Usage Environment] Production Environment / Testing / PoC
TiDB v4.0.14: when its data disk filled up, the PD leader node could no longer provide service, and the other PD nodes did not elect a new leader, leaving the cluster completely unresponsive for 20 minutes.

The cluster returned to normal after the PD leader node was restored.

[TiDB Version]
TiDB v4.0.14
[Reproduction Path] Operations performed that led to the issue
[Encountered Issue: Problem Phenomenon and Impact]
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]

| username: kkpeter | Original post link

In this situation the PD leader node should be treated as having gone down. With three PDs, the other two should be able to elect a new leader on their own and continue providing service.

| username: xfworld | Original post link

There are only three PDs and one of them has already failed, so in reality only 2 PDs are alive. That cannot satisfy the election scenario; it can only guarantee that the PD data stays consistent at this point and is not lost.

If you can add a PD node promptly, an election will be initiated, and the election itself also takes some time.

If it is a production environment, you need to evaluate how long the entire cycle will take to restore service.

| username: kkpeter | Original post link

In other words, to ensure the cluster can still elect a leader normally and keep providing service after the PD leader goes down, at least 4 to 5 PD nodes are needed.

| username: kkpeter | Original post link

If the PD leader node has crashed and the PD cluster is already abnormal, can a new node still be added by joining at that point?

| username: xfworld | Original post link

Generally, it depends on the level of high availability you need. For example, TiKV can be deployed as 5 nodes with 3 replicas, or 5 nodes with 5 replicas.

PD can also refer to this kind of planning.

As for the PD cluster not functioning properly, you can try a scale-out to see if it works! If that doesn't work, you can consider using pd-recover for recovery, but this approach is relatively risky.

| username: kkpeter | Original post link

So it seems that planning for 3 PD nodes is actually not highly available.

| username: xfworld | Original post link

The requirements are different :see_no_evil:

| username: xfworld | Original post link

It can be considered highly available: no data is lost, recovery is easy, and service can continue to be provided.

| username: kkpeter | Original post link

The principle of scaling out a PD cluster is to join a new node into the existing cluster. I really doubt that this can be successful in this state, but I will test it.

As for using pd-recover to recover, it can’t be considered highly available.

I think there might still be some issues with your explanation.

| username: kkpeter | Original post link

PD has three nodes and only one of them is down, not two. Why doesn't that meet the election conditions?

| username: xfworld | Original post link

No worries, practice makes perfect~ :cowboy_hat_face:

| username: kkpeter | Original post link

Is there any official team member who can come out and provide an answer? :pray:

| username: erwadba | Original post link

If the PD node's port is still open, the heartbeats between nodes are unaffected, so it cannot be considered down.

| username: xfworld | Original post link

etcd uses the Raft protocol by default, and Raft requires a majority of instances to agree to complete the election process, which is why an odd number of instances is deployed.

As for why Raft is designed this way, you can refer to relevant materials for more information.
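
For reference, the majority ("quorum") arithmetic behind this is simple: a Raft group of n voting members needs floor(n/2) + 1 of them reachable to elect a leader and commit writes, so 3 members tolerate 1 failure and 5 tolerate 2. A minimal sketch of that arithmetic (illustration only, not PD code):

```go
package main

import "fmt"

// quorum returns the minimum number of voting members that must be
// reachable for a Raft group of size n to elect a leader and commit writes.
func quorum(n int) int {
	return n/2 + 1
}

func main() {
	for _, n := range []int{1, 3, 5} {
		q := quorum(n)
		fmt.Printf("members: %d  quorum: %d  tolerated failures: %d\n", n, q, n-q)
	}
}
```

By this arithmetic, losing one of three PD members still leaves a quorum; as the replies below point out, the question is whether the disk-full leader was ever treated as failed at all.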

| username: kkpeter | Original post link

This is similar to what we speculated, and this explanation seems quite reliable.

The real question now is whether PD should add more conditions for assessing its own availability and relinquish the leader position promptly. After all, two nodes are still available at that point, which means the majority may well be functioning normally. If the leader switch succeeds, the cluster can continue serving normally and recovery would be faster, right?

| username: songxuecheng | Original post link

In your case, it might be that the data disk was full but the heartbeat on the network port was still alive, preventing a leader election from being triggered and requiring manual intervention.

You can also check the logs for leader-election activity to determine whether an election was never triggered or was triggered but failed.

| username: tomsence | Original post link

If the disk is full but the network heartbeat is still present, the node cannot be considered crashed. The best approach is still monitoring and alerting so that the issue is handled promptly.
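
A minimal standalone sketch of such a check (illustration only; this is not part of TiDB, and the data-directory path and threshold below are assumptions to adapt to your own deployment):

```go
package main

import (
	"fmt"
	"log"
	"syscall"
)

// diskUsedPercent reports how full the filesystem containing path is.
func diskUsedPercent(path string) (float64, error) {
	var fs syscall.Statfs_t
	if err := syscall.Statfs(path, &fs); err != nil {
		return 0, err
	}
	total := float64(fs.Blocks) * float64(fs.Bsize)
	avail := float64(fs.Bavail) * float64(fs.Bsize)
	return (total - avail) / total * 100, nil
}

func main() {
	const pdDataDir = "/data/pd" // hypothetical PD data directory
	used, err := diskUsedPercent(pdDataDir)
	if err != nil {
		log.Fatal(err)
	}
	if used > 80 { // example alert threshold
		fmt.Printf("WARNING: PD data disk is %.1f%% full\n", used)
	}
}
```

This Unix-only check (syscall.Statfs) can be wired into whatever alerting already runs on the PD hosts, so a filling disk is caught before the leader stops serving.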

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.