The PD leader cannot be found, causing the cluster to be unavailable

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: pd的leader找不到,导致集群不可用

| username: 中国电信TIKV

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version]
[Reproduction Path] What operations were performed when the issue occurred
In a JuiceFS + TiKV architecture, while the daily tasks were running, the PD leader could not be found, which made the cluster unavailable.
[Encountered Issue: Problem Phenomenon and Impact]
The PD log reports the error "load from etcd meet error". The first error occurred at 11:29 and was followed by a period in which memory usage and the number of goroutines surged. In the end the cluster could not recover automatically, and we had to restart it manually.
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]


| username: 中国电信TIKV | Original post link

This has happened many times in our pre-production environment.

| username: 中国电信TIKV | Original post link

Because we saw that v7.1.0 included a fix for this bug, we upgraded to that version. However, the upgrade does not appear to have solved our problem.

| username: Billmay表妹 | Original post link

Enter TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page.

| username: 中国电信TIKV | Original post link

Thank you for your reply. However, we currently cannot log in to the dashboard because of security restrictions and would need to apply for access. Which specific monitoring graph would you like to see? I can take a screenshot from Grafana for you.

| username: pingyu | Original post link

The logs show etcd read/write timeouts. Check whether the machine's CPU or disk is overloaded. The subsequent surge in memory and goroutines might be a cascading failure triggered by the earlier timeouts.
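One rough way to check whether the disk is the bottleneck is to time fsync calls on the volume that holds the PD data directory, since the etcd embedded in PD syncs its write-ahead log to disk on every commit. The following is only a minimal sketch: the function names, the 4 KiB payload, and the number of rounds are illustrative assumptions, and it is not a substitute for the disk and etcd panels in Grafana. The etcd documentation suggests that the 99th percentile of WAL fsync duration should stay below roughly 10 ms, so values far above that on an otherwise idle machine would point at the storage.

```python
import math
import os
import statistics
import sys
import tempfile
import time


def measure_fsync_latency(directory, rounds=50, payload=b"x" * 4096):
    """Write `payload` and fsync it `rounds` times inside `directory`.

    Returns the list of fsync latencies in milliseconds. The probe file
    is temporary and is removed afterwards.
    """
    latencies_ms = []
    fd, path = tempfile.mkstemp(dir=directory, prefix="fsync_probe_")
    try:
        for _ in range(rounds):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # force the write down to the device
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.remove(path)
    return latencies_ms


def summarize(latencies_ms):
    """Return median, nearest-rank p99, and max of the latencies."""
    ordered = sorted(latencies_ms)
    p99_index = max(0, math.ceil(0.99 * len(ordered)) - 1)
    return {
        "median_ms": statistics.median(ordered),
        "p99_ms": ordered[p99_index],
        "max_ms": ordered[-1],
    }


if __name__ == "__main__":
    # Pass the PD data directory as the first argument; the system
    # temp directory is only a fallback for a quick dry run.
    target = sys.argv[1] if len(sys.argv) > 1 else tempfile.gettempdir()
    print(summarize(measure_fsync_latency(target)))
```

Run it while the daily tasks are active and again while the machine is idle; a large gap between the two runs would support the theory that the tasks are starving PD of disk I/O.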

| username: 中国电信TIKV | Original post link

Here are some additional logs: pd.log and pd_stderr.log. We will keep analyzing the issue and report back if we capture anything further.
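To line the errors up with the Grafana graphs, it can help to count how often "load from etcd meet error" appears in pd.log per minute. The sketch below assumes the log follows the usual TiDB unified log format, in which each line starts with a bracketed timestamp such as [2023/07/05 11:29:03.456 +08:00] followed by a bracketed level; the function name and the regular expression are illustrative, not part of any PD tooling.

```python
import re
import sys
from collections import Counter

# Assumed line prefix: [YYYY/MM/DD HH:MM:SS.mmm +ZZ:ZZ] [LEVEL] ...
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\.\d+ [^\]]+\] "
    r"\[(?P<level>[A-Z]+)\]"
)


def count_matches_per_minute(lines, needle="load from etcd meet error"):
    """Return (first_timestamp, Counter keyed by 'YYYY/MM/DD HH:MM').

    Lines that do not contain `needle` or do not start with the
    expected timestamp prefix are skipped.
    """
    per_minute = Counter()
    first_seen = None
    for line in lines:
        if needle not in line:
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        ts = match.group("ts")
        if first_seen is None:
            first_seen = ts
        per_minute[ts[:16]] += 1  # truncate seconds to bucket by minute
    return first_seen, per_minute


if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        first, buckets = count_matches_per_minute(handle)
    print("first occurrence:", first)
    for minute, count in sorted(buckets.items()):
        print(minute, count)
```

If the per-minute counts rise just before the memory and goroutine surge, that would be consistent with the cascading-failure explanation given above.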

| username: Kongdom | Original post link

You can check it that way as well. This isn't a mixed deployment, is it?

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.