TiDB PD service is always in a down state

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb pd服务一直是down的状态

| username: hacker_77powerful

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version]
[Reproduction Path] What operations were performed when the issue occurred
[Encountered Issue: Issue Phenomenon and Impact]
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachment: Screenshot/Log/Monitoring]
[FATAL] [main.go:232] ["run server failed"] [error="[PD:server:ErrCancelStartEtcd]etcd start canceled"] [stack="main.start
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:232
main.createServerWrapper
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:147
github.com/spf13/cobra.(*Command).execute
	/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:846
github.com/spf13/cobra.(*Command).ExecuteC
	/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:950
github.com/spf13/cobra.(*Command).Execute
	/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:887
main.main
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:56
runtime.main
	/usr/local/go/src/runtime/proc.go:250"]

[ERROR] [etcdutil.go:83] ["failed to get cluster from remote"] [error="[PD:etcd:ErrEtcdGetCluster]failed to get raft cluster member(s) from the given URLs: failed to get raft cluster member(s) from the given URLs"]
[2024/04/11 17:39:13.633 +08:00] [WARN] [server.go:2098] ["failed to publish local member to cluster through raft"] [local-member-id=b43ecfd4b44129fc] [local-member-attributes="{Name:pd-1 ClientURLs:[http://192.168.209.5:22379]}"] [request-path=/0/members/b43ecfd4b44129fc/attributes] [publish-timeout=11s] [error="etcdserver: request timed out"]

| username: TiDBer_jYQINSnf | Original post link

Is it a new cluster? It looks like the startup parameters for PD are configured incorrectly, particularly the URL part.
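For reference, a minimal sketch of the URL-related pd-server startup flags. Only the member name pd-1 and the client URL http://192.168.209.5:22379 come from the log above; the data directory and the 22380 peer port are placeholders, so adjust them to the actual deployment:

# Hypothetical pd-server invocation; everything except the name and
# client URL (taken from the warning log) is a placeholder.
pd-server --name=pd-1 \
    --data-dir=/data/pd \
    --client-urls="http://192.168.209.5:22379" \
    --peer-urls="http://192.168.209.5:22380" \
    --initial-cluster="pd-1=http://192.168.209.5:22380"

If these URLs don’t match what the other members have recorded for this node, etcd startup can time out in exactly the way the log shows.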

| username: TiDBer_jYQINSnf | Original post link

Isn’t it usually 2379? Did you make a mistake?

| username: hacker_77powerful | Original post link

There is no mistake, we used a custom port.

| username: hacker_77powerful | Original post link

It’s not a new cluster; it has been running for a while. One of the PD nodes has been in a down state because the file system is full.
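A quick way to confirm the full filesystem on the affected node, assuming /data/pd is the PD data directory (a placeholder path):

# Check how full the filesystem holding the PD data directory is.
df -h /data/pd
# See what is consuming the space underneath it.
du -sh /data/pd/*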

| username: TiDBer_jYQINSnf | Original post link

If you can’t connect to PD, check the network first. If the network is fine, run the following against a healthy PD node (via pd-ctl):

member

to see whether the down member still exists. If it does, remove it with:

member delete

then clear that PD node’s data directory and rebuild it. PD holds very little data, so rebuilding won’t take long.
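For concreteness, a sketch of that procedure with pd-ctl. Assumptions: <healthy-pd> stands for the address of a PD member that is still up (not the down node from the log), pd-1 is the name of the down member per the warning log, and /data/pd is a placeholder data directory:

# List members from a healthy PD node; <healthy-pd> is a placeholder host.
pd-ctl -u http://<healthy-pd>:22379 member
# If the down member is still listed, delete it by name (pd-1, per the log).
pd-ctl -u http://<healthy-pd>:22379 member delete name pd-1
# On the failed node, clear the PD data directory (placeholder path),
# then start the PD process again so it rejoins as a fresh member.
rm -rf /data/pd/*

If the cluster is managed with tiup, scaling the PD node in and back out (tiup cluster scale-in / scale-out) is the usual equivalent and handles the member bookkeeping for you.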

| username: hacker_77powerful | Original post link

It might be that the PD data directory was not cleared. I’ll try again tomorrow. Thank you.

| username: dba远航 | Original post link

Insufficient disk space will certainly cause the server to misbehave. Try clearing out unused files.
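For example, to find the biggest cleanup candidates on the full filesystem (the mount point is a placeholder):

# Rank the largest entries under the full mount point.
du -ah /data 2>/dev/null | sort -rh | head -n 20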