TiKV Fails to Connect to PD: "validate PD endpoints failed" in Self-Managed Kubernetes Deployment

Hi everyone,

I’m deploying a self-managed TiDB cluster on Kubernetes (offline, no NetworkPolicy) using the basic cluster example from PingCAP’s GitHub repo, with TiDB Operator version v1.6.1.

I’m encountering an issue where TiKV cannot connect to the PD endpoint.

TiKV log snippet:

[2025/05/28 07:09:46.898 +00:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=http://basic-pd:2379] [thread_id=1]
[2025/05/28 07:09:46.899 +00:00] [ERROR] [util.rs:634] ["validate PD endpoints failed"] [err="Other("[components/pd_client/src/util.rs:634]: …"]

PD log shows it’s advertising the client URL as:

advertise-client-urls="[http://basic-pd-0.basic-pd-peer.default.svc:2379]"

I’ve already tried updating the PD config to:

advertise-client-urls = "http://basic-pd:2379"

and confirmed via logs:

advertise-client-urls="[http://basic-pd:2379]"

But the error persists.

I’ve also checked:

kubectl get svc basic-pd exposes port 2379.

From inside a TiKV pod:

    curl http://basic-pd:2379/health responds OK.

    DNS resolution works via getent hosts basic-pd.

PD cluster seems healthy when queried via etcdctl inside the PD pod.

:pushpin: Question:

Why does TiKV still fail with validate PD endpoints failed even though PD is reachable and advertising the correct URL?

Could leftover PVC data or internal PD metadata still be causing an endpoint mismatch?

What’s the recommended workflow to cleanly reset a TiDB cluster deployment in Kubernetes to avoid these kinds of issues?

Any advice, known bugs for v1.6.1, or debugging steps I might have missed would be greatly appreciated :pray:

Here's a troubleshooting guide for the TiKV "validate PD endpoints failed" error in a self-managed TiDB cluster on Kubernetes, covering leftover PVC data, stale internal PD metadata, and TiDB Operator v1.6.1 specifics.


1. Root Cause Analysis

Background: ‘validate PD endpoints failed’ in TiKV

This error occurs when TiKV attempts to join or reconnect to the PD (Placement Driver) service but encounters endpoint verification failures. The most common root causes for this issue in Kubernetes-managed TiDB clusters are:

a) Stale PVC (Persistent Volume Claim) Data

  • Persistent Volume Claims are used for data storage across pod restarts and recreations.
  • If PVCs are not deleted after cluster teardown or before recreating a cluster with the same name, any pods (especially StatefulSet-managed ones like TiKV and PD) may reuse old data.
  • TiKV (and PD) stores unique cluster and store metadata on its disk (PV/PVC). If a new cluster is created but reuses an old PVC, TiKV may try to join a cluster using stale cluster IDs, resulting in endpoint mismatches and errors such as:

    "cluster id mismatch"
    or
    "validate PD endpoints failed"

  • This is especially problematic if the cluster is deleted but the PersistentVolumes use the "Retain" reclaim policy or the PVCs are never cleaned up; a quick check for leftovers is sketched below.
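As a quick check before redeploying, listing claims by the Operator's instance label (the same selector used in the cleanup commands further down) shows whether anything survived a previous cluster with the same name; the names below are taken from the question and should be adjusted to your deployment:

    # List any PVCs left over from a previous cluster named "basic" in the default namespace:
    kubectl get pvc -n default -l app.kubernetes.io/instance=basic

    # Any pd-* or tikv-* claims listed here will be reused by a new cluster with
    # the same name, carrying the old cluster/store IDs along with them.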

b) PD Metadata and Cluster Lifecycle

  • PD persists the overall cluster topology, including node addresses, store IDs, and cluster UUIDs.
  • If the PD state (either in the etcd backend or on its own disks) is not properly wiped, a new deployment may clash with leftover topology information, causing TiKV/PD registration validation to fail.
  • Re-deployments that reuse existing data while intending to "start fresh" will fail because of these mismatches if clean-up is incomplete; a quick way to inspect PD's current view is shown below.
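One way to see what PD currently believes is to query its HTTP API from inside the cluster (the members endpoint is also used in the debugging steps below); the cluster ID it returns should match the one TiKV prints while connecting. The paths are from PD's standard v1 API, and the service name is the one from the question:

    # What cluster does this PD think it is serving?
    curl http://basic-pd:2379/pd/api/v1/cluster

    # Which members and advertise URLs does it report?
    curl http://basic-pd:2379/pd/api/v1/members

If the cluster ID here differs from the one TiKV logs at startup, one side is running on leftover data.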

2. Step-by-Step Workflow for Clean, Effective Cluster Reset

To avoid cluster ID or endpoint issues, always follow a thorough cleanup and reinitialization process when tearing down or redeploying a TiDB cluster. Here is a safe, best-practice workflow:

a) Scale Down and Delete the TiDB Cluster

  1. Scale down all components (PD, TiKV, TiDB, TiFlash, etc.) to zero prior to deletion for a clean shutdown (optional but recommended):
    kubectl patch tc <cluster-name> -n <namespace> --type merge -p '{"spec":{"pd":{"replicas":0},"tikv":{"replicas":0},"tidb":{"replicas":0}}}'
    
  2. Delete the TidbCluster resource:
    kubectl delete tc <cluster-name> -n <namespace>
    
  3. Optionally, delete monitoring and backup/restore resources referencing this cluster for full cleanup.
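Before moving on, it is worth confirming that nothing from the old cluster is still running; a small check using the Operator's standard instance label (PVCs are handled in the next step, since kubectl get all does not include them):

    # Should return nothing once the TidbCluster and its workloads are gone:
    kubectl get all -n <namespace> -l app.kubernetes.io/instance=<cluster-name>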

b) Remove Stateful Data: PVC and PV

Critical step: make sure to delete all PVCs related to the cluster, especially for PD and TiKV:

  1. List and delete all cluster PVCs (replace <namespace> and <cluster-name>):

    kubectl delete pvc -n <namespace> -l app.kubernetes.io/instance=<cluster-name>
    

    Make sure all role-specific claims (PD, TiKV, TiFlash, etc.) are removed.

  2. Optionally patch and then delete PVs if the reclaim policy is not “Delete”:

    kubectl get pv -l app.kubernetes.io/instance=<cluster-name> -o name | \
      xargs -I {} kubectl patch {} -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'
    

    Then, delete the PVs manually if needed.
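Before re-creating anything, a quick sanity check with the same placeholders as above; both commands should come back empty for the old cluster:

    kubectl get pvc -n <namespace> -l app.kubernetes.io/instance=<cluster-name>
    kubectl get pv -l app.kubernetes.io/instance=<cluster-name>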

c) Remove the Namespace (if not used for other clusters)

  • For an absolutely fresh start, delete the entire namespace:
    kubectl delete namespace <namespace>
    
    This cleans up any stray namespaced resources such as Services, StatefulSets, or leftover custom resources.

d) Re-deploy the TiDB Cluster

  • After verifying all PVCs and PVs are gone, re-create the TidbCluster resource as usual.
  • This ensures each PD and TiKV pod gets a new, blank disk and cluster/store metadata is initialized accordingly.
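For the basic example from the PingCAP repo mentioned in the question, re-deployment is a single apply; the file name below matches that example and should be adjusted if your manifest lives elsewhere:

    # Re-create the TidbCluster from the basic example manifest:
    kubectl apply -f tidb-cluster.yaml -n <namespace>

    # Watch the pods come up on fresh volumes:
    kubectl get pods -n <namespace> -l app.kubernetes.io/instance=<cluster-name> -w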

Reference: Official guide on cluster destroy and cleanup


3. Version-Specific Tips, Debugging, and Prevention for v1.6.1

Key Recommendations for TiDB Operator v1.6.1

  • No endpoint-validation bugs are specifically documented for v1.6.1, so the usual causes (stale data, DNS) are the first things to rule out.
  • Newer Operator versions refine the rolling-update and config-update mechanics, but regardless of version, always clean up PVCs/PVs before reusing a cluster name.

Endpoint Validation Debugging Steps

  1. Check TiKV Pod logs to confirm the exact error:

    kubectl -n <namespace> logs <tikv-pod-name>
    

    Look for “validate PD endpoints failed”, “cluster id mismatch”, or other relevant errors.

  2. Verify PD health and endpoints:
    Exec into a TiKV pod or any available pod and try:

    curl http://<pd-service>:2379/pd/api/v1/members
    

    Confirm that the PD endpoints are correct and reachable from within the cluster.

  3. Double-check PVC usage:

    • Use kubectl get pvc,pv -n <namespace> and confirm that there are no leftover volumes from prior clusters.
  4. Inspect cluster metadata:
    If you see store/cluster ID mismatches, it’s almost always due to old data on the underlying PVs/PVCs, not a networking or configuration fault.

  5. Confirm unique hostnames and endpoints:
    Avoid running multiple clusters with the same name in the same namespace unless you’re using separate PVCs/resources.
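To tie steps 1 and 4 together, the cluster ID TiKV is trying to join can usually be pulled from its logs and compared with what PD reports; a rough sketch with placeholder names (run the curl from any pod that can reach the PD service, e.g. a TiKV pod, as in the checks above):

    # Cluster-ID related messages from TiKV:
    kubectl -n <namespace> logs <tikv-pod-name> | grep -i cluster

    # Cluster ID according to PD:
    curl http://<pd-service>:2379/pd/api/v1/cluster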

Cluster Reinstall/Reset Tips

  • NEVER just delete pods or only StatefulSets; always also clean up PVCs/PVs before redeploying with the same cluster name.
  • When using local PVs, ensure the underlying disk is formatted or cleared after PVC deletion, or Kubernetes might re-mount old data.
  • If using an external storage backend (e.g., NFS or cloud disks), manually delete any volumes left behind by a "Retain" reclaim policy if they are no longer needed.
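For the local-PV case in particular, deleting the PVC does not wipe the directory on the node. A minimal sketch, assuming the common layout where local volumes are mounted under /mnt/disks (the path is an assumption; adjust it to however your local volumes are provisioned):

    # On the node that backed the released local PV:
    sudo rm -rf /mnt/disks/<volume-dir>/*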

Operator/PD-specific nuances

  • If the cluster fails with DNS-related errors, try setting spec.pd.startUpScriptVersion: "v1" in your TidbCluster spec so the PD start script uses dig rather than nslookup for endpoint discovery, as recommended for newer PD/TiDB versions.
  • Always check that the advertised PD URLs in the TidbCluster CR match the DNS records exposed by Services and StatefulSets in Kubernetes.
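For reference, the startUpScriptVersion switch mentioned above sits directly under the pd section of the TidbCluster spec; a minimal fragment with only the relevant fields shown:

    spec:
      pd:
        startUpScriptVersion: "v1"   # switches the PD start script to dig-based discovery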

Summary Table — Troubleshooting and Clean Reset

  Action                                    | Command/Step                                                      | Notes
  Scale down cluster components             | kubectl patch tc <name> ...                                       | Graceful shutdown (optional but safe).
  Delete TidbCluster and related resources  | kubectl delete tc <name> ...                                      | Must be done before storage cleanup.
  Delete all PVCs and (optionally) PVs      | kubectl delete pvc -n <ns> -l app.kubernetes.io/instance=<name>   | Critical for avoiding stale-data issues.
  Patch/set PV reclaim policy if needed     | See step 2b above                                                 | Required if the PV reclaim policy was "Retain".
  Remove the namespace                      | kubectl delete namespace <namespace>                              | If you want absolutely no residue.
  Re-deploy the cluster                     | Apply the usual manifests/Helm chart                              | Clean initialization with no remnant data.


Conclusion: Preventing Future Endpoint Mismatches

  • Never re-use existing PVCs/PVs with a new, unrelated cluster deployment—always delete or replace volumes to avoid “validate PD endpoints failed” or “cluster id mismatch” errors.
  • Automate cleanup as part of your cluster destroy/initialize pipeline.
  • Before reusing a cluster name or namespace, always check for lingering PVCs or PVs.

By following the above analysis and workflow, you will ensure that all TiDB/TiKV/PD components initialize cleanly, preventing endpoint validation errors rooted in data residue.

Hi @hamid, could you try adding this environment variable to the TiKV spec of your TiDB cluster and see if it helps resolve the issue?

env:
- name: GRPC_DNS_RESOLVER
  value: native
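(In the TidbCluster manifest this entry goes under spec.tikv alongside the other TiKV settings; a minimal fragment, other fields omitted:)

    spec:
      tikv:
        env:
        - name: GRPC_DNS_RESOLVER
          value: native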

For your reference: tikv start with error: connect to pd failed, but check network is ok. · Issue #5372 · pingcap/tidb-operator · GitHub


Thank you @Hazel for your help!