【TiDB Environment】Test Environment
【TiDB Version】Both 6.1 & 5.4
【Encountered Issue】When deploying the TiDB cluster, the basic-pd-0 Pod container fails and restarts after 30 seconds. The logs show that the domain name is being concatenated incorrectly, with the following error:
** server can't find basic-pd-0.basic-pd-peer.tidb-cluster.svc.cluster.local.basic-pd-peer.tidb-cluster.svc: NXDOMAIN
nslookup domain basic-pd-0.basic-pd-peer.tidb-cluster.svc.cluster.local.basic-pd-peer.tidb-cluster.svc failed
Using startUpScriptVersion: "v1" results in a similar error:
domain resolve basic-pd-0.basic-pd-peer.tidb-cluster.svc.cluster.local.basic-pd-peer.tidb-cluster.svc no record return
This error likely originates in manager/member/template.go, around line 120, in the pdStartScriptTpl template, where the domain variable is concatenated incorrectly. The expected and actual results are:
basic-pd-0.basic-pd-peer.tidb-cluster.svc # Correct result
basic-pd-0.basic-pd-peer.tidb-cluster.svc.cluster.local.basic-pd-peer.tidb-cluster.svc # Actual incorrect result
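To illustrate how the doubled name could arise, here is a minimal sketch. It assumes the script builds the domain as pod name, peer service, namespace, then "svc"; build_domain and its parameter names are hypothetical and not taken from the operator source. If the pod-name part is already fully qualified, the concatenation yields exactly the doubled name shown above.

```shell
#!/bin/sh
# Hypothetical helper: joins the parts into <pod>.<peer-service>.<namespace>.svc.
# This mirrors the concatenation described in the post; it is not the operator's code.
build_domain() {
    pod_name=$1
    peer_service=$2
    namespace=$3
    printf '%s.%s.%s.svc\n' "$pod_name" "$peer_service" "$namespace"
}

# Short pod name: produces the expected domain.
build_domain "basic-pd-0" "basic-pd-peer" "tidb-cluster"

# Pod name that already carries the service, namespace, and cluster suffixes:
# produces the doubled domain from the error message.
build_domain "basic-pd-0.basic-pd-peer.tidb-cluster.svc.cluster.local" "basic-pd-peer" "tidb-cluster"
```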
Is this a bug? Has anyone encountered this issue before?
【TiDB Operator Version】: 1.3.7
【K8s Version】: 1.20 & containerd 1.2.10
The current K8s environment uses CoreDNS, and the prepared environment itself should have no issues.
The tc.clusterDomain field is not set, and the tc configuration should be identical to the quickstart.
I have now found the issue. In the manager/member/template.go file, in the pd startup script pdStartScriptTpl (after the recent GitHub update, it is in the tidb-operator/charts/tidb-cluster/templates/scripts/_start_pd.sh.tpl file), there is this line:
POD_NAME=${POD_NAME:-$HOSTNAME}
If kubelet does not inject $POD_NAME into the container, the script falls back to $HOSTNAME. In my environment, $HOSTNAME holds not just the container's short hostname but also the service, namespace, and cluster-domain suffixes, so the resulting domain is wrong. This line should be changed to:
POD_NAME=${POD_NAME:-$(hostname)}
After making this change, the result is now correct on my end.
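As an alternative sketch (not something from the operator itself), the fallback could also strip everything after the first dot, so it would not matter whether $HOSTNAME or the hostname command returns a fully qualified name. The helper short_name is hypothetical, and this assumes the pod name itself contains no dots, which holds for names like basic-pd-0.

```shell
#!/bin/sh
# Hypothetical helper (not part of TiDB Operator): keep only the first DNS label,
# i.e. drop everything from the first dot onward.
short_name() {
    printf '%s\n' "${1%%.*}"
}

# Prefer an injected POD_NAME; otherwise derive it from HOSTNAME,
# falling back to the hostname command if HOSTNAME is unset or empty.
POD_NAME=${POD_NAME:-$(short_name "${HOSTNAME:-$(hostname)}")}
echo "$POD_NAME"
```

With HOSTNAME set to basic-pd-0.basic-pd-peer.tidb-cluster.svc.cluster.local and POD_NAME unset, this would leave POD_NAME as basic-pd-0.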
Hi, can you provide your tc definition yaml and pod yaml output?
Also, kubectl -nkube-system get cm/coredns -oyaml --export
and kubectl -ntidb-test exec -ti tidb4012-pd-0 -- cat /etc/resolv.conf
I couldn’t reproduce your issue locally.
I noticed the word “ali” in your YAML file. Could you please explain how your environment was created? How were the TiDB operator and TiDB cluster deployed?