[Urgent!] k8s PD startup failure (non-tiup)

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 【紧急!!!】k8s pd启动失败(非tiup)

| username: atidat

Preface: After applying crd.yaml and operator.yaml, the CRD was rebuilt. The cluster ran normally for about 20 hours, and then PD crashed.
Question: How can I find out why communication between the PDs failed, and what is the fix? Please do not suggest switching to tiup or deleting the operator and rebuilding from scratch; the cost is too high. 555 -.-

【TiDB Usage Environment】Production
【TiDB Version】v5.2.1
【Problem Encountered】
【Reproduction Path】What operations were performed that led to the problem
【Problem Phenomenon and Impact】

[2022/08/25 07:20:22.565 +00:00] [WARN] [stream.go:277] ["established TCP streaming connection with remote peer"] [stream-writer-type="stream Message"] [local-member-id=caab82c67f3f4ad1] [remote-peer-id=6b27cfc0d7490063]
[2022/08/25 07:20:22.565 +00:00] [INFO] [stream.go:250] ["set message encoder"] [from=caab82c67f3f4ad1] [to=caab82c67f3f4ad1] [stream-type="stream MsgApp v2"]
[2022/08/25 07:20:22.565 +00:00] [WARN] [stream.go:277] ["established TCP streaming connection with remote peer"] [stream-writer-type="stream MsgApp v2"] [local-member-id=caab82c67f3f4ad1] [remote-peer-id=6b27cfc0d7490063]
2022/08/25 07:20:22.573 log.go:85: [warning] etcdserver: [could not get cluster response from http://basic-pd-1.basic-pd-peer.tidb-cluster.svc:2380: Get "http://basic-pd-1.basic-pd-peer.tidb-cluster.svc:2380/members": dial tcp 10.0.3.16:2380: connect: connection refused]
[2022/08/25 07:20:22.573 +00:00] [ERROR] [etcdutil.go:70] ["failed to get cluster from remote"] [error="[PD:etcd:ErrEtcdGetCluster]could not retrieve cluster information from the given URLs"]
[2022/08/25 07:20:22.767 +00:00] [PANIC] [cluster.go:460] ["failed to update; member unknown"] [cluster-id=d9e392fb342bfa96] [local-member-id=caab82c67f3f4ad1] [unknown-remote-peer-id=2b86c59db64a77fc]
panic: failed to update; member unknown
goroutine 450 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000750300, 0xc00067e0c0, 0x3, 0x3)
        /nfs/cache/mod/go.uber.org/zap@v1.16.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc000276360, 0x2759a56, 0x20, 0xc00067e0c0, 0x3, 0x3)
        /nfs/cache/mod/go.uber.org/zap@v1.16.0/logger.go:226 +0x85
go.etcd.io/etcd/etcdserver/api/membership.(*RaftCluster).UpdateAttributes(0xc0006e0070, 0x2b86c59db64a77fc, 0xc005d8e630, 0xa, 0xc005dba940, 0x1, 0x4)
        /nfs/cache/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/etcdserver/api/membership/cluster.go:460 +0x9d1
go.etcd.io/etcd/etcdserver.(*applierV2store).Put(0xc001c4a540, 0xc005dc2580, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /nfs/cache/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/etcdserver/apply_v2.go:89 +0x966
go.etcd.io/etcd/etcdserver.(*EtcdServer).applyV2Request(0xc00017c680, 0xc005dc2580, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /nfs/cache/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/etcdserver/apply_v2.go:123 +0x248
go.etcd.io/etcd/etcdserver.(*EtcdServer).applyEntryNormal(0xc00017c680, 0xc0005e14d8)
        /nfs/cache/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/etcdserver/server.go:2178 +0xad4
go.etcd.io/etcd/etcdserver.(*EtcdServer).apply(0xc00017c680, 0xc004aef8e0, 0x240, 0x252, 0xc0001fc0a0, 0x0, 0xf3d34e, 0xc0005e1640)
        /nfs/cache/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/etcdserver/server.go:2117 +0x579
go.etcd.io/etcd/etcdserver.(*EtcdServer).applyEntries(0xc00017c680, 0xc0001fc0a0, 0xc001a1e200)
        /nfs/cache/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/etcdserver/server.go:1369 +0xe5
go.etcd.io/etcd/etcdserver.(*EtcdServer).applyAll(0xc00017c680, 0xc0001fc0a0, 0xc001a1e200)
        /nfs/cache/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/etcdserver/server.go:1093 +0x88
go.etcd.io/etcd/etcdserver.(*EtcdServer).run.func8(0x30f6530, 0xc001c20040)
        /nfs/cache/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/etcdserver/server.go:1038 +0x3c
go.etcd.io/etcd/pkg/schedule.(*fifo).run(0xc001c14000)
        /nfs/cache/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/pkg/schedule/schedule.go:157 +0xf3
created by go.etcd.io/etcd/pkg/schedule.NewFIFOScheduler
        /nfs/cache/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/pkg/schedule/schedule.go:70 +0x13b

【Attachments】

Please provide the version information of each component, such as cdc/tikv, which can be obtained by executing cdc version/tikv-server --version.
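
To help locate the membership problem behind the "failed to update; member unknown" panic above, here is a minimal sketch of what could be run from the surviving PD pod. The pod name basic-pd-2 is an assumption, it assumes the pingcap/pd image ships pd-ctl at /pd-ctl, and the call may time out while etcd quorum is lost:

# List the members the surviving PD/etcd node still knows about
kubectl exec -n tidb-cluster basic-pd-2 -- /pd-ctl -u http://127.0.0.1:2379 member

# Compare the member IDs in the output against the IDs in the panic log
# (local-member-id=caab82c67f3f4ad1, unknown-remote-peer-id=2b86c59db64a77fc);
# a peer that appears in one node's data directory (/var/lib/pd) but not in the
# cluster's membership view usually points at stale or mismatched PD data.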

| username: Kongdom | Original post link

Is the network down? Did you enable the firewall?

| username: xiaohetao | Original post link

Failed to establish connection. Check the connectivity, IP, SSH, etc., between the cluster servers.
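
A minimal sketch of how that connectivity could be verified from inside the cluster network, using a throwaway busybox pod (the pod and service names are taken from the logs in this thread; busybox as the debug image is an assumption):

# Start a temporary debug pod in the same namespace
kubectl run net-debug -n tidb-cluster --rm -it --image=busybox --restart=Never -- sh

# Then, inside the debug pod:
nslookup basic-pd-1.basic-pd-peer.tidb-cluster.svc
wget -qO- http://basic-pd-1.basic-pd-peer.tidb-cluster.svc:2380/members

# "connection refused" (as in the log) means DNS resolves and the network path works,
# but nothing is listening on 2380 -- i.e. the pd-1 process itself is down,
# rather than the network being broken.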

| username: atidat | Original post link

No. I manually established mutual trust between the nodes where the PD containers are located, but the same error still occurs.

| username: xfworld | Original post link

The Service layer is used to provide a unified network, but the logs show an obvious connectivity problem:

2022/08/25 07:20:22.573 log.go:85: [warning] etcdserver: [could not get cluster response from http://basic-pd-1.basic-pd-peer.tidb-cluster.svc:2380: Get "http://basic-pd-1.basic-pd-peer.tidb-cluster.svc:2380/members": dial tcp 10.0.3.16:2380: connect: connection refused]

The service layer can also be exposed through a load balancer, which is a better fit for TiDB.

Is the network between the Pods actually reachable?

For deploying TiDB on K8S, it is recommended to use TiDB Operator.
You can refer to this document:

You can also install it directly through the store…
https://kubesphere.com.cn/docs/v3.3/application-store/external-apps/deploy-tidb/

| username: atidat | Original post link

Bump bump bump bump.

| username: atidat | Original post link

Bump bump bump bump bump bump

| username: atidat | Original post link

The IPs of the other two failed pods can be pinged from the only running pod.

| username: wuxiangdong | Original post link

Compare basic-pd-0.yaml and basic-pd-2.yaml.
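
One way to do that comparison, sketched with the names used in this thread. Per-pod fields (name, UID, IP, node, status) will always differ; what matters is whether image, command, mounts, environment, and resources match:

kubectl get pod basic-pd-0 -n tidb-cluster -o yaml > /tmp/basic-pd-0.yaml
kubectl get pod basic-pd-2 -n tidb-cluster -o yaml > /tmp/basic-pd-2.yaml
diff /tmp/basic-pd-0.yaml /tmp/basic-pd-2.yaml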

| username: wuxiangdong | Original post link

Add a nodeSelector to pod0 to schedule it onto node14.

| username: atidat | Original post link

The specs are almost identical, and since they all come from the same StatefulSet, it's unlikely they would differ.

| username: wuxiangdong | Original post link

Scale the statefulset by setting --replicas=4, add a pod, and see if the new pod reports any errors.
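
A sketch of two ways this could be done. Note that when TiDB Operator manages the StatefulSet, a direct scale may be reconciled back to the value in the TidbCluster spec, so editing the TidbCluster object (assumed here to be named basic, as elsewhere in the thread) is usually the safer route:

# Directly (the operator may revert this):
kubectl scale statefulset basic-pd -n tidb-cluster --replicas=4

# Via the TidbCluster resource: change spec.pd.replicas and let the operator do the scaling
kubectl edit tc basic -n tidb-cluster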

| username: atidat | Original post link

Yes, it will. It went from 5 down to 3.

With 5 nodes it seemed like only 2 were available; with 3 nodes it seems like only 1 is available (doesn't that look like a split-brain scenario?).

| username: wuxiangdong | Original post link

Check the result of kubectl describe pod basic-pd-0 -n tidb-cluster.

| username: yiduoyunQ | Original post link

Option 1: First, check the cause of the pd-0 and pd-1 crashes and fix them, then see if the 3 replicas of pd can be restored to normal.
Option 2: Currently, the majority of replicas have failed, and the cluster should be unavailable. You can refer to the operator’s pd-recover for disaster recovery.
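
For Option 2, a rough sketch of the pd-recover invocation, for orientation only: the values below are placeholders, the cluster ID must be supplied in decimal (the panic log prints it in hex as d9e392fb342bfa96), the alloc ID must be larger than any ID the old cluster had already allocated, and the official TiDB Operator pd-recover document should be followed for the full procedure:

# Run against a freshly started, empty PD instance (placeholders, not real values):
pd-recover -endpoints http://basic-pd-0.basic-pd-peer.tidb-cluster.svc:2379 \
  -cluster-id <decimal-cluster-id> \
  -alloc-id <id-larger-than-any-previously-allocated>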

| username: xfworld | Original post link

Why is there only one PD alive?

Are the other two in CrashLoopBackOff?

I suggest you investigate the specific reason. You can refer to this blog:
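
A quick sketch of how that specific reason could be pulled out (names taken from this thread):

# Logs of the previous (crashed) container instance, not the current restart attempt:
kubectl logs basic-pd-0 -n tidb-cluster --previous

# Events around the pod (OOMKilled, probe failures, scheduling issues, ...):
kubectl get events -n tidb-cluster --field-selector involvedObject.name=basic-pd-0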

| username: atidat | Original post link

I forgot to include this output, adding it now.
This line seems unfamiliar:

Annotations:  kubernetes.io/limit-ranger: LimitRanger plugin set: memory request for container pd; memory limit for container pd

Name:         basic-pd-0
Namespace:    tidb-cluster
Priority:     0
Node:         node193.169.203.15/193.169.203.15
Start Time:   Thu, 25 Aug 2022 17:30:40 +0800
Labels:       app.kubernetes.io/component=pd
              app.kubernetes.io/instance=basic
              app.kubernetes.io/managed-by=tidb-operator
              app.kubernetes.io/name=tidb-cluster
              controller-revision-hash=basic-pd-766b5cb86
              statefulset.kubernetes.io/pod-name=basic-pd-0
Annotations:  kubernetes.io/limit-ranger: LimitRanger plugin set: memory request for container pd; memory limit for container pd
              prometheus.io/path: /metrics
              prometheus.io/port: 2379
              prometheus.io/scrape: true
Status:       Running
IP:           10.0.5.139
IPs:
  IP:           10.0.5.139
Controlled By:  StatefulSet/basic-pd
Containers:
  pd:
    Container ID:  docker://fc892a76a59b1eef82f45e9c54281aeee3495601e874d3ca3ad9ff6f2dafe597
    Image:         pingcap/pd:v5.2.1
    Image ID:      docker-pullable://pingcap/pd@sha256:e9766de6a85d3f262ac016e9a2421c8099f445eb70aba0741fbf8b7932ea117d
    Ports:         2380/TCP, 2379/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
      /bin/sh
      /usr/local/bin/pd_start_script.sh
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Thu, 25 Aug 2022 22:17:37 +0800
      Finished:     Thu, 25 Aug 2022 22:17:52 +0800
    Ready:          False
    Restart Count:  58
    Limits:
      memory:  24Gi
    Requests:
      memory:  4Gi
    Environment:
      NAMESPACE:          tidb-cluster (v1:metadata.namespace)
      PEER_SERVICE_NAME:  basic-pd-peer
      SERVICE_NAME:       basic-pd
      SET_NAME:           basic-pd
      TZ:                 UTC
    Mounts:
      /etc/pd from config (ro)
      /etc/podinfo from annotations (ro)
      /usr/local/bin from startup-script (ro)
      /var/lib/pd from pd (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-8ntz6 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  pd:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  pd-basic-pd-0
    ReadOnly:   false
  annotations:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.annotations -> annotations
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      basic-pd-3731616
    Optional:  false
  startup-script:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      basic-pd-3731616
    Optional:  false
  default-token-8ntz6:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-8ntz6
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                       From     Message
  ----     ------   ----                      ----     -------
  Normal   Pulled   38m (x52 over 4h48m)      kubelet  Container image "pingcap/pd:v5.2.1" already present on machine
  Warning  BackOff  3m48s (x1242 over 4h48m)  kubelet  Back-off restarting failed container

| username: xiaohetao | Original post link

What is the current memory usage of your PD server? How have you configured the memory parameters for your PD server?
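
A minimal way to answer both questions, assuming metrics-server is installed (needed for kubectl top) and the TidbCluster object is named basic as elsewhere in this thread:

# Current memory usage of the PD pods:
kubectl top pod -n tidb-cluster -l app.kubernetes.io/component=pd

# Memory requests/limits configured for PD in the TidbCluster spec
# (these may be empty if resources were left to the namespace LimitRanger,
# as the pod annotation above suggests):
kubectl get tc basic -n tidb-cluster -o jsonpath='{.spec.pd.requests}{"\n"}{.spec.pd.limits}{"\n"}'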

| username: wuxiangdong | Original post link

That is the annotation added by the LimitRanger admission plugin: the memory request is 4Gi and the limit is 24Gi.
You can delete this pod using:

kubectl delete pod basic-pd-0 -n tidb-cluster

and see what happens when it regenerates.
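
For completeness, a small sketch of watching the recreation after the delete (same names as above):

kubectl delete pod basic-pd-0 -n tidb-cluster
kubectl get pod -n tidb-cluster -w            # watch the StatefulSet recreate basic-pd-0
kubectl logs -f basic-pd-0 -n tidb-cluster    # follow the new container's startup log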