TiDB Cluster Deployed by TiOperator Shows Pending PODs

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiOperator部署的TIDB集群出现Pending的POD (A TiDB cluster deployed by TiOperator shows Pending pods)

| username: TiDBer_G64jJ9u8

[TiDB Usage Environment] Testing
[TiDB Version] 6.5.0
[Reproduction Path] After running under sustained load for a while, PD failed and the TiKV logs reported errors.
[Encountered Issue: Symptoms and Impact] Observed via kubectl that the TiDB and PD pods were stuck in Pending.
[Resource Configuration] 34 cores, SATA disk
[Attachments: Screenshots/Logs/Monitoring]

| username: tidb菜鸟一只 | Original post link

Check the logs.

| username: yiduoyunQ | Original post link

  1. The cluster resources are likely insufficient for the load, causing the tidb/pd/tikv pods to restart continuously; the restart counts in the screenshot suggest this.
  2. Auto failover is enabled by default. The pods it scales out are Pending, which means the nodes do not have enough free resources. See "Common Deployment Failures of TiDB on Kubernetes" in the PingCAP documentation center.
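If the extra failover pods are unwanted, failover scale-out can be capped per component in the TidbCluster spec. A minimal sketch, assuming the cluster is named `basic` in namespace `my-space` (names taken from the output later in this thread) and that setting `maxFailoverCount: 0` disables failover for that component:

```shell
# Cap automatic failover: with maxFailoverCount set to 0,
# tidb-operator will not scale out replacement pods for the component.
kubectl -n my-space patch tidbcluster basic --type merge \
  -p '{"spec":{"pd":{"maxFailoverCount":0},"tikv":{"maxFailoverCount":0},"tidb":{"maxFailoverCount":0}}}'
```

Note this only prevents new failover pods from being created; pods that were already scaled out remain until failover recovery is performed.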
| username: cassblanca | Original post link

Won't SATA disks hold TiDB back and keep it from delivering its advantages?

| username: TiDBer_G64jJ9u8 | Original post link

Our analysis shows the opposite: K8S considers the pod normal and Running, but the tidb-controller-manager deployed by TiOperator detects that the service is actually unavailable and triggers a pod restart. Because the PV and PVC are already bound and occupied, the restarted pod ends up stuck in Pending.

| username: yiduoyunQ | Original post link

TiOperator will not proactively trigger pod restarts.
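One way to settle what actually triggered the restarts is to inspect the container's last terminated state and the pod's events rather than the restart counter alone. A sketch (the pod name `basic-pd-0` is just an example):

```shell
# Show why the container last exited: reason (Error, OOMKilled, ...),
# exit code, and timestamps of the previous run.
kubectl -n my-space get pod basic-pd-0 \
  -o jsonpath='{.status.containerStatuses[0].lastState}' && echo

# List recent events for the pod; a liveness-probe kill shows up as a
# "Killing" event whose message mentions the failed liveness probe.
kubectl -n my-space get events \
  --field-selector involvedObject.name=basic-pd-0 \
  --sort-by=.lastTimestamp
```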

| username: TiDBer_G64jJ9u8 | Original post link

One of the verification environments encountered an issue with a node failure, and the problem reappeared:

[root@kylin-122 global]# kubectl get pods -Aowide | grep basic
my-space   basic-discovery-8568cffbf9-fnxc6           1/1     Running            2          12d     100.84.119.105    kylin-121   <none>           <none>
my-space   basic-pd-0                                 1/1     Running            500        12d     100.84.119.102    kylin-121   <none>           <none>
***my-space   basic-pd-1                                 0/1     Pending            0          2d9h    <none>            <none>      <none>           <none>***
my-space   basic-pd-2                                 1/1     Terminating        0          12d     100.109.240.226   kylin-123   <none>           <none>
***my-space   basic-pd-3                                 0/1     Pending            0          6d17h   <none>            <none>      <none>           <none>***
my-space   basic-tidb-0                               1/2     Running            418        2d9h    100.68.55.65      kylin-122   <none>           <none>
my-space   basic-tidb-1                               1/2     Running            423        12d     100.84.119.115    kylin-121   <none>           <none>
my-space   basic-tidb-2                               2/2     Terminating        0          12d     100.109.240.218   kylin-123   <none>           <none>
my-space   basic-tidb-3                               0/2     Pending            0          6d17h   <none>            <none>      <none>           <none>
my-space   basic-tikv-0                               1/1     Running            2          12d     100.84.119.95     kylin-121   <none>           <none>
my-space   basic-tikv-1                               1/1     Terminating        0          12d     100.109.240.223   kylin-123   <none>           <none>
my-space   basic-tikv-2                               1/1     Running            0          2d9h    100.68.55.84      kylin-122   <none>           <none>
[root@kylin-122 global]# kubectl get pvc -n my-space -owide | grep tidb
NAME                   STATUS    VOLUME                                   CAPACITY   ACCESS MODES   STORAGECLASS                  AGE    
pd-basic-pd-0          Bound     pv-local-tidb-pd-my-space-kylin-121   1Gi        RWO            tidb-pd-storage-my-space   16d    
***pd-basic-pd-1          Pending                                                                   tidb-pd-storage-my-space   2d9h***   
pd-basic-pd-2          Bound     pv-local-tidb-pd-my-space-kylin-123   1Gi        RWO            tidb-pd-storage-my-space   29d    
***pd-basic-pd-3          Pending                                                                   tidb-pd-storage-my-space   16d***    
tikv-basic-tikv-0      Bound     pv-local-tidb-kv-my-space-kylin-121   20Gi       RWO            tidb-kv-storage-my-space   29d    
tikv-basic-tikv-1      Bound     pv-local-tidb-kv-my-space-kylin-123   20Gi       RWO            tidb-kv-storage-my-space   29d    
tikv-basic-tikv-2      Bound     pv-local-tidb-kv-my-space-kylin-122   20Gi       RWO            tidb-kv-storage-my-space   29d    
tikv-basic-tikv-3      Pending                                                                   tidb-kv-storage-my-space   28d    
[root@kylin-122 global]# kubectl get pv -owide | grep tidb
NAME                                     CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                              STORAGECLASS                  REASON   AGE   
pv-local-tidb-kv-my-space-kylin-121   20Gi       RWO            Retain           Bound      my-space/tikv-basic-tikv-0      tidb-kv-storage-my-space            29d   
pv-local-tidb-kv-my-space-kylin-122   20Gi       RWO            Retain           Bound      my-space/tikv-basic-tikv-2      tidb-kv-storage-my-space            29d   
pv-local-tidb-kv-my-space-kylin-123   20Gi       RWO            Retain           Bound      my-space/tikv-basic-tikv-1      tidb-kv-storage-my-space            29d   
pv-local-tidb-pd-my-space-kylin-121   1Gi        RWO            Retain           Bound      my-space/pd-basic-pd-0          tidb-pd-storage-my-space            29d   
***pv-local-tidb-pd-my-space-kylin-122   1Gi        RWO            Retain           Released   my-space/pd-basic-pd-1          tidb-pd-storage-my-space            29d***   
pv-local-tidb-pd-my-space-kylin-123   1Gi        RWO            Retain           Bound      my-space/pd-basic-pd-2          tidb-pd-storage-my-space            29d   
[root@kylin-122 global]# kubectl describe pvc -n my-space pd-basic-pd-1
Name:          pd-basic-pd-1
Namespace:     my-space
StorageClass:  tidb-pd-storage-my-space
Status:        Pending
Volume:
Labels:        app.kubernetes.io/component=pd
               app.kubernetes.io/instance=basic
               app.kubernetes.io/managed-by=tidb-operator
               app.kubernetes.io/name=tidb-cluster
Annotations:   <none>
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Used By:       basic-pd-1
Events:
  Type    Reason               Age                     From                         Message
  ----    ------               ----                    ----                         -------
  ***Normal  WaitForPodScheduled  54s (x13822 over 2d9h)  persistentvolume-controller  waiting for pod basic-pd-1 to be scheduled***
[root@kylin-122 global]# kubectl describe pods -n my-space basic-pd-1
Name:           basic-pd-1
Namespace:      my-space
Priority:       0
Node:           <none>
Labels:         app.kubernetes.io/component=pd
                app.kubernetes.io/instance=basic
                app.kubernetes.io/managed-by=tidb-operator
                app.kubernetes.io/name=tidb-cluster
                controller-revision-hash=basic-pd-648c74f97f
                statefulset.kubernetes.io/pod-name=basic-pd-1
Annotations:    kubernetes.io/limit-ranger: LimitRanger plugin set: cpu, memory request for container pd
                prometheus.io/path: /metrics
                prometheus.io/port: 2379
                prometheus.io/scrape: true
Status:         Pending
IP:
IPs:            <none>
Controlled By:  StatefulSet/basic-pd
Containers:
  pd:
    Image:       pingcap/pd:v4.0.0
    Ports:       2380/TCP, 2379/TCP
    Host Ports:  0/TCP, 0/TCP
    Command:
      /bin/sh
      /usr/local/bin/pd_start_script.sh
    Requests:
      cpu:     50m
      memory:  40Mi
    Environment:
      NAMESPACE:          my-space (v1:metadata.namespace)
      PEER_SERVICE_NAME:  basic-pd-peer
      SERVICE_NAME:       basic-pd
      SET_NAME:           basic-pd
      TZ:                 UTC
    Mounts:
      /etc/pd from config (ro)
      /etc/podinfo from annotations (ro)
      /usr/local/bin from startup-script (ro)
      /var/lib/pd from pd (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-rncxm (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  pd:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  pd-basic-pd-1
    ReadOnly:   false
  annotations:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.annotations -> annotations
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      basic-pd
    Optional:  false
  startup-script:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      basic-pd
    Optional:  false
  default-token-rncxm:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-rncxm
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  node-role.kubernetes.io/master=
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  60s (x2308 over 2d9h)  default-scheduler  0/3 nodes are available: 1 node(s) had taint {node.kubernetes.io/unreachable: }, ***that the pod didn't tolerate, 2 node(s) didn't find available persistent volumes to bind.***
| username: TiDBer_G64jJ9u8 | Original post link

Among them, the following pods are stuck Terminating:
my-space basic-pd-2 1/1 Terminating 0 12d 100.109.240.226 kylin-123
my-space basic-tidb-2 2/2 Terminating 0 12d 100.109.240.218 kylin-123
my-space basic-tikv-1 1/1 Terminating 0 12d 100.109.240.223 kylin-123

This node has a physical fault.
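Pods on a node that is hard down stay Terminating indefinitely, because the kubelet that must confirm their deletion is gone. Once the node is confirmed permanently failed (and only then: force-deleting pods that might still be running risks two instances writing the same data), the stuck pod objects can be removed so the StatefulSets recreate them. A sketch using the pod and node names above:

```shell
# Force-remove pod objects stranded on the dead node kylin-123.
kubectl -n my-space delete pod basic-pd-2 basic-tidb-2 basic-tikv-1 \
  --force --grace-period=0

# Or delete the Node object itself; Kubernetes then garbage-collects
# every pod that was bound to it.
kubectl delete node kylin-123
```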

| username: yilong | Original post link

It looks like one node has failed, and the remaining nodes have no available volumes either. Try repairing that node first, then check again.

| username: TiDBer_G64jJ9u8 | Original post link

This violates the basic high-reliability requirement of a 3-node cluster. A single node failure should absolutely not affect the functionality of the cluster. Otherwise, what’s the point of using a cluster? It’s a waste of resources.

Given the above, the problem can basically be pinned down to how the PVCs and PVs were bound and unbound.
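The PV listing above supports this: `pv-local-tidb-pd-my-space-kylin-122` is `Released` yet still carries a claimRef to `my-space/pd-basic-pd-1`. With a `Retain` reclaim policy, a Released PV keeps the old PVC's UID in its claimRef, so a recreated PVC of the same name can never re-bind to it, and the pod stays Pending. A sketch of the manual cleanup (this makes the PV `Available` again; the pod must still be scheduled onto kylin-122, since a local PV is node-affine):

```shell
# Drop the stale claimRef so the Released PV becomes Available again.
kubectl patch pv pv-local-tidb-pd-my-space-kylin-122 \
  --type json -p '[{"op":"remove","path":"/spec/claimRef"}]'

# Verify: the PV should report Available, and pd-basic-pd-1 should
# bind on the next scheduling attempt.
kubectl get pv pv-local-tidb-pd-my-space-kylin-122
kubectl -n my-space get pvc pd-basic-pd-1
```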

| username: redgame | Original post link

If a PD node fails, you can try restarting the PD node.