TiDB component pod fails to start

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb组件的pod无法启动

| username: liyuntang

[TiDB Usage Environment]
Production environment deployed in k8s, with more than 1 million data tables

[TiDB Version]
v6.5.0

[Reproduction Path]
Adjusted the CPU, memory, and replica configuration of the TiDB component.
Previously configured as 20 CPU / 50Gi memory / 5 replicas, changed to 15 CPU / 32Gi memory / 3 replicas. Because the pods failed to start for a long time and reported OOM errors, the configuration was reverted to 20 CPU / 50Gi memory / 3 replicas.
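
For reference, a minimal sketch of how a change like this is typically applied when the cluster is managed by tidb-operator. The namespace and cluster name ("basic") are taken from the pod description further down; the exact field layout under spec.tidb is an assumption and should be verified against the actual TidbCluster CR before applying anything.

    # Hypothetical example of the resize described above (verify field names against your CR)
    NS=8b72b77e-dbb1-4f43-ba62-512230eb7668
    kubectl -n "$NS" patch tidbcluster basic --type merge -p '
    {
      "spec": {
        "tidb": {
          "replicas": 3,
          "requests": { "cpu": "15", "memory": "32Gi" },
          "limits":   { "cpu": "15", "memory": "32Gi" }
        }
      }
    }'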

[Encountered Issue: Problem Phenomenon and Impact]
The TiDB pod cannot start, causing the configuration change to fail.

[Resource Configuration]

[Attachments: Screenshots/Logs/Monitoring]

Logs:
aaa.log (17.4 KB)

Current TiDB status

Pod status
Name:          basic-tidb-5
Namespace:     8b72b77e-dbb1-4f43-ba62-512230eb7668
Node:          10.40.146.12/10.40.146.12
Start Time:    Mon, 03 Apr 2023 18:42:28 +0800
Labels:        app.kubernetes.io/component=tidb
               app.kubernetes.io/instance=basic
               app.kubernetes.io/managed-by=tidb-operator
               app.kubernetes.io/name=tidb-cluster
               controller-revision-hash=basic-tidb-86dd5f65
               statefulset.kubernetes.io/pod-name=basic-tidb-5
Annotations:   cni.projectcalico.org/podIP: 36.0.4.48/32
               prometheus.io/path: /metrics
               prometheus.io/port: 10080
               prometheus.io/scrape: true
Status:        Running
IP:            36.0.4.48
IPs:
  IP:          36.0.4.48
Controlled By: StatefulSet/basic-tidb
Containers:
  slowlog:
    Container ID:   docker://d5e7b48abd7c95f9920d154f9fe96f6ed70da57deb13263ce17409414c21d81f
    Image:          hub.kce.ksyun.com/nosql/tidb/busybox:1.26.2
    Image ID:       docker-pullable://busybox@sha256:be3c11fdba7cfe299214e46edc642e09514dbb9bbefcd0d3836c05a1e0cd0642
    Port:
    Host Port:
    Command:
      sh
      -c
      touch /var/log/tidb/slowlog; tail -n0 -F /var/log/tidb/slowlog;
    State:          Running
      Started:      Mon, 03 Apr 2023 18:42:29 +0800
    Ready:          True
    Restart Count:  0
    Environment:
    Mounts:
      /var/log/tidb from slowlog (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-rdcrn (ro)
  tidb:
    Container ID:   docker://1e9c35255840c0402b468fefbe6d9d2f22348188d7c2126065c7e02b5dd48b29
    Image:          hub.kce.ksyun.com/nosql/tidb/tidb:v6.5.0
    Image ID:       docker-pullable://hub.kce.ksyun.com/nosql/tidb/tidb@sha256:9aae3b7bac538f017dc21633fd1afe4389b53619f867d67705caaff1de21c21e
    Ports:          4000/TCP, 10080/TCP
    Host Ports:     0/TCP, 0/TCP
    Command:
      /bin/sh
      /usr/local/bin/tidb_start_script.sh
    State:          Running
      Started:      Mon, 03 Apr 2023 18:42:29 +0800
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     20
      memory:  50Gi
    Requests:
      cpu:     20
      memory:  50Gi
    Readiness:  tcp-socket :4000 delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      CLUSTER_NAME:           basic
      TZ:                     Asia/Shanghai
      BINLOG_ENABLED:         false
      SLOW_LOG_FILE:          /var/log/tidb/slowlog
      POD_NAME:               basic-tidb-5 (v1:metadata.name)
      NAMESPACE:              8b72b77e-dbb1-4f43-ba62-512230eb7668 (v1:metadata.namespace)
      HEADLESS_SERVICE_NAME:  basic-tidb-peer
    Mounts:
      /etc/podinfo from annotations (ro)
      /etc/tidb from config (ro)
      /usr/local/bin from startup-script (ro)
      /var/log/tidb from slowlog (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-rdcrn (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  annotations:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.annotations -> annotations
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      basic-tidb-3961643
    Optional:  false
  startup-script:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      basic-tidb-3961643
    Optional:  false
  slowlog:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:
  default-token-rdcrn:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-rdcrn
    Optional:    false
QoS Class:       Burstable
Node-Selectors:
Tolerations:
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  18m                default-scheduler  Successfully assigned 8b72b77e-dbb1-4f43-ba62-512230eb7668/basic-tidb-5 to 10.40.146.12
  Normal   Pulled     18m                kubelet            Container image "hub.kce.ksyun.com/nosql/tidb/busybox:1.26.2" already present on machine
  Normal   Created    18m                kubelet            Created container slowlog
  Normal   Started    18m                kubelet            Started container slowlog
  Normal   Pulled     18m                kubelet            Container image "hub.kce.ksyun.com/nosql/tidb/tidb:v6.5.0" already present on machine
  Normal   Created    18m                kubelet            Created container tidb
  Normal   Started    18m                kubelet            Started container tidb
  Warning  Unhealthy  3m (x90 over 17m)  kubelet            Readiness probe failed: dial tcp 36.0.4.48:4000: connect: connection refused

The svc status of basic-tidb is as follows:
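
For reference, this status can also be pulled directly with kubectl; a minimal sketch, assuming the usual tidb-operator service names (basic-tidb and basic-tidb-peer, the latter visible in the HEADLESS_SERVICE_NAME variable above):

    NS=8b72b77e-dbb1-4f43-ba62-512230eb7668
    kubectl -n "$NS" get svc basic-tidb basic-tidb-peer     # client service + headless peer service
    kubectl -n "$NS" get endpoints basic-tidb -o wide       # only pods that pass readiness appear here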

| username: tidb狂热爱好者 | Original post link

Try removing the readiness probe.

| username: liyuntang | Original post link

Will the readiness probe affect the startup of the pod? As I understand it, the readiness probe only controls whether the pod's IP is added to the Service's Endpoints, which doesn't seem to match my issue. basic-tidb-5 did come up once along the way, but its port 4000 was still not reachable via telnet.
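
A minimal sketch of how both sides of this can be checked, assuming the names from this thread and a locally installed mysql client (the client is an assumption, not something shown in the original post):

    NS=8b72b77e-dbb1-4f43-ba62-512230eb7668

    # Readiness condition of the pod (stays False while the probe keeps failing)
    kubectl -n "$NS" get pod basic-tidb-5 \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'; echo

    # Whether tidb-server is actually accepting connections on port 4000 yet
    kubectl -n "$NS" port-forward pod/basic-tidb-5 4000:4000 & PF_PID=$!
    sleep 2
    mysql -h 127.0.0.1 -P 4000 -u root -e 'SELECT tidb_version()'
    kill "$PF_PID"    # stop the background port-forward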

| username: yiduoyunQ | Original post link

Method 1: Resolve the TiDB OOM so that all TiDB pods become ready.
Method 2: Find a way to bypass the TiDB pod readiness checks, for example by deleting the TiDB StatefulSet and letting synchronization start again from 0 (see the sketch below).
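
A rough sketch of what method 2 could look like in practice; this is the more aggressive option, and the exact outcome depends on how tidb-operator reconciles the cluster, so treat it as an illustration rather than a recommendation:

    NS=8b72b77e-dbb1-4f43-ba62-512230eb7668
    # Delete the TiDB StatefulSet and let the operator recreate it and re-sync from ordinal 0
    kubectl -n "$NS" delete statefulset basic-tidb
    # Or, less drastically, delete only the stuck pod and let the StatefulSet controller recreate it
    kubectl -n "$NS" delete pod basic-tidb-5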

| username: liyuntang | Original post link

The issue has been resolved. The memory allocated was insufficient, so the TiDB pod was OOM-killed after running for a while. This was not obvious at first, but when I later checked the pod status I found exit code 137, which confirmed it was an OOM kill. Increasing the memory resolved the problem.
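
A short sketch of how that exit code can be read back from the pod status (pod and container names taken from this thread):

    NS=8b72b77e-dbb1-4f43-ba62-512230eb7668
    kubectl -n "$NS" get pod basic-tidb-5 \
      -o jsonpath='{.status.containerStatuses[?(@.name=="tidb")].lastState.terminated.exitCode}'; echo
    # 137 means the container was killed with SIGKILL; for a container running at its memory
    # limit this is normally an OOM kill, which kubectl describe pod reports as Reason: OOMKilled.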

In this case the readiness probe should not be skipped: although the pod has started, the tidb-server inside it has not finished starting (that takes about 30 minutes). If the readiness probe were skipped, business requests would be forwarded to this pod, leading to request errors and impacting the business.

| username: liyuntang | Original post link

Another question: why does the tidb-server take 30 minutes to start successfully? After analyzing the logs, I observed the following:
Total time taken: 30 minutes

I don't understand what the step logged as ["refreshServerIDTTL succeed"] [serverID=2241294] ["lease id"=76628659b1e04af0] is for, why each occurrence takes 5 minutes (where does the time go?), or why it runs serially (the number of serial rounds seems related to the number of TiDB replicas). By that logic, if the TiDB replica count were very large (hundreds or thousands), would a single startup take several days?

Another point is whether this process is related to the number of tables. I tested it on two clusters: Cluster One with 10 tables and Cluster Two with 200,000 tables. The startup time for Cluster One is within 10 seconds, while for Cluster Two it is around 2 minutes.
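
One simple way to measure this per pod, as a sketch (it records the pod start time and then waits for the Ready condition, which only turns True once the readiness probe can reach port 4000):

    NS=8b72b77e-dbb1-4f43-ba62-512230eb7668
    POD=basic-tidb-5
    kubectl -n "$NS" get pod "$POD" -o jsonpath='{.status.startTime}'; echo
    until [ "$(kubectl -n "$NS" get pod "$POD" \
          -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')" = "True" ]; do
      sleep 30
    done
    date -u +%Y-%m-%dT%H:%M:%SZ    # compare with startTime above to get the startup duration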

| username: tidb狂热爱好者 | Original post link

dba-kit has already reported this to the official team.

| username: liyuntang | Original post link

Okay, understood.

| username: Billmay表妹 | Original post link

The issue has been fixed. Please upgrade to v6.5.1 to resolve it.
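
For the upgrade itself, a minimal sketch of the usual tidb-operator flow (change spec.version and let the operator roll the components; namespace, cluster name, and labels are taken from this thread):

    NS=8b72b77e-dbb1-4f43-ba62-512230eb7668
    kubectl -n "$NS" patch tidbcluster basic --type merge -p '{"spec":{"version":"v6.5.1"}}'
    kubectl -n "$NS" get pods -l app.kubernetes.io/instance=basic,app.kubernetes.io/component=tidb -w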

| username: dba-kit | Original post link

You can check this post. Even after the 6.5.1 fix there is still a potential issue: although statistics loading is now asynchronous and no longer blocks tidb-server startup, until the init stats finish loading, SQL running on that instance falls back to pseudo statistics, which may cause the wrong index to be chosen.
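
A quick way to see whether a query is still running on pseudo statistics, as a sketch (the host, table, and mysql client below are placeholders, not from the original post):

    # While init stats are still loading, the operator info in the plan contains "stats:pseudo"
    mysql -h <tidb-host> -P 4000 -u root -e "EXPLAIN SELECT * FROM test.t WHERE a = 1"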

| username: dba-kit | Original post link

If you deploy on k8s and perform rolling upgrades of TiDB, the v6.5.0 behavior of not serving traffic until the init stats are complete might actually be preferable.

| username: liyuntang | Original post link

Okay, tomorrow I will try version 6.5.1. Thank you, everyone.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.