Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: k8s 上tiflash不断重启 (TiFlash keeps restarting on k8s)
[Version] OS: 4.19.90-17.ky10.aarch64, k8s: v1.24.9, operator: 1.4.0, tidb 6.1.2
dyrnq/local-volume-provisioner:v2.5.0 (POD timezone not adjusted)
Configuration:
tiflash:
  baseImage: 10.xxxx/zongbu-sre/tiflash-arm64:v6.1.2
  replicas: 3
  limits:
    cpu: 12000m
    memory: 16Gi
  imagePullPolicy: IfNotPresent
  storageClaims:
  - resources:
      requests:
        storage: 500Gi
    storageClassName: tiflash-storage
Status:
tidb-test-cluster-tiflash-0 4/4 Running 26 (6m28s ago) 3h36m
tidb-test-cluster-tiflash-1 4/4 Running 27 (5m59s ago) 3h36m
tidb-test-cluster-tiflash-2 4/4 Running 25 (11m ago) 3h36m
Logs:
previous.txt (620.3 KB)
current.log (620.3 KB)
I didn't see any exit-related messages in the logs, so I'm posting the `kubectl describe` output to see if it shows why.
There isn't much information there either. After posting, the pods stabilized for about 20 minutes:
tidb-test-cluster-tiflash-0 4/4 Running 26 (16m ago) 3h46m
tidb-test-cluster-tiflash-1 4/4 Running 27 (16m ago) 3h46m
tidb-test-cluster-tiflash-2 4/4 Running 25 (21m ago) 3h46m
Status: Running
IP: 172.16.228.157
IPs:
IP: 172.16.228.157
Controlled By: StatefulSet/tidb-test-cluster-tiflash
Init Containers:
init:
Container ID: containerd://3067f4400a71ad11516856dccdb2730aee11acf59209fb410e7c1a30f975c937
Image: 10.172.49.246/zongbu-sre/alpine-arm64:3.17.0
Image ID: 10.172.49.246/zongbu-sre/alpine-arm64@sha256:af06af3514c44a964d3b905b498cf6493db8f1cde7c10e078213a89c87308ba0
Port: <none>
Host Port: <none>
Command:
sh
-c
set -ex;ordinal=`echo ${POD_NAME} | awk -F- '{print $NF}'`;sed s/POD_NUM/${ordinal}/g /etc/tiflash/config_templ.toml > /data0/config.toml;sed s/POD_NUM/${ordinal}/g /etc/tiflash/proxy_templ.toml > /data0/proxy.toml
State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 30 Jan 2023 11:26:57 +0800
Finished: Mon, 30 Jan 2023 11:26:57 +0800
Ready: True
Restart Count: 0
Environment:
POD_NAME: tidb-test-cluster-tiflash-2 (v1:metadata.name)
Mounts:
/data0 from data0 (rw)
/etc/tiflash from config (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ccks8 (ro)
Containers:
tiflash:
Container ID: containerd://4f01a3705a5e355f152addc929e5d71b8608212dedc2a60845c8166ab24358a7
Image: 10.172.49.246/zongbu-sre/tiflash-arm64:v6.1.2
Image ID: 10.172.49.246/zongbu-sre/tiflash-arm64@sha256:96f39d55b339c1b9e61f09fe8c8d6e0ef69add8557a0cf77a4b340d561f8c0aa
Ports: 3930/TCP, 20170/TCP, 9000/TCP, 8123/TCP, 9009/TCP, 8234/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
Command:
/bin/sh
-c
/tiflash/tiflash server --config-file /data0/config.toml
State: Running
Started: Mon, 30 Jan 2023 14:57:11 +0800
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Mon, 30 Jan 2023 14:51:26 +0800
Finished: Mon, 30 Jan 2023 14:52:03 +0800
Ready: True
Restart Count: 25
Limits:
cpu: 12
memory: 16Gi
Requests:
cpu: 12
memory: 16Gi
Environment:
NAMESPACE: default (v1:metadata.namespace)
CLUSTER_NAME: tidb-test-cluster
HEADLESS_SERVICE_NAME: tidb-test-cluster-tiflash-peer
CAPACITY: 0
TZ: UTC
Mounts:
/data0 from data0 (rw)
/etc/podinfo from annotations (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ccks8 (ro)
serverlog:
Container ID: containerd://3b48f767a9a2fcb893ce14902a2f822d2dc287cc2760fed926f9b4c202347eb6
Image: 10.172.49.246/zongbu-sre/alpine-arm64:3.17.0
Image ID: 10.172.49.246/zongbu-sre/alpine-arm64@sha256:af06af3514c44a964d3b905b498cf6493db8f1cde7c10e078213a89c87308ba0
Port: <none>
Host Port: <none>
Command:
sh
-c
touch /data0/logs/server.log; tail -n0 -F /data0/logs/server.log;
State: Running
Started: Mon, 30 Jan 2023 11:27:07 +0800
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/data0 from data0 (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ccks8 (ro)
errorlog:
Container ID: containerd://072de3619af7b33017fae83e57253b56b464eb6b1d8e45d4fc95838bcd624cbb
Image: 10.172.49.246/zongbu-sre/alpine-arm64:3.17.0
Image ID: 10.172.49.246/zongbu-sre/alpine-arm64@sha256:af06af3514c44a964d3b905b498cf6493db8f1cde7c10e078213a89c87308ba0
Port: <none>
Host Port: <none>
Command:
sh
-c
touch /data0/logs/error.log; tail -n0 -F /data0/logs/error.log;
State: Running
Started: Mon, 30 Jan 2023 11:27:07 +0800
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/data0 from data0 (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ccks8 (ro)
clusterlog:
Container ID: containerd://3c9fa1dc607d73c78ab72bd380c5bf6121cde586875c4963b41e9d4293579685
Image: 10.172.49.246/zongbu-sre/alpine-arm64:3.17.0
Image ID: 10.172.49.246/zongbu-sre/alpine-arm64@sha256:af06af3514c44a964d3b905b498cf6493db8f1cde7c10e078213a89c87308ba0
Port: <none>
Host Port: <none>
Command:
sh
-c
touch /data0/logs/flash_cluster_manager.log; tail -n0 -F /data0/logs/flash_cluster_manager.log;
State: Running
Started: Mon, 30 Jan 2023 11:27:07 +0800
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/data0 from data0 (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ccks8 (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
data0:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: data0-tidb-test-cluster-tiflash-2
ReadOnly: false
annotations:
Type: DownwardAPI (a volume populated by information about the pod)
Items:
metadata.annotations -> annotations
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: tidb-test-cluster-tiflash-3336363
Optional: false
kube-api-access-ccks8:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Created 84m (x14 over 3h45m) kubelet Created container tiflash
Normal Started 84m (x14 over 3h45m) kubelet Started container tiflash
Warning BackOff 20m (x307 over 150m) kubelet Back-off restarting failed container
Normal Pulled 15m (x25 over 3h9m) kubelet Container image "10.172.49.246/zongbu-sre/tiflash-arm64:v6.1.2" already present on machine
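For reference, the `Exit Code: 137` in the describe output above decodes as 128 + signal 9 (SIGKILL), which is what the kernel OOM killer sends; a quick sketch of the decoding:

```shell
#!/bin/sh
# Container exit codes above 128 mean "killed by signal (code - 128)".
exit_code=137
sig=$((exit_code - 128))
echo "signal ${sig}: $(kill -l ${sig})"   # signal 9: KILL
```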
After things stabilized, I reran the TPC-H test and confirmed the pods are being restarted due to OOM. The initial restarts were most likely OOM kills as well, but it's unclear why there were such long CrashLoopBackOff stretches in between: after a long back-off period, the pods would run for a few minutes and then enter another crash loop. (The kubelet's restart back-off roughly doubles on each failure up to a 5-minute cap, which can make a repeatedly OOM-killed pod look idle for long stretches.)
tidb-test-cluster-tiflash-2 3/4 OOMKilled 25 (24m ago) 3h49m
tidb-test-cluster-tiflash-2 4/4 Running 26 (2s ago) 3h49m
tidb-test-cluster-tiflash-0 3/4 OOMKilled 26 (19m ago) 3h50m
tidb-test-cluster-tiflash-0 4/4 Running 27 (1s ago) 3h50m
tidb-test-cluster-tiflash-1 3/4 OOMKilled 27 (22m ago) 3h53m
OOM occurred, increase the memory size.
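A minimal sketch of raising the TiFlash memory limit in the TidbCluster spec. The 32Gi figure is illustrative, not a recommendation; size it to the actual TPC-H working set, and set `requests` equal to `limits` so the scheduler reserves the memory (this also moves the pod to the Guaranteed QoS class):

```yaml
# Hypothetical fragment of the TidbCluster manifest (values are examples).
tiflash:
  replicas: 3
  requests:
    cpu: 12000m
    memory: 32Gi
  limits:
    cpu: 12000m
    memory: 32Gi
```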
You need to check the TiFlash logs for this. The logs you posted don’t contain relevant information. You can try increasing the memory first or limiting TiFlash’s memory usage.
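If raising the pod limit isn't an option, TiFlash's own config can cap per-query and total query memory. A sketch, assuming TiDB Operator's `spec.tiflash.config.config` field and TiFlash's ClickHouse-style profile settings; the byte values are illustrative, and the exact keys should be verified against the TiFlash documentation for v6.1:

```yaml
# Hypothetical fragment of the TidbCluster manifest.
tiflash:
  config:
    config: |
      [profiles.default]
      # Bytes; 0 means unlimited. Keep these below the container limit.
      max_memory_usage = 10000000000
      max_memory_usage_for_all_queries = 12000000000
```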
I increased the memory, and indeed there are no more OOMs; everything is normal now. It's just that it kept crash looping before, and I don't know why.
I have a 2 TB TiFlash node that failed and couldn't start again after a restart. The logs show it hits a fatal error during startup. It hasn't been resolved yet, so I temporarily added another TiFlash node as a workaround.
I don’t know much about TiFlash’s code, but if it keeps crash looping, you can check the logs for any clues.
This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.