TiFlash on k8s keeps restarting

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: k8s 上tiflash不断重启

| username: h5n1

[Version] OS: 4.19.90-17.ky10.aarch64, k8s: v1.24.9, operator: 1.4.0, tidb 6.1.2
dyrnq/local-volume-provisioner:v2.5.0 (POD timezone not adjusted)

Configuration:

  tiflash:
    baseImage: 10.xxxx/zongbu-sre/tiflash-arm64:v6.1.2
    replicas: 3
    limits:
      cpu: 12000m
      memory: 16Gi
    imagePullPolicy: IfNotPresent
    storageClaims:
      - resources:
          requests:
            storage: 500Gi
        storageClassName: tiflash-storage

Status:

tidb-test-cluster-tiflash-0                    4/4     Running   26 (6m28s ago)   3h36m
tidb-test-cluster-tiflash-1                    4/4     Running   27 (5m59s ago)   3h36m
tidb-test-cluster-tiflash-2                    4/4     Running   25 (11m ago)     3h36m

Logs:

previous.txt (620.3 KB)

current.log (620.3 KB)

| username: TiDBer_jYQINSnf | Original post link

I didn’t see anything about the exit in the logs you posted. Can you run kubectl describe on the Pod to see why it restarted?

| username: h5n1 | Original post link

There isn’t much useful information in the logs. Since posting, it has been stable for about 20 minutes.

tidb-test-cluster-tiflash-0                    4/4     Running   26 (16m ago)   3h46m
tidb-test-cluster-tiflash-1                    4/4     Running   27 (16m ago)   3h46m
tidb-test-cluster-tiflash-2                    4/4     Running   25 (21m ago)   3h46m
Status:       Running
IP:           172.16.228.157
IPs:
  IP:           172.16.228.157
Controlled By:  StatefulSet/tidb-test-cluster-tiflash
Init Containers:
  init:
    Container ID:  containerd://3067f4400a71ad11516856dccdb2730aee11acf59209fb410e7c1a30f975c937
    Image:         10.172.49.246/zongbu-sre/alpine-arm64:3.17.0
    Image ID:      10.172.49.246/zongbu-sre/alpine-arm64@sha256:af06af3514c44a964d3b905b498cf6493db8f1cde7c10e078213a89c87308ba0
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      set -ex;ordinal=`echo ${POD_NAME} | awk -F- '{print $NF}'`;sed s/POD_NUM/${ordinal}/g /etc/tiflash/config_templ.toml > /data0/config.toml;sed s/POD_NUM/${ordinal}/g /etc/tiflash/proxy_templ.toml > /data0/proxy.toml
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 30 Jan 2023 11:26:57 +0800
      Finished:     Mon, 30 Jan 2023 11:26:57 +0800
    Ready:          True
    Restart Count:  0
    Environment:
      POD_NAME:  tidb-test-cluster-tiflash-2 (v1:metadata.name)
    Mounts:
      /data0 from data0 (rw)
      /etc/tiflash from config (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ccks8 (ro)
Containers:
  tiflash:
    Container ID:  containerd://4f01a3705a5e355f152addc929e5d71b8608212dedc2a60845c8166ab24358a7
    Image:         10.172.49.246/zongbu-sre/tiflash-arm64:v6.1.2
    Image ID:      10.172.49.246/zongbu-sre/tiflash-arm64@sha256:96f39d55b339c1b9e61f09fe8c8d6e0ef69add8557a0cf77a4b340d561f8c0aa
    Ports:         3930/TCP, 20170/TCP, 9000/TCP, 8123/TCP, 9009/TCP, 8234/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
    Command:
      /bin/sh
      -c
      /tiflash/tiflash server --config-file /data0/config.toml
    State:          Running
      Started:      Mon, 30 Jan 2023 14:57:11 +0800
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 30 Jan 2023 14:51:26 +0800
      Finished:     Mon, 30 Jan 2023 14:52:03 +0800
    Ready:          True
    Restart Count:  25
    Limits:
      cpu:     12
      memory:  16Gi
    Requests:
      cpu:     12
      memory:  16Gi
    Environment:
      NAMESPACE:              default (v1:metadata.namespace)
      CLUSTER_NAME:           tidb-test-cluster
      HEADLESS_SERVICE_NAME:  tidb-test-cluster-tiflash-peer
      CAPACITY:               0
      TZ:                     UTC
    Mounts:
      /data0 from data0 (rw)
      /etc/podinfo from annotations (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ccks8 (ro)
  serverlog:
    Container ID:  containerd://3b48f767a9a2fcb893ce14902a2f822d2dc287cc2760fed926f9b4c202347eb6
    Image:         10.172.49.246/zongbu-sre/alpine-arm64:3.17.0
    Image ID:      10.172.49.246/zongbu-sre/alpine-arm64@sha256:af06af3514c44a964d3b905b498cf6493db8f1cde7c10e078213a89c87308ba0
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      touch /data0/logs/server.log; tail -n0 -F /data0/logs/server.log;
    State:          Running
      Started:      Mon, 30 Jan 2023 11:27:07 +0800
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /data0 from data0 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ccks8 (ro)
  errorlog:
    Container ID:  containerd://072de3619af7b33017fae83e57253b56b464eb6b1d8e45d4fc95838bcd624cbb
    Image:         10.172.49.246/zongbu-sre/alpine-arm64:3.17.0
    Image ID:      10.172.49.246/zongbu-sre/alpine-arm64@sha256:af06af3514c44a964d3b905b498cf6493db8f1cde7c10e078213a89c87308ba0
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      touch /data0/logs/error.log; tail -n0 -F /data0/logs/error.log;
    State:          Running
      Started:      Mon, 30 Jan 2023 11:27:07 +0800
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /data0 from data0 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ccks8 (ro)
  clusterlog:
    Container ID:  containerd://3c9fa1dc607d73c78ab72bd380c5bf6121cde586875c4963b41e9d4293579685
    Image:         10.172.49.246/zongbu-sre/alpine-arm64:3.17.0
    Image ID:      10.172.49.246/zongbu-sre/alpine-arm64@sha256:af06af3514c44a964d3b905b498cf6493db8f1cde7c10e078213a89c87308ba0
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      touch /data0/logs/flash_cluster_manager.log; tail -n0 -F /data0/logs/flash_cluster_manager.log;
    State:          Running
      Started:      Mon, 30 Jan 2023 11:27:07 +0800
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /data0 from data0 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ccks8 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  data0:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data0-tidb-test-cluster-tiflash-2
    ReadOnly:   false
  annotations:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.annotations -> annotations
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      tidb-test-cluster-tiflash-3336363
    Optional:  false
  kube-api-access-ccks8:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                   From     Message
  ----     ------   ----                  ----     -------
  Normal   Created  84m (x14 over 3h45m)  kubelet  Created container tiflash
  Normal   Started  84m (x14 over 3h45m)  kubelet  Started container tiflash
  Warning  BackOff  20m (x307 over 150m)  kubelet  Back-off restarting failed container
  Normal   Pulled   15m (x25 over 3h9m)   kubelet  Container image "10.172.49.246/zongbu-sre/tiflash-arm64:v6.1.2" already present on machine
| username: h5n1 | Original post link

After things stabilized I reran the TPC-H test and saw the pods restart due to OOM again. The initial restarts were most likely also caused by OOM, but it’s unclear why there was such a long CrashLoopBackOff period in between. The pattern was: after a long stretch of crash looping, the pods would run for a few minutes and then enter another crash loop.

tidb-test-cluster-tiflash-2                    3/4     OOMKilled   25 (24m ago)   3h49m
tidb-test-cluster-tiflash-2                    4/4     Running     26 (2s ago)    3h49m
tidb-test-cluster-tiflash-0                    3/4     OOMKilled   26 (19m ago)   3h50m
tidb-test-cluster-tiflash-0                    4/4     Running     27 (1s ago)    3h50m
tidb-test-cluster-tiflash-1                    3/4     OOMKilled   27 (22m ago)   3h53m
| username: TiDBer_jYQINSnf | Original post link

It was OOM killed; increase the memory limit.
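
For example, the change in the TidbCluster spec could look roughly like this (a sketch only; the 32Gi figure is illustrative, size it to what your nodes can actually provide, and keep requests equal to limits):

  tiflash:
    replicas: 3
    # Illustrative values only: raise both requests and limits, e.g. from 16Gi to 32Gi
    requests:
      cpu: 12000m
      memory: 32Gi
    limits:
      cpu: 12000m
      memory: 32Gi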

| username: TiDBer_jYQINSnf | Original post link

You need to check the TiFlash logs for this. The logs you posted don’t contain relevant information. You can try increasing the memory first or limiting TiFlash’s memory usage.
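
If giving it more memory is not an option, TiFlash also has query memory caps under [profiles.default] in tiflash.toml, which in the CR would go under spec.tiflash.config.config. A rough sketch, assuming the operator 1.4 config layout; the parameter names and byte values are from memory, so verify them against the TiFlash v6.1 configuration docs before applying (they cap intermediate data generated by queries, not the total process memory):

  tiflash:
    config:
      config: |
        [profiles.default]
        # Illustrative byte values; check the parameter names/semantics for v6.1.
        # Cap on intermediate data of a single query (~10 GiB):
        max_memory_usage = 10737418240
        # Cap on intermediate data of all queries combined (~12 GiB):
        max_memory_usage_for_all_queries = 12884901888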

| username: h5n1 | Original post link

I increased the memory and there are no more OOMs; everything runs normally now. I still don’t know why it kept crash looping before.

| username: TiDBer_jYQINSnf | Original post link

I have a 2 TB TiFlash node that failed and couldn’t start again after a restart. The logs show it hits a fatal error during startup. It still isn’t resolved, so I temporarily added another TiFlash node as a workaround.

I don’t know much about TiFlash’s code, but if it keeps crash looping, you can check the logs for any clues.

| username: h5n1 | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.