TiFlash fails to start on Kubernetes

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: Tiflash在k8s上启动失败

| username: Hacker_Oka5avP2

[TiDB Usage Environment] Test/PoC
[TiDB Version] 6.1.0
[Reproduction Path]
Add a TiFlash node to the existing TiDB cluster through the operator
[Encountered Issue: Symptoms and Impact]
TiFlash failed to start and could not join the cluster
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]
Our TiDB cluster has a password set, but I don't think that should be related to this issue.

Operator config:

  tiflash:
    baseImage: pingcap/tiflash
    maxFailoverCount: 0
    replicas: 1
    storageClaims:
    - resources:
        requests:
          storage: {{ .Values.tiflash.storageSize }}
      storageClassName: shared-ssd-storage
    requests:
      memory: "48Gi"
      cpu:   "32"
    config: 
      config: |
        [flash]
          [flash.flash_cluster]
            log = "/data0/logs/flash_cluster_manager.log"
        [logger]
          count = 10
          level = "information"
          errorlog = "/data0/logs/error.log"
          log = "/data0/logs/server.log"

Server log:

[2022/11/10 08:09:55.756 +00:00] [INFO] [<unknown>] ["Welcome to TiFlash"] [thread_id=1]
[2022/11/10 08:09:55.756 +00:00] [INFO] [<unknown>] ["Starting daemon with revision 54381"] [thread_id=1]
[2022/11/10 08:09:55.756 +00:00] [INFO] [<unknown>] ["TiFlash build info: TiFlash\nRelease Version: v6.1.0\nEdition:         Community\nGit Commit Hash: ebf7ce6d9fb4090011876352fe26b89668cbedc4\nGit Branch:      heads/refs/tags/v6.1.0\nUTC Build Time:  2022-06-07 11:55:49\nEnable Features: jemalloc avx avx512 unwind\nProfile:         RELWITHDEBINFO\n"] [thread_id=1]
[2022/11/10 08:09:55.756 +00:00] [INFO] [<unknown>] ["Application:starting up"] [thread_id=1]
[2022/11/10 08:09:55.757 +00:00] [INFO] [Server.cpp:363] ["Application:Got jemalloc version: 5.2.1-0-gea6b3e973b477b8061e0076bb257dbd7f3faa756"] [thread_id=1]
[2022/11/10 08:09:55.757 +00:00] [INFO] [Server.cpp:372] ["Application:Not found environment variable MALLOC_CONF"] [thread_id=1]
[2022/11/10 08:09:55.757 +00:00] [INFO] [Server.cpp:378] ["Application:Got jemalloc config: opt.background_thread false, opt.max_background_threads 4"] [thread_id=1]
[2022/11/10 08:09:55.757 +00:00] [INFO] [Server.cpp:382] ["Application:Try to use background_thread of jemalloc to handle purging asynchronously"] [thread_id=1]
[2022/11/10 08:09:55.757 +00:00] [INFO] [Server.cpp:385] ["Application:Set jemalloc.max_background_threads 1"] [thread_id=1]
[2022/11/10 08:09:55.757 +00:00] [INFO] [Server.cpp:388] ["Application:Set jemalloc.background_thread true"] [thread_id=1]
[2022/11/10 08:09:55.757 +00:00] [INFO] [Server.cpp:452] ["Application:start raft store proxy"] [thread_id=1]
[2022/11/10 08:09:55.757 +00:00] [INFO] [Server.cpp:1025] ["Application:wait for tiflash proxy initializing"] [thread_id=1]
[2022/11/10 08:09:55.758 +00:00] [ERROR] [BaseDaemon.cpp:377] [BaseDaemon:########################################] [thread_id=3]
[2022/11/10 08:09:55.758 +00:00] [ERROR] [BaseDaemon.cpp:378] ["BaseDaemon:(from thread 2) Received signal Aborted(6)."] [thread_id=3]
[2022/11/10 08:09:57.242 +00:00] [ERROR] [BaseDaemon.cpp:570] ["BaseDaemon:\n       0x1ed2661\tfaultSignalHandler(int, siginfo_t*, void*) [tiflash+32319073]\n                \tlibs/libdaemon/src/BaseDaemon.cpp:221\n  0x7fe276364630\t<unknown symbol> [libpthread.so.0+63024]\n  0x7fe275da7387\tgsignal [libc.so.6+222087]\n  0x7fe275da8a78\t__GI_abort [libc.so.6+227960]\n  0x7fe278d134ea\tstd::sys::unix::abort_internal::haf36307751e7fe01 (.llvm.12916800362124319953) [libtiflash_proxy.so+34161898]\n  0x7fe278d0c6ba\trust_panic [libtiflash_proxy.so+34133690]\n  0x7fe278d0c329\tstd::panicking::rust_panic_with_hook::h9c6cab3d6f14fe6f [libtiflash_proxy.so+34132777]\n  0x7fe278d0bf60\tstd::panicking::begin_panic_handler::_$u7b$$u7b$closure$u7d$$u7d$::h227dbfeee564b658 [libtiflash_proxy.so+34131808]\n  0x7fe278d094e7\tstd::sys_common::backtrace::__rust_end_short_backtrace::h99435fb636049671 (.llvm.12916800362124319953) [libtiflash_proxy.so+34120935]\n  0x7fe278d0bce0\trust_begin_unwind [libtiflash_proxy.so+34131168]\n  0x7fe2779fb054\tcore::panicking::panic_fmt::h05626cefcc91481d [libtiflash_proxy.so+14139476]\n  0x7fe2779faec8\tcore::panicking::panic::h3bcdaac666fd7377 [libtiflash_proxy.so+14139080]\n  0x7fe2794ccc68\ttikv_util::sys::cgroup::build_path::h8c8024b5396924b2 [libtiflash_proxy.so+42261608]\n  0x7fe2794ca19e\ttikv_util::sys::cgroup::CGroupSys::cpuset_cores::h01751adb3bcda1d8 [libtiflash_proxy.so+42250654]\n  0x7fe2794cdec6\ttikv_util::sys::SysQuota::cpu_cores_quota::h8c8d162fbefeb10a [libtiflash_proxy.so+42266310]\n  0x7fe2792b8186\t_$LT$tikv..server..config..Config$u20$as$u20$core..default..Default$GT$::default::hdfa398ba89abecb8 [libtiflash_proxy.so+40079750]\n  0x7fe279444d41\t_$LT$tikv..server..config.._IMPL_DESERIALIZE_FOR_Config..$LT$impl$u20$serde..de..Deserialize$u20$for$u20$tikv..server..config..Config$GT$..deserialize..__Visitor$u20$as$u20$serde..de..Visitor$GT$::visit_map::h5f2ea3cbdf936511 [libtiflash_proxy.so+41704769]\n  0x7fe2793b7448\t_$LT$tikv..config.._IMPL_DESERIALIZE_FOR_TiKvConfig..$LT$impl$u20$serde..de..Deserialize$u20$for$u20$tikv..config..TiKvConfig$GT$..deserialize..__Visitor$u20$as$u20$serde..de..Visitor$GT$::visit_map::h33ab3cb55aa8dbcf [libtiflash_proxy.so+41124936]\n  0x7fe2791219e2\ttikv::config::TiKvConfig::from_file::h576988465b4da7d2 [libtiflash_proxy.so+38414818]\n  0x7fe2784350ba\tserver::proxy::run_proxy::he9e2085225ada8f8 [libtiflash_proxy.so+24862906]\n       0x1d694ff\tDB::RaftStoreProxyRunner::runRaftStoreProxyFFI(void*) [tiflash+30840063]\n                \tdbms/src/Server/Server.cpp:462\n  0x7fe27635cea5\tstart_thread [libpthread.so.0+32421]"] [thread_id=3]

Config in the pod:

default_profile = "default"
display_name = "TiFlash"
http_port = 8123
interserver_http_port = 9009
listen_host = "0.0.0.0"
mark_cache_size = 5368709120
minmax_index_cache_size = 5368709120
path = "/data0/db"
path_realtime_mode = false
tcp_port = 9000
tmp_path = "/data0/tmp"

[application]
  runAsDaemon = true

[flash]
  compact_log_min_period = 200
  overlap_threshold = 0.6
  service_addr = "0.0.0.0:3930"
  tidb_status_addr = "test-tidb.tidbstore.svc:10080"
  [flash.flash_cluster]
    cluster_manager_path = "/tiflash/flash_cluster_manager"
    log = "/data0/logs/flash_cluster_manager.log"
    master_ttl = 60
    refresh_interval = 20
    update_rule_interval = 10
  [flash.proxy]
    addr = "0.0.0.0:20170"
    advertise-addr = "test-tiflash-0.test-tiflash-peer.tidbstore.svc:20170"
    config = "/data0/proxy.toml"
    data-dir = "/data0/proxy"

[logger]
  count = 10
  errorlog = "/data0/logs/error.log"
  level = "information"
  log = "/data0/logs/server.log"
  size = "100M"

[profiles]
  [profiles.default]
    load_balancing = "random"
    max_memory_usage = 10000000000
    use_uncompressed_cache = 0
  [profiles.readonly]
    readonly = 1

[quotas]
  [quotas.default]
    [quotas.default.interval]
      duration = 3600
      errors = 0
      execution_time = 0
      queries = 0
      read_rows = 0
      result_rows = 0

[raft]
  kvstore_path = "/data0/kvstore"
  pd_addr = "test-pd.tidbstore.svc:2379"
  storage_engine = "dt"

[status]
  metrics_port = 8234

[users]
  [users.default]
    password = ""
    profile = "default"
    quota = "default"
    [users.default.networks]
      ip = "::/0"
  [users.readonly]
    password = ""
    profile = "readonly"
    quota = "default"
    [users.readonly.networks]
      ip = "::/0"
| username: Hacker_Oka5avP2 | Original post link

Adding the cluster topology; it looks like it might be a connectivity issue.

test-discovery          ClusterIP   10.96.73.219    <none>        10261/TCP,10262/TCP              72d
test-grafana            NodePort    10.96.15.87     <none>        3000:32255/TCP                   72d
test-monitor-reloader   NodePort    10.96.170.18    <none>        9089:30505/TCP                   72d
test-pd                 ClusterIP   10.96.13.105    <none>        2379/TCP                         72d
test-pd-peer            ClusterIP   None            <none>        2380/TCP                         72d
test-prometheus         NodePort    10.96.250.229   <none>        9090:31017/TCP                   72d
test-tidb               NodePort    10.96.39.181    <none>        4000:30401/TCP,10080:31070/TCP   72d
test-tidb-peer          ClusterIP   None            <none>        10080/TCP                        72d
test-tiflash-peer       ClusterIP   None            <none>        3930/TCP,20170/TCP               134m
test-tikv-peer          ClusterIP   None            <none>        20160/TCP                        72d
| username: Billmay表妹 | Original post link

Could it be an account permissions issue?

| username: Hacker_Oka5avP2 | Original post link

It doesn’t seem like it. I have a small cluster that is running quite well, and since it was deployed using an operator, it shouldn’t need SSH, right?

| username: 数据小黑 | Original post link

Did the issues mentioned in this blog post help you? Could it be a configuration error or resource limitation?

| username: tidb狂热爱好者 | Original post link

I suspect the issue might be an unsupported operating system version: not the recommended CentOS 7, but Red Hat or Alpine instead. Check this error: tiflash 5.4.0 about `GLIBC_PRIVATE' not found · Issue #33266 · pingcap/tidb · GitHub. It might be helpful.

| username: Hacker_Oka5avP2 | Original post link

To be honest, I have considered it, but I am using the official TiFlash image pingcap/tiflash:v6.1.0, so I shouldn’t encounter any operating system issues.

| username: Hacker_Oka5avP2 | Original post link

Thank you for the reply. I have seen this article, but my problem seems to occur much earlier than what they mentioned :joy: Mine obviously crashes during initialization.

| username: Hacker_Oka5avP2 | Original post link

One more data point: I found that the TiFlash process is not actually down, and it keeps one CPU core constantly busy.

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1 99.9  0.8 1482256 822136 ?      Ssl  Nov11 4361:03 /tiflash/tiflash server --config-file /data0/config.toml

Kernel:

[root@test-tiflash-0 /]# uname -a
Linux test-tiflash-0 5.4.0-1088-aws #96~18.04.1-Ubuntu SMP Mon Oct 17 02:57:48 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
| username: Lucien-卢西恩 | Original post link

Hi, do you use TLS? If so, you might want to follow the progress of this issue: TiFlash crash when open tls · Issue #5953 · pingcap/tiflash · GitHub

| username: Hacker_Oka5avP2 | Original post link

Hi, my default configuration should not involve TLS. You can see in the config that http_port is set and there is no security section at all, so I don't think TLS is involved.

Could it be that it can't connect to the TiDB server? I suspect this because we have set a password on TiDB.

| username: Lucien-卢西恩 | Original post link

How did you add the password?

| username: Billmay表妹 | Original post link

I suspect it might be a permissions issue.

| username: Hacker_Oka5avP2 | Original post link

It was set through the MySQL client, using:

SET PASSWORD
| username: Lucien-卢西恩 | Original post link

It doesn’t seem to have anything to do with whether TLS is enabled or not. Looking at the last line of the error, the panic location is tiflash_proxy.so. You can check the TiKV log for any error messages. Also, are both TiKV and TiFlash versions v6.1.0?

| username: Hacker_Oka5avP2 | Original post link

  1. All are v6.1.0 images.
  2. I checked tikv, pd, and tidb, and there were no related error logs during initialization.

However, I did find something new in the tiflash logs that might be useful:

thread '<unnamed>' panicked at 'called `Option::unwrap()` on a `None` value', components/tikv_util/src/sys/cgroup.rs:260:44
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fatal runtime error: failed to initiate panic, error 5

It's surprising that this is calling TiKV's library.
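
For context, here is a minimal Rust sketch of this class of failure. It is only an illustration under my own assumptions, not the actual tikv_util code: lines in /proc/self/cgroup have the form "hierarchy-id:controller-list:path", and on a host exposing a cgroup v2 style entry ("0::/...") the controller list is empty, so a lookup that unwraps the result of a per-controller search panics with exactly this kind of message:

use std::collections::HashMap;
use std::fs;

// Parse /proc/self/cgroup into a controller -> cgroup path map.
// Lines look like "11:cpuset:/kubepods/..." on cgroup v1 hosts, but
// "0::/..." on cgroup v2 (unified) hosts, where no controller names appear.
fn controller_paths() -> HashMap<String, String> {
    let mut map = HashMap::new();
    let content = fs::read_to_string("/proc/self/cgroup").unwrap_or_default();
    for line in content.lines() {
        let mut parts = line.splitn(3, ':');
        let _hierarchy_id = parts.next();
        let controllers = parts.next().unwrap_or("");
        let path = parts.next().unwrap_or("");
        for c in controllers.split(',').filter(|c| !c.is_empty()) {
            map.insert(c.to_string(), path.to_string());
        }
    }
    map
}

fn main() {
    let paths = controller_paths();
    // A lookup that assumes the "cpuset" controller is always present
    // panics with "called `Option::unwrap()` on a `None` value" when it
    // is missing -- the same class of failure as in the log above.
    let cpuset = paths.get("cpuset").unwrap();
    println!("cpuset cgroup path: {}", cpuset);
}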

| username: huangjiaqiang | Original post link

  1. Please run kubectl get po -o wide to check whether the TiKV and TiFlash pods are scheduled on the same nodes.
  2. Also check the OS version of the physical machine behind each pod.

It seems that the cpu_cores_quota function called build_path and panicked while fetching the number of CPU cores, which looks like it might be caused by an unsupported OS.
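
To illustrate why this aborts so early: the backtrace shows the panic is reached while the proxy is still deserializing its config, because the default server config derives its CPU quota from the cgroup. Below is a rough sketch of that shape with simplified, made-up names (an assumption based on the backtrace, not the real tikv/tikv_util APIs):

// Simplified shape of the call chain in the backtrace; all names here
// are placeholders, not the real tikv/tikv_util items.
fn cpuset_cores() -> usize {
    // In the real code this walks /proc/self/cgroup and mountinfo; an
    // unexpected layout makes a lookup return None and unwrap() panic.
    panic!("called `Option::unwrap()` on a `None` value");
}

fn cpu_cores_quota() -> usize {
    cpuset_cores()
}

struct ServerConfig {
    thread_pool_size: usize,
}

impl Default for ServerConfig {
    // Default values are computed from the host CPU quota, which is why
    // the crash happens while proxy.toml is still being parsed, long
    // before the proxy ever tries to reach PD or TiDB.
    fn default() -> Self {
        ServerConfig {
            thread_pool_size: cpu_cores_quota(),
        }
    }
}

fn main() {
    let _cfg = ServerConfig::default(); // aborts here on the affected host
}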

| username: AnotherCalvinNeo | Original post link

TiFlash includes a customized version of TiKV, namely TiFlash Proxy.

| username: Hacker_Oka5avP2 | Original post link

  1. The physical machines used are indeed different. TiKV uses AWS’s i3.4xlarge, and TiFlash, if I remember correctly, uses c6i.8xlarge.
  2. I printed out the Linux versions on both sides:
TiFlash:
    Linux <tiflash> 5.4.0-1088-aws #96~18.04.1-Ubuntu SMP Mon Oct 17 02:57:48 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
TiKV:
    Linux <tikv> 5.4.0-1055-aws #58~18.04.1-Ubuntu SMP Wed Jul 28 03:04:50 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

There is indeed a difference. Could this cause any issues?
By the way, I also checked the TiFlash pod:

Linux test-tiflash-0 5.4.0-1088-aws #96~18.04.1-Ubuntu SMP Mon Oct 17 02:57:48 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Additionally, I checked another cluster with a normally running TiFlash, and the kernel version is 1083:

Linux <tiflash> 5.4.0-1083-aws #90~18.04.1-Ubuntu SMP Fri Aug 5 08:12:44 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
| username: huangjiaqiang | Original post link

  1. Please first change the OS version of the TiFlash node to match TiKV's and see if that works around the issue.
  2. Then we can check whether it is a kernel version problem.