TiFlash restarted while running and cannot start again

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tiflash 运行中重启 启动不了

| username: TiDBer_jYQINSnf

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version]
v5.4.0, deployed on Kubernetes
[Reproduction Path] What operations were performed to cause the issue
TiFlash was running, a table was dropped, and TiFlash latency became very high. TiFlash was then forcibly restarted (by deleting the pod so that it restarted in place).
[Encountered Issue: Symptoms and Impact]
TiFlash kept crashing on startup.
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]

[2023/01/17 12:41:44.674 +00:00] [FATAL] [lib.rs:463] [“[region 143590] 6322160 applying snapshot failed”] [backtrace=“stack backtrace:\n 0: tikv_util::set_panic_hook::{{closure}}\n 1: std::panicking::rust_panic_with_hook\n at library/std/src/panicking.rs:595\n 2: std::panicking::begin_panic_handler::{{closure}}\n at library/std/src/panicking.rs:497\n 3: std::sys_common::backtrace::__rust_end_short_backtrace\n at library/std/src/sys_common/backtrace.rs:141\n 4: rust_begin_unwind\n at library/std/src/panicking.rs:493\n 5: std::panicking::begin_panic_fmt\n at library/std/src/panicking.rs:435\n 6: raftstore::store::peer_storage::PeerStorage<EK,ER>::check_applying_snap\n 7: raftstore::store::peer::Peer<EK,ER>::handle_raft_ready_append\n 8: <raftstore::store::fsm::store::RaftPoller<EK,ER,T> as batch_system::batch::PollHandler<raftstore::store::fsm::peer::PeerFsm<EK,ER>,raftstore::store::fsm::store::StoreFsm>>::handle_normal\n 9: batch_system::batch::Poller<N,C,Handler>::poll\n 10: std::sys_common::backtrace::__rust_begin_short_backtrace\n 11: core::ops::function::FnOnce::call_once{{vtable.shim}}\n 12: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce>::call_once\n at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/alloc/src/boxed.rs:1546\n <alloc::boxed::Box<F,A> as core::ops::function::FnOnce>::call_once\n at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/alloc/src/boxed.rs:1546\n std::sys::unix::thread::thread::new::thread_start\n at library/std/src/sys/unix/thread.rs:71\n 13: start_thread\n 14: clone\n”] [location=/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/contrib/tiflash-proxy/components/raftstore/src/store/peer_storage.rs:1408] [thread_name=raftstore-0]

Additionally, I want to read the TiFlash code, but I can't find the Rust part. Can anyone give a brief introduction to how the TiFlash code is organized, e.g. which component lives in which repository?
Thanks a lot!

| username: AnotherCalvinNeo | Original post link

The Rust part is GitHub - pingcap/tidb-engine-ext (a TiKV based `c dynamic library` for extending storage system in TiDB cluster), which is a modified TiKV. Our previous articles have introduced the TiFlash proxy.

Your issue is likely in the TiFlash proxy, i.e. this modified TiKV failed to apply a snapshot. It could be caused by a lost snapshot file, storage corruption, etc.
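To see how PD views this region's replicas, PD's HTTP API exposes the region's current peers (`pd-ctl region 143590` shows the same information). A minimal sketch, assuming Python with `requests` and a placeholder PD address:

```python
# Sketch only: the PD address is a placeholder; the region id is the one
# from the FATAL log above.
import requests

PD_ADDR = "http://pd-host:2379"
REGION_ID = 143590

resp = requests.get(f"{PD_ADDR}/pd/api/v1/region/id/{REGION_ID}", timeout=5)
resp.raise_for_status()
region = resp.json()

# List the stores that PD believes hold a peer of this region.
# If the crashing TiFlash store is not among them, PD no longer
# expects a replica of this region on that store.
for peer in region.get("peers", []):
    print("peer", peer.get("id"), "on store", peer.get("store_id"))
```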

| username: TiDBer_jYQINSnf | Original post link

From PD's view, the region 143590 that reports the error has no replica on this store. How can this be recovered?

| username: WalterWj | Original post link

Drop the table's TiFlash replica >> scale TiFlash in normally >> scale it out normally >> if there are no issues, create the required table replicas again (roughly as sketched below).
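A minimal sketch of the SQL side of this procedure, assuming Python with `pymysql`, a placeholder TiDB endpoint, and a stand-in table name `db1.big_table`; scaling TiFlash in and back out between the two ALTER statements happens outside SQL (for example by editing the TidbCluster spec on Kubernetes):

```python
# Sketch only: the host, credentials, and table name below are placeholders.
import pymysql

conn = pymysql.connect(host="tidb-host", port=4000, user="root", password="")
try:
    with conn.cursor() as cur:
        # 1. Drop the table's TiFlash replica before scaling TiFlash in.
        cur.execute("ALTER TABLE db1.big_table SET TIFLASH REPLICA 0")

        # ... scale TiFlash in, then back out (outside SQL), then: ...

        # 2. Recreate the replica once the new TiFlash store is up.
        cur.execute("ALTER TABLE db1.big_table SET TIFLASH REPLICA 1")

        # 3. Watch replication progress until AVAILABLE becomes 1.
        cur.execute(
            "SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS "
            "FROM information_schema.tiflash_replica "
            "WHERE TABLE_SCHEMA = %s AND TABLE_NAME = %s",
            ("db1", "big_table"),
        )
        print(cur.fetchall())
finally:
    conn.close()
```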

| username: TiDBer_jYQINSnf | Original post link

That's too slow. With several terabytes of data, you can't just delete and re-add replicas at will. Besides, TiFlash only restarted and couldn't come back up; that's a bit fragile.

| username: 会飞的土拨鼠 | Original post link

You can stop TiFlash first, and then start TiFlash again.

| username: TiDBer_jYQINSnf | Original post link

Once it stopped, it couldn’t be started again.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.