TiKV triggers RocksDB background error panic and cannot restart

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv 触发 rocksdb background 错误 panic 无法重启

| username: wenlive

[TiDB Usage Environment] Production Environment
[TiDB Version] TiKV / PD 5.2.0
[Encountered Problem] TiKV triggered a RocksDB background error panic and cannot restart
[Reproduction Path] From the logs: unsafe destroy range → compaction → panic
[Problem Phenomenon and Impact] TiKV cannot restart; the corrupted node can only be discarded by scaling it in and then scaling out a replacement
[Attachment]

Please provide the version information of each component, such as cdc/tikv, which can be obtained by executing cdc version/tikv-server --version.

[2022/09/26 23:47:30.393 +08:00] [INFO] [gc_worker.rs:389] ["unsafe destroy range started"] [end_key=6D757369632D6B76FF2D7072645F333635FF33363A3030303A44FF3AC9074B08C73145FFD5A393C5D48E442AFFCE00000000000000F8] [start_key=6D757369632D6B76FF2D7072645F333635FF33363A3030303A44FF3AC9074B08C73145FFD5A393C5D48E442AFFCD00000000000000F8]
[2022/09/26 23:47:30.396 +08:00] [INFO] [gc_worker.rs:420] ["unsafe destroy range finished deleting files in range"] [cost_time=2.414008ms] [end_key=6D757369632D6B76FF2D7072645F333635FF33363A3030303A44FF3AC9074B08C73145FFD5A393C5D48E442AFFCE00000000000000F8] [start_key=6D757369632D6B76FF2D7072645F333635FF33363A3030303A44FF3AC9074B08C73145FFD5A393C5D48E442AFFCD00000000000000F8]
[2022/09/26 23:47:30.400 +08:00] [INFO] [gc_worker.rs:454] ["unsafe destroy range finished cleaning up all"] [cost_time=4.394487ms] [end_key=6D757369632D6B76FF2D7072645F333635FF33363A3030303A44FF3AC9074B08C73145FFD5A393C5D48E442AFFCE00000000000000F8] [start_key=6D757369632D6B76FF2D7072645F333635FF33363A3030303A44FF3AC9074B08C73145FFD5A393C5D48E442AFFCD00000000000000F8]
[2022/09/26 23:47:32.247 +08:00] [INFO] [compaction_filter.rs:483] ["Compaction filter reports"] [filtered=204674] [total=1376419]
[2022/09/26 23:47:38.931 +08:00] [INFO] [compaction_filter.rs:483] ["Compaction filter reports"] [filtered=438497] [total=1990713]
[2022/09/26 23:47:43.356 +08:00] [INFO] [compaction_filter.rs:483] ["Compaction filter reports"] [filtered=417840] [total=1878614]
[2022/09/26 23:48:14.374 +08:00] [FATAL] [lib.rs:465] ["rocksdb background error. db: kv, reason: compaction, error: Corruption: block checksum mismatch: expected 1704905625, got 1445134835 in /data/tidb/tikv/11161/tikv/data/db/5989462.sst offset 3196554 size 18449"] [backtrace="stack backtrace:
0: tikv_util::set_panic_hook::{{closure}}
at components/tikv_util/src/lib.rs:464
1: std::panicking::rust_panic_with_hook
at library/std/src/panicking.rs:626
2: std::panicking::begin_panic_handler::{{closure}}
at library/std/src/panicking.rs:519
3: std::sys_common::backtrace::__rust_end_short_backtrace
at library/std/src/sys_common/backtrace.rs:141
4: rust_begin_unwind
at library/std/src/panicking.rs:515
5: std::panicking::begin_panic_fmt
at library/std/src/panicking.rs:457
6: <engine_rocks::event_listener::RocksEventListener as rocksdb::event_listener::EventListener>::on_background_error
7: rocksdb::event_listener::on_background_error
at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/4e912a8/src/event_listener.rs:340
8: _ZN24crocksdb_eventlistener_t17OnBackgroundErrorEN7rocksdb21BackgroundErrorReasonEPNS0_6StatusE
at crocksdb/c.cc:2352
9: _ZN7rocksdb7titandb11TitanDBImpl10SetBGErrorERKNS_6StatusE
at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/4e912a8/librocksdb_sys/libtitan_sys/titan/src/db_impl.cc:1447
10: _ZN7rocksdb7titandb11TitanDBImpl12BackgroundGCEPNS_9LogBufferEj
at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/4e912a8/librocksdb_sys/libtitan_sys/titan/src/db_impl_gc.cc:236
11: _ZN7rocksdb7titandb11TitanDBImpl16BackgroundCallGCEv
at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/4e912a8/librocksdb_sys/libtitan_sys/titan/src/db_impl_gc.cc:136
12: _ZNKSt8functionIFvvEEclEv
at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/std_function.h:687
_ZN7rocksdb14ThreadPoolImpl4Impl8BGThreadEm
at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/4e912a8/librocksdb_sys/rocksdb/util/threadpool_imp.cc:266
13: _ZN7rocksdb14ThreadPoolImpl4Impl15BGThreadWrapperEPv
at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/4e912a8/librocksdb_sys/rocksdb/util/threadpool_imp.cc:307
14: execute_native_thread_routine
15: start_thread
16: __clone
"] [location=components/engine_rocks/src/event_listener.rs:108] [thread_name=]
[2022/09/26 23:48:30.808 +08:00] [INFO] [lib.rs:80] ["Welcome to TiKV"]
[2022/09/26 23:48:30.808 +08:00] [INFO] [lib.rs:85] ["Release Version: 5.2.0"]

| username: wenlive | Original post link

Are there any other solutions besides scaling the node in and out? And how can this be avoided? I'm currently not clear on what triggers it.

| username: h5n1 | Original post link

Judging by the error, an SST file is corrupted. The basic handling approach, based on the description, is as follows (a rough command sketch follows the list):

  1. Use tikv-ctl bad-ssts to scan the TiKV store and find the corrupted SST files.
  2. Confirm the affected region from the scan output and delete the corrupted SST files.
  3. Delete the region peer covered by the corrupted SST files.
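
A rough sketch of those three steps as commands, assuming the failed instance is stopped and reusing the paths from the panic log above; the flag names (--db vs --data-dir), the PD endpoint, and the region/store IDs are placeholders that vary by version and deployment, so treat this as an illustration rather than exact syntax:

# 1. Scan the stopped store's RocksDB directory for corrupted SST files.
./tikv-ctl bad-ssts --pd <pd_endpoint> --db /data/tidb/tikv/11161/tikv/data/db

# 2. Move each corrupted SST reported by the scan (e.g. 5989462.sst from the panic)
#    out of the data directory, or use the removal operation suggested in the scan output.
mv /data/tidb/tikv/11161/tikv/data/db/5989462.sst /path/to/backup/

# 3. Tombstone the affected peer on this store so it is rebuilt from healthy replicas,
#    or remove the peer from the PD side with pd-ctl.
./tikv-ctl --db /data/tidb/tikv/11161/tikv/data/db tombstone -p <pd_endpoint> -r <region_id>
pd-ctl -u http://<pd_endpoint> operator add remove-peer <region_id> <store_id>
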
| username: wenlive | Original post link

In the documentation, the --db parameter of tikv-ctl bad-ssts has been replaced with --data-dir.

In actual usage, it is still --db, and the output is:

./tikv-ctl bad-ssts --pd <pd> --db ../data/db/
[2022/09/27 14:17:28.303 +08:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=10.193.204.145:12389]
[2022/09/27 14:17:28.303 +08:00] [INFO] [<unknown>] ["Disabling AF_INET6 sockets because ::1 is not available."]
[2022/09/27 14:17:28.303 +08:00] [INFO] [<unknown>] ["TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter"]
[2022/09/27 14:17:28.304 +08:00] [INFO] [<unknown>] ["New connected subchannel at 0x7fcf0182e1b0 for subchannel 0x7fcf04612ec0"]
[2022/09/27 14:17:28.304 +08:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=http://10.193.73.145:12387]
[2022/09/27 14:17:28.305 +08:00] [INFO] [<unknown>] ["New connected subchannel at 0x7fcf0182e2d0 for subchannel 0x7fcf04613080"]
[2022/09/27 14:17:28.305 +08:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=http://10.193.204.145:12389]
[2022/09/27 14:17:28.306 +08:00] [INFO] [<unknown>] ["New connected subchannel at 0x7fcf0182e3f0 for subchannel 0x7fcf04612ec0"]
[2022/09/27 14:17:28.306 +08:00] [INFO] [util.rs:668] ["connected to PD member"] [endpoints=http://10.193.204.145:12389]
[2022/09/27 14:17:28.306 +08:00] [INFO] [util.rs:536] ["all PD endpoints are consistent"] [endpoints="[\"10.193.204.145:12389\"]"]
--------------------------------------------------------
corruption analysis has completed

No useful information was obtained.

| username: h5n1 | Original post link

Didn’t you already scale in the damaged TiKV?

| username: wenlive | Original post link

The files have not been deleted, so it can still be checked. But doesn’t the bad-ssts check also require the currently running TiKV instance to be shut down?

Additionally, the tikv-ctl in the release-5.2 branch code is inconsistent with the documentation (TiKV Control User Guide | PingCAP Docs).
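
For reference, the two invocation forms discussed above differ only in the data-path flag; which one a given binary accepts depends on the release, so both are shown here as assumptions to try against the stopped instance rather than as definitive syntax:

# Form accepted by the 5.2.0 binary in this thread:
./tikv-ctl bad-ssts --pd <pd_endpoint> --db /data/tidb/tikv/11161/tikv/data/db
# Form shown in the TiKV Control documentation for later releases:
./tikv-ctl bad-ssts --pd <pd_endpoint> --data-dir /data/tidb/tikv/11161/tikv/data
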

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.