TiKV reload reports metric tikv_raftstore_region_count{type="leader"} not found

translator_bot · June 23, 2024, 6:41am

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv reload报 metric tikv_raftstore_region_count{type=“leader”} not found

| username: h5n1

[Version] v6.1.0 ARM
raft-engine.dir: /raft/raftdb
[Phenomenon] After adding a label to a TiKV store with pd-ctl store label and running reload -R tikv, one TiKV node reported:
Restart instance xxx130.96:20160 success

Error: failed to evict store leader xxx130.97: metric tikv_raftstore_region_count{type="leader"} not found
[Inspection]

Checked the region information on tikv, all are 0

Checked tikv status

Found errors in tikv.log
[FATAL] [lib.rs:491] ["Failed to reserve space for recovery: Structure needs cleaning (os error 117)."]
Checked file system, parameters, placeholder files
Tried restarting tikv, still reported the above error
[2022/08/17 15:39:27.324 +08:00] [INFO] [config.rs:891] ["data dir"] [mount_fs="FsInfo { tp: "ext4", opts: "rw,noatime,nodelalloc,stripe=64", mnt_dir: "/data", fsname: "/dev/mapper/datavg-lv_data" }"] [data_path=/data/tikv/tikv-20160/raft]

[2022/08/17 15:33:40.836 +08:00] [WARN] [server.rs:457] ["failed to remove space holder on starting: No such file or directory (os error 2)"]

[2022/08/17 15:33:44.895 +08:00] [FATAL] [lib.rs:491] ["Failed to reserve space for recovery: Structure needs cleaning (os error 117)."] [backtrace=" 0: tikv_util::set_panic_hook::{{closure}}
at /var/lib/docker/jenkins/workspace/build-common@3/go/src/github.com/pingcap/tikv/components/tikv_util/src/lib.rs:490:18
1: std::panicking::rust_panic_with_hook
at /root/.rustup/toolchains/nightly-2022-02-14-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:702:17
2: std::panicking::begin_panic_handler::{{closure}}
at /root/.rustup/toolchains/nightly-2022-02-14-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:588:13
3: std::sys_common::backtrace::__rust_end_short_backtrace
at /root/.rustup/toolchains/nightly-2022-02-14-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys_common/backtrace.rs:138:18
4: rust_begin_unwind
at /root/.rustup/toolchains/nightly-2022-02-14-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:584:5
5: core::panicking::panic_fmt
at /root/.rustup/toolchains/nightly-2022-02-14-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/panicking.rs:143:14
6: server::server::TiKvServer::init_fs::{{closure}}
at /var/lib/docker/jenkins/workspace/build-common@3/go/src/github.com/pingcap/tikv/components/server/src/server.rs:467:26
core::result::Result<T,E>::map_err
at /root/.rustup/toolchains/nightly-2022-02-14-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/result.rs:842:27
server::server::TiKvServer::init_fs
at /var/lib/docker/jenkins/workspace/build-common@3/go/src/github.com/pingcap/tikv/components/server/src/server.rs:463:13
7: server::server::run_impl
at /var/lib/docker/jenkins/workspace/build-common@3/go/src/github.com/pingcap/tikv/components/server/src/server.rs:124:5
server::server::run_tikv
at /var/lib/docker/jenkins/workspace/build-common@3/go/src/github.com/pingcap/tikv/components/server/src/server.rs:163:5
8: tikv_server::main
at /var/lib/docker/jenkins/workspace/build-common@3/go/src/github.com/pingcap/tikv/cmd/tikv-server/src/main.rs:189:5
9: core::ops::function::FnOnce::call_once
at /root/.rustup/toolchains/nightly-2022-02-14-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:227:5
std::sys_common::backtrace::__rust_begin_short_backtrace
at /root/.rustup/toolchains/nightly-2022-02-14-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys_common/backtrace.rs:122:18
10: main
11: __libc_start_main
12:
"] [location=components/server/src/server.rs:467] [thread_name=main]

[Questions]

The TiKV node is already down, so why does reload still check the leader count through the metric tikv_raftstore_region_count{type="leader"}? Can this check be skipped or handled in another way? (This TiKV may have had problems since deployment that went unnoticed at the time.)
In an emergency, the space holder file can be deleted to free up disk space. Here, the restart fails with ["Failed to reserve space for recovery: Structure needs cleaning (os error 117)."]. OS error code 117 means "Structure needs cleaning". Which structure needs cleaning? The placeholder file was never deleted manually.

translator_bot · June 23, 2024, 6:41am

| username: TiDBer_jYQINSnf

I looked through the code: this happens during startup, when TiKV tries to create the placeholder file and the creation fails.

The operating system returned error 117, which is not an internal TiKV error.
I searched online for error 117, and some feedback suggests checking the file system.
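
For reference, the numeric code in the log can be decoded on the machine itself. The snippet below is only a minimal sketch using Python's standard errno and os modules; the helper name describe_os_error is made up for illustration. On Linux, 117 corresponds to EUCLEAN, the error number ext4 returns when it detects on-disk corruption, while 2 is ENOENT.

```python
import errno
import os
from typing import Tuple


def describe_os_error(code: int) -> Tuple[str, str]:
    """Return (symbolic name, message) for an OS error number on this platform.

    describe_os_error is a hypothetical helper, not part of TiKV or tiup.
    """
    # errno.errorcode maps numbers to names such as "ENOENT"; unknown codes
    # fall back to "UNKNOWN" because the table is platform dependent.
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # 2 and 117 are the two codes that appear in the tikv.log excerpts.
    for code in (2, 117):
        name, message = describe_os_error(code)
        print(f"os error {code}: {name}: {message}")
```

Because the number comes straight from the kernel, the message points at the file system rather than at TiKV.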

Given that your environment is ARM, various issues are not surprising.

translator_bot · June 23, 2024, 6:41am

| username: h5n1

I checked the dmesg logs:

[1671835.208992] EXT4-fs error (device dm-4): ext4_validate_block_bitmap:384: comm rocksdb:low3: bg 66912: bad block bitmap checksum
[1671864.370990] EXT4-fs error (device dm-4): ext4_validate_block_bitmap:384: comm rocksdb:low3: bg 66928: bad block bitmap checksum
[1671888.674507] EXT4-fs error (device dm-4): ext4_validate_block_bitmap:384: comm rocksdb:low1: bg 66944: bad block bitmap checksum
[1671895.263991] EXT4-fs error (device dm-4): ext4_validate_block_bitmap:384: comm rocksdb:low1: bg 66960: bad block bitmap checksum
[1671903.547248] EXT4-fs error (device dm-4): ext4_validate_block_bitmap:384: comm rocksdb:low1: bg 66976: bad block bitmap checksum
[1671910.771744] EXT4-fs error (device dm-4): ext4_validate_block_bitmap:384: comm rocksdb:low1: bg 66992: bad block bitmap checksum
[1672003.968794] EXT4-fs error (device dm-4): ext4_validate_block_bitmap:384: comm apply-1: bg 67008: bad block bitmap checksum
[1672063.705650] EXT4-fs error (device dm-4): ext4_validate_block_bitmap:384: comm apply-1: bg 67024: bad block bitmap checksum
[1672087.532957] EXT4-fs error (device dm-4): ext4_validate_block_bitmap:384: comm rocksdb:low3: bg 67040: bad block bitmap checksum
[1672087.550289] EXT4-fs error (device dm-4): ext4_validate_block_bitmap:384: comm rocksdb:low3: bg 67047: bad block bitmap checksum
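
To see at a glance which device and block groups are affected, lines like the ones above can be grouped with a small script. This is only a sketch: the regular expression assumes the exact line layout shown in this post, and the function name summarize_ext4_errors is hypothetical.

```python
import re
from collections import Counter

# Pattern for the "EXT4-fs error" lines as they appear in this thread.
# Other kernel versions may format these messages differently (assumption).
EXT4_ERROR = re.compile(
    r"EXT4-fs error \(device (?P<device>[^)]+)\): "
    r"(?P<func>\w+):(?P<line>\d+): "
    r"comm (?P<comm>\S+): bg (?P<bg>\d+): (?P<msg>.+)"
)

# Two of the dmesg lines quoted above, used as sample input.
SAMPLE_LINES = [
    "[1671835.208992] EXT4-fs error (device dm-4): ext4_validate_block_bitmap:384: "
    "comm rocksdb:low3: bg 66912: bad block bitmap checksum",
    "[1672087.550289] EXT4-fs error (device dm-4): ext4_validate_block_bitmap:384: "
    "comm rocksdb:low3: bg 67047: bad block bitmap checksum",
]


def summarize_ext4_errors(lines):
    """Group EXT4-fs error lines by device; lines that do not match are skipped."""
    summary = {}
    for line in lines:
        match = EXT4_ERROR.search(line)
        if match is None:
            continue
        entry = summary.setdefault(
            match["device"],
            {"count": 0, "block_groups": [], "processes": Counter()},
        )
        entry["count"] += 1
        entry["block_groups"].append(int(match["bg"]))
        entry["processes"][match["comm"]] += 1
    return summary


if __name__ == "__main__":
    for device, entry in summarize_ext4_errors(SAMPLE_LINES).items():
        groups = entry["block_groups"]
        print(device, entry["count"], min(groups), max(groups), dict(entry["processes"]))
```

Feeding it the full dmesg output instead of SAMPLE_LINES would show whether the errors stay on one device and whether the affected block groups keep growing.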

Now it seems to be mysteriously working again.

translator_bot · June 23, 2024, 6:41am

| username: TiDBer_jYQINSnf

It feels like the disk is faulty.

translator_bot · June 23, 2024, 6:41am

| username: jansu-dev

Has it returned to normal? Are all logs and statuses normal?
The TiKV node is already down, so why does reload still check the leader count through the metric tikv_raftstore_region_count{type="leader"}? Can this check be skipped or handled in another way? (This TiKV may have had problems since deployment that went unnoticed at the time.)
→ The error "failed to evict store leader xxx130.97: metric tikv_raftstore_region_count{type="leader"} not found" is likely reported by tiup. It calls GetCurrentStore and then evicts the leader without checking whether the store is down. That is reasonable, though: if the store is down, it should be fixed first to avoid undefined behavior. The store's leader count is read by the genLeaderCounter function, which looks at the monitoring metrics. If the store is down, the monitoring information cannot be fetched (curl fails), which results in this error.
a. If you just want the reload to succeed, you probably need to fix why the store is down first.
b. If you just want to persist the new parameters to the toml file, the goal should have been achieved.
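
To illustrate why a down store produces "not found", here is a rough sketch of looking up one sample in Prometheus-style metrics text. It is not tiup's actual code: find_metric_value is a hypothetical helper, the sample values are invented, and the parsing covers only simple name{label="value"} lines.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)"
)
# One label pair inside the braces, e.g. type="leader".
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def find_metric_value(metrics_text, name, labels):
    """Return the first sample matching name and labels, else raise LookupError."""
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE.match(line)
        if match is None or match["name"] != name:
            continue
        found = dict(LABEL.findall(match["labels"] or ""))
        if all(found.get(key) == value for key, value in labels.items()):
            return float(match["value"])
    wanted = ",".join(f'{key}="{value}"' for key, value in labels.items())
    raise LookupError(f"metric {name}{{{wanted}}} not found")


if __name__ == "__main__":
    # Invented sample text; a real TiKV status endpoint returns far more lines.
    healthy = (
        "# TYPE tikv_raftstore_region_count gauge\n"
        'tikv_raftstore_region_count{type="region"} 12\n'
        'tikv_raftstore_region_count{type="leader"} 5\n'
    )
    print(find_metric_value(healthy, "tikv_raftstore_region_count", {"type": "leader"}))
    try:
        # A store that is down serves no samples, so the lookup fails.
        find_metric_value("", "tikv_raftstore_region_count", {"type": "leader"})
    except LookupError as err:
        print(err)
```

With no samples to read, the only possible outcome is the lookup error, which matches the message seen during reload.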
In an emergency, the space holder file can be deleted to free up disk space. Here, the restart fails with ["Failed to reserve space for recovery: Structure needs cleaning (os error 117)."]. OS error code 117 means "Structure needs cleaning". Which structure needs cleaning? The placeholder file was never deleted manually.

→ Looking at the detection logic, if the space holder file exists, it is deleted first, and the subsequent steps follow. The message "failed to remove space holder on starting: No such file or directory (os error 2)" comes from the remove_file function in the Rust standard library (fs.rs). A workaround is to delete the space holder file and recreate one with the same name; after a restart, the file is automatically recreated at the corresponding size. However, a missing space holder file should not by itself cause the TiKV server to go down.
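
The remove-then-reserve sequence described above can be sketched roughly as follows. This is not TiKV's code (TiKV is written in Rust): the file name, the function name reset_space_placeholder and the sizes are all assumptions made for illustration.

```python
import os

# Hypothetical file name; the real name used by TiKV is not given in this thread.
PLACEHOLDER_NAME = "space_placeholder_file"


def reset_space_placeholder(data_dir, reserve_bytes):
    """Remove an old placeholder if present, then reserve reserve_bytes again.

    Returns True if an old placeholder was removed, False if none existed.
    """
    path = os.path.join(data_dir, PLACEHOLDER_NAME)
    try:
        os.remove(path)
        removed = True
    except FileNotFoundError:
        # Same condition as "No such file or directory (os error 2)":
        # nothing to remove, so carry on instead of failing.
        removed = False

    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        if reserve_bytes > 0:
            if hasattr(os, "posix_fallocate"):
                # Asks the file system to allocate the blocks. On a corrupted
                # file system this is where an OSError would surface.
                os.posix_fallocate(fd, 0, reserve_bytes)
            else:
                # Fallback that only sets the size and does not reserve blocks.
                os.ftruncate(fd, reserve_bytes)
    finally:
        os.close(fd)
    return removed


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as data_dir:
        print(reset_space_placeholder(data_dir, 4096))
        print(reset_space_placeholder(data_dir, 4096))
```

In this sketch the missing file is tolerated, while any error from the allocation step propagates to the caller. That mirrors the log above, where the failed removal is only a WARN and the failed reservation is FATAL.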
