TiDB Rancher Unable to Start

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb rancher无法启动

| username: 小鱼吃大鱼

TiDB is deployed on Rancher and was running normally before, but today TiKV suddenly fails to start, and as a result TiDB cannot start either.

| username: 芮芮是产品 | Original post link

Few people probably know about this. Please post the error logs.
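
A minimal sketch for collecting those logs from a Rancher/Kubernetes deployment, assuming kubectl access to the cluster; the namespace (wanda-db) and pod name (tidb-qas-cluster-tikv-2) are taken from the TiKV log posted below and may differ in your environment:

# check pod status and restart counts in the TiDB namespace
kubectl get pods -n wanda-db
# look at recent events for the crashing TiKV pod
kubectl describe pod tidb-qas-cluster-tikv-2 -n wanda-db
# dump the log of the previous (crashed) container run
kubectl logs tidb-qas-cluster-tikv-2 -n wanda-db --previous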

| username: 小鱼吃大鱼 | Original post link

[2023/11/09 10:07:22.513 +08:00] [WARN] [store.rs:1211] ["set thread priority for raftstore failed"] [error="Os { code: 13, kind: PermissionDenied, message: \"Permission denied\" }"]
[2023/11/09 10:07:22.513 +08:00] [INFO] [node.rs:174] ["put store to PD"] [store="id: 1 address: \"tidb-qas-cluster-tikv-2.tidb-qas-cluster-tikv-peer.wanda-db.svc:20160\" version: \"4.0.4\" status_address: \"0.0.0.0:20180\" git_hash: \"28e3d44b00700137de4fa933066ab83e5f8306cf\" start_timestamp: 1699495634 deploy_path: \"/\""]
[2023/11/09 10:07:22.513 +08:00] [INFO] [mod.rs:335] ["starting working thread"] [worker=cdc]
[2023/11/09 10:07:22.513 +08:00] [INFO] [future.rs:136] ["starting working thread"] [worker=waiter-manager]
[2023/11/09 10:07:22.513 +08:00] [INFO] [future.rs:136] ["starting working thread"] [worker=deadlock-detector]
[2023/11/09 10:07:22.513 +08:00] [INFO] [mod.rs:335] ["starting working thread"] [worker=backup-endpoint]
[2023/11/09 10:07:22.513 +08:00] [INFO] [] ["Failed to add :: listener, the environment may not support IPv6: {\"created\":\"@1699495642.312002333\",\"description\":\"Address family not supported by protocol\",\"errno\":97,\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.5.3/grpc/src/core/lib/iomgr/socket_utils_common_posix.cc\",\"file_line\":406,\"os_error\":\"Address family not supported by protocol\",\"syscall\":\"socket\",\"target_address\":\"[::]:20160\"}"]
[2023/11/09 10:07:22.513 +08:00] [INFO] [mod.rs:335] ["starting working thread"] [worker=snap-handler]
[2023/11/09 10:07:22.513 +08:00] [INFO] [server.rs:223] ["listening on addr"] [addr=0.0.0.0:20160]
[2023/11/09 10:07:22.514 +08:00] [INFO] [server.rs:248] ["TiKV is ready to serve"]
[2023/11/09 10:07:22.514 +08:00] [WARN] [mod.rs:489] ["failed to register addr to pd"] [body=Body(Streaming)] ["status code"=400]
[2023/11/09 10:07:22.514 +08:00] [INFO] [util.rs:412] ["connecting to PD endpoint"] [endpoints= ]
[2023/11/09 10:07:22.514 +08:00] [INFO] [] ["New connected subchannel at 0x7fa20b418a80 for subchannel 0x7fa20b439540"]
[2023/11/09 10:07:22.514 +08:00] [INFO] [util.rs:412] ["connecting to PD endpoint"] [endpoints= ]
[2023/11/09 10:07:22.514 +08:00] [INFO] [util.rs:477] ["connected to PD leader"] [endpoints= ]
[2023/11/09 10:07:22.514 +08:00] [INFO] [util.rs:188] ["heartbeat sender and receiver are stale, refreshing ..."]
[2023/11/09 10:07:22.514 +08:00] [WARN] [util.rs:207] ["updating PD client done"] [spend=47.145199ms]
[2023/11/09 10:07:22.514 +08:00] [WARN] [mod.rs:489] ["failed to register addr to pd"] [body=Body(Streaming)] ["status code"=400]
[2023/11/09 10:07:22.514 +08:00] [WARN] [mod.rs:489] ["failed to register addr to pd"] [body=Body(Streaming)] ["status code"=400]
[2023/11/09 10:07:22.514 +08:00] [WARN] [mod.rs:489] ["failed to register addr to pd"] [body=Body(Streaming)] ["status code"=400]
[2023/11/09 10:07:22.514 +08:00] [WARN] [mod.rs:489] ["failed to register addr to pd"] [body=Body(Streaming)] ["status code"=400]
[2023/11/09 10:07:22.514 +08:00] [WARN] [mod.rs:499] ["failed to register addr to pd after 5 tries"]
[2023/11/09 10:07:22.795 +08:00] [FATAL] [lib.rs:481] ["entries[6:5580] is unavailable from storage, raft_id: 92010, region_id: 92009"] [backtrace="stack backtrace:\n 0: tikv_util::set_panic_hook::{{closure}}\n at components/tikv_util/src/lib.rs:480\n 1: std::panicking::rust_panic_with_hook\n at src/libstd/panicking.rs:475\n 2: rust_begin_unwind\n at src/libstd/panicking.rs:375\n 3: std::panicking::begin_panic_fmt\n at src/libstd/panicking.rs:326\n 4: raft::raft_log::RaftLog::slice\n at home/jenkins/agent/workspace/ld_tikv_multi_branch_release-4.0/tikv/<::std::macros::panic macros>:9\n 5: raft::raft_log::RaftLog::next_entries_since\n at rust/git/checkouts/raft-rs-841f8a6db665c5c0/b5f5830/src/raft_log.rs:362\n raft::raw_node::Ready::new\n at rust/git/checkouts/raft-rs-841f8a6db665c5c0/b5f5830/src/raw_node.rs:129\n raft::raw_node::RawNode::ready_since\n at rust/git/checkouts/raft-rs-841f8a6db665c5c0/b5f5830/src/raw_node.rs:346\n

| username: Jolyne | Original post link

Looking at the logs, it seems like there is an issue with your region. Check region 92009.
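
A hedged example of how to look at it, assuming pd-ctl can reach this cluster's PD (the Region ID comes from the FATAL line in the log above):

pd-ctl>> region 92009

The output lists the Region's peers (one store_id per replica), its current leader, and its epoch, which tells you which TiKV store holds the broken replica and whether enough healthy replicas remain on other stores.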

| username: 小鱼吃大鱼 | Original post link

How do I check that?

| username: TiDBer_小阿飞 | Original post link

The error indicates that this Region's Raft log entries are unavailable from storage. Check the raft_id and region_id given in the panic message (92010 and 92009).
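
If you want to inspect that replica on the crashing TiKV instance itself, here is a hedged sketch using tikv-ctl in local mode; TiKV must be stopped on that node first, and the data directory path is an assumption to adjust for your deployment:

# list Regions whose local data fails tikv-ctl's health checks
tikv-ctl --data-dir /path/to/tikv bad-regions
# dump the local Raft state of the Region named in the panic
tikv-ctl --data-dir /path/to/tikv raft region -r 92009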

| username: Jolyne | Original post link

Column - A Troubleshooting Summary for "Region is unavailable" | TiDB Community: Basic Process of Region Access

| username: 芮芮是产品 | Original post link

If you can’t resolve the issue with your region, just take this region offline.

| username: 小鱼吃大鱼 | Original post link

How do I take a Region offline?

| username: TiDBer_E7MM03rf | Original post link

Set a Region Replica to Tombstone State

The tombstone command is commonly used when sync-log is not enabled, and the Raft state machine loses some writes due to a power outage. It can set some Region replicas to Tombstone state on a TiKV instance, allowing these Regions to be skipped during restart and avoiding service startup failures due to damaged Raft state machines of these Region replicas. These Regions should have enough healthy replicas on other TiKV instances to continue read and write operations through the Raft mechanism.

Generally, you can first remove the Region replica from PD using the remove-peer command:

pd-ctl>> operator add remove-peer <region_id> <store_id>
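
If the store_id of the broken replica is not known yet, a hedged way to find it is to query PD first: the region command lists the store_id of every peer, and the store command maps store IDs to addresses, so you can pick the one that corresponds to the crashing instance (tidb-qas-cluster-tikv-2 in this thread):

pd-ctl>> region 92009
pd-ctl>> store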

Then use tikv-ctl on that TiKV instance to mark the Region replica as tombstone to skip the health check during startup:

tikv-ctl --data-dir /path/to/tikv tombstone -p 127.0.0.1:2379 -r <region_id>
success!

However, in some cases, when it is not convenient to remove the replica from PD, you can use the --force option of tikv-ctl to forcibly set it to tombstone:

tikv-ctl --data-dir /path/to/tikv tombstone -p 127.0.0.1:2379 -r <region_id>,<region_id> --force
success!

Note

  • This command only supports local mode.
  • The argument of the -p option specifies the PD endpoints, without the http prefix. Specifying the PD endpoints is to ask PD whether it is safe to switch to the Tombstone state.

Reference page: TiKV Control User Guide | PingCAP Docs
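
After the replica has been tombstoned and the TiKV pod restarted, a hedged verification sketch: confirm TiKV no longer panics on startup, then check the Region again from pd-ctl. If the Region is now left with fewer replicas than max-replicas, PD normally schedules a new replica on a healthy store by itself; it can also be added manually:

pd-ctl>> region 92009
pd-ctl>> operator add add-peer 92009 <store_id>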

| username: 小鱼吃大鱼 | Original post link

Thanks to the experts above. Combining this article, Column - TiDB Cluster Recovery: TiKV Cluster Unavailable | TiDB Community, with taking the failed Region offline, the issue has been resolved.

| username: Kongdom | Original post link

Make sure to mark the best answer once your issue is resolved :wink:

| username: 芮芮是产品 | Original post link

If it's resolved, please mark the best answer.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.