TiKV Restart Failure Due to Log Opening Error

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv重启失败报日志打开错误 (TiKV restart fails with a log-open error)

| username: Steve阿辉

[TiDB Usage Environment] Production Environment
[TiDB Version] 6.5.0
[Reproduction Path] Operations performed that caused the issue: Recently the cluster experienced some anomalies and the TiKV and TiDB nodes restarted on their own, so a restart test was performed at midnight to check whether the database had any latent problems.
[Encountered Issue: Problem Phenomenon and Impact] One of the three TiKV nodes fails to start.
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page

[Attachments: Screenshots/Logs/Monitoring]

[2023/07/15 04:32:57.104 +08:00] [INFO] [mod.rs:126] [“encryption: none of key dictionary and file dictionary are found.”]
[2023/07/15 04:32:57.104 +08:00] [INFO] [mod.rs:509] [“encryption is disabled.”]
[2023/07/15 04:32:57.491 +08:00] [ERROR] [pipe_builder.rs:354] [“Failed to open log file due to broken entry: Append:17000 offset=132020349”]
[2023/07/15 04:32:57.492 +08:00] [ERROR] [pipe_builder.rs:354] [“Failed to open log file due to broken entry: Append:17014 offset=122139864”]
[2023/07/15 04:33:00.581 +08:00] [FATAL] [lib.rs:495] [“failed to open raft engine: Other("[components/raft_log_engine/src/engine.rs:646]: Corruption: Log item offset is smaller than log batch header length")”] [backtrace=" 0: tikv_util::set_panic_hook::{{closure}}\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tikv_util/src/lib.rs:494:18\n 1: <alloc::boxed::Box<F,A> as core::ops::function::Fn>::call\n at /rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/boxed.rs:2032:9\n std::panicking::rust_panic_with_hook\n at /rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:692:13\n 2: std::panicking::begin_panic_handler::{{closure}}\n at /rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:579:13\n 3: std::sys_common::backtrace::__rust_end_short_backtrace\n at /rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys_common/backtrace.rs:137:18\n 4: rust_begin_unwind\n at /rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:575:5\n 5: core::panicking::panic_fmt\n at /rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/panicking.rs:65:14\n 6: core::result::unwrap_failed\n at /rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/result.rs:1791:5\n 7: core::result::Result<T,E>::expect\n at /rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/result.rs:1070:23\n <raft_log_engine::engine::RaftLogEngine as server::server::ConfiguredRaftEngine>::build\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/server/src/server.rs:1772:13\n server::server::TikvServer::init_raw_engines\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/server/src/server.rs:1807:27\n 8: server::server::run_impl\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/server/src/server.rs:162:35\n server::server::run_tikv\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/server/src/server.rs:197:5\n 9: tikv_server::main\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/cmd/tikv-server/src/main.rs:210:5\n 10: core::ops::function::FnOnce::call_once\n at /rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:513:5\n std::sys_common::backtrace::__rust_begin_short_backtrace\n at /rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys_common/backtrace.rs:121:18\n 11: main\n 12: __libc_start_main\n 13: \n"] [location=components/server/src/server.rs:1773] [thread_name=main]

Earlier error logs:
[2023/07/15 04:20:45.223 +08:00] [ERROR] [pipe_builder.rs:354] [“Failed to open log file due to broken entry: Append:17014 offset=122139864”]
[2023/07/15 04:20:45.317 +08:00] [ERROR] [pipe_builder.rs:354] [“Failed to open log file due to broken entry: Append:17000 offset=132020349”]

[2023/07/15 04:17:13.119 +08:00] [ERROR] [pipe_builder.rs:354] [“Failed to open log file due to broken entry: Append:17014 offset=122139864”]
[2023/07/15 04:17:13.143 +08:00] [ERROR] [pipe_builder.rs:354] [“Failed to open log file due to broken entry: Append:17000 offset=132020349”]

[2023/07/15 03:57:42.337 +08:00] [ERROR] [pipe_builder.rs:354] [“Failed to open log file due to broken entry: Append:17014 offset=122139864”]
[2023/07/15 03:57:42.386 +08:00] [ERROR] [pipe_builder.rs:354] [“Failed to open log file due to broken entry: Append:17000 offset=132020349”]

[2023/07/15 03:54:29.033 +08:00] [ERROR] [pipe_builder.rs:354] [“Failed to open log file due to broken entry: Append:17014 offset=122139864”]
[2023/07/15 03:54:29.098 +08:00] [ERROR] [pipe_builder.rs:354] [“Failed to open log file due to broken entry: Append:17000 offset=132020349”]

[2023/07/15 03:54:15.386 +08:00] [INFO] [node.rs:548] [“stop raft store thread”] [store_id=4]
[2023/07/15 03:54:15.386 +08:00] [INFO] [batch.rs:635] [“shutdown batch system apply”]
[2023/07/15 03:54:15.386 +08:00] [INFO] [router.rs:330] [“broadcasting shutdown”]
[2023/07/15 03:54:15.390 +08:00] [INFO] [mod.rs:237] [“Storage stopped.”]
[2023/07/15 03:54:15.390 +08:00] [ERROR] [pd.rs:102] [“Failed to send read flow statistics”] [err=“channel has been closed”]
[2023/07/15 03:54:15.390 +08:00] [ERROR] [pd.rs:102] [“Failed to send read flow statistics”] [err=“channel has been closed”]
[2023/07/15 03:54:15.413 +08:00] [INFO] [store.rs:2808] [“broadcasting unreachable”] [unreachable_store_id=5] [store_id=4]
[2023/07/15 03:54:15.441 +08:00] [ERROR] [pd.rs:102] [“Failed to send read flow statistics”] [err=“channel has been closed”]
[2023/07/15 03:54:15.441 +08:00] [ERROR] [pd.rs:102] [“Failed to send read flow statistics”] [err=“channel has been closed”]
[2023/07/15 03:54:15.442 +08:00] [ERROR] [pd.rs:102] [“Failed to send read flow statistics”] [err=“channel has been closed”]
[2023/07/15 03:54:15.444 +08:00] [ERROR] [pd.rs:102] [“Failed to send read flow statistics”] [err=“channel has been closed”]
[2023/07/15 03:54:15.445 +08:00] [ERROR] [pd.rs:102] [“Failed to send read flow statistics”] [err=“channel has been closed”]
[2023/07/15 03:54:15.445 +08:00] [ERROR] [pd.rs:102] [“Failed to send read flow statistics”] [err=“channel has been closed”]
[2023/07/15 03:54:15.445 +08:00] [ERROR] [pd.rs:102] [“Failed to send read flow statistics”] [err=“channel has been closed”]
[2023/07/15 03:54:15.445 +08:00] [ERROR] [pd.rs:102] [“Failed to send read flow statistics”] [err=“channel has been closed”]

[2023/07/15 03:54:15.342 +08:00] [ERROR] [raft_client.rs:562] [“connection aborted”] [addr=172.16.16.81:20160] [receiver_err=“Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "Socket closed", details: }))”] [sink_error=“Some(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: "Socket closed", details: })))”] [store_id=1]
[2023/07/15 03:54:15.344 +08:00] [ERROR] [raft_client.rs:858] [“connection abort”] [addr=172.16.16.81:20160] [store_id=1]
[2023/07/15 03:54:15.344 +08:00] [INFO] [store.rs:2808] [“broadcasting unreachable”] [unreachable_store_id=1] [store_id=4]
[2023/07/15 03:54:15.347 +08:00] [ERROR] [raft_client.rs:562] [“connection aborted”] [addr=172.16.16.82:20160] [receiver_err=“Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "Socket closed", details: }))”] [sink_error=“Some(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: "Socket closed", details: })))”] [store_id=5]
[2023/07/15 03:54:15.348 +08:00] [ERROR] [raft_client.rs:858] [“connection abort”] [addr=172.16.16.82:20160] [store_id=5]
[2023/07/15 03:54:15.351 +08:00] [WARN] [deadlock.rs:1007] [“deadlock detect rpc stream disconnected”] [error=Grpc(RpcFinished(None))]
[2023/07/15 03:54:15.351 +08:00] [WARN] [deadlock.rs:1007] [“deadlock detect rpc stream disconnected”] [error=Grpc(RpcFinished(None))]
[2023/07/15 03:54:15.351 +08:00] [INFO] [raft_client.rs:790] [“resolve store address ok”] [addr=172.16.16.81:20160] [store_id=1]
[2023/07/15 03:54:15.351 +08:00] [INFO] [raft_client.rs:675] [“server: new connection with tikv endpoint”] [store_id=1] [addr=172.16.16.81:20160]
[2023/07/15 03:54:15.369 +08:00] [INFO] [] [“subchannel 0x7f2297069400 {address=ipv4:172.16.16.81:20160, args=grpc.client_channel_factory=0x7f22c6c921d0, grpc.default_authority=172.16.16.81:20160, grpc.default_compression_algorithm=0, grpc.gprc_min_message_size_to_compress=4096, grpc.gzip_compression_level=2, grpc.http2.lookahead_bytes=2097152, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f22c6c389a0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_reconnect_backoff_ms=5000, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f22c6cb50e0, grpc.server_uri=dns:///172.16.16.81:20160, random id=4}: connect failed: {"created":"@1689364455.361808780","description":"Failed to connect to remote host: Connection refused","errno":111,"file":"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.10.3+1.44.0-patched/grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":200,"os_error":"Connection refused","syscall":"connect","target_address":"ipv4:172.16.16.81:20160"}”]
[2023/07/15 03:54:15.369 +08:00] [INFO] [] [“subchannel 0x7f2297069400 {address=ipv4:172.16.16.81:20160, args=grpc.client_channel_factory=0x7f22c6c921d0, grpc.default_authority=172.16.16.81:20160, grpc.default_compression_algorithm=0, grpc.gprc_min_message_size_to_compress=4096, grpc.gzip_compression_level=2, grpc.http2.lookahead_bytes=2097152, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f22c6c389a0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_reconnect_backoff_ms=5000, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f22c6cb50e0, grpc.server_uri=dns:///172.16.16.81:20160, random id=4}: Retry in 982 milliseconds”]
[2023/07/15 03:54:15.384 +08:00] [ERROR] [kv.rs:671] [“KvService::batch_raft send response fail”] [err=RemoteStopped]
[2023/07/15 03:54:15.384 +08:00] [ERROR] [kv.rs:671] [“KvService::batch_raft send response fail”] [err=RemoteStopped]
[2023/07/15 03:54:15.384 +08:00] [INFO] [raft_client.rs:790] [“resolve store address ok”] [addr=172.16.16.82:20160] [store_id=5]
[2023/07/15 03:54:15.384 +08:00] [INFO] [raft_client.rs:675] [“server: new connection with tikv endpoint”] [store_id=5] [addr=172.16.16.82:20160]
[2023/07/15 03:54:15.385 +08:00] [INFO] [] ["subchannel 0x7f2297069c00 {address=ipv4:172.16.16.82:20160, args=grpc.client_channel_factory=0x7f22c6c921d0, grpc.default_authority=172.16.16.82:20160, grpc.default_compression_algorithm=0, grpc.gprc_min_message_size_to_compress=4096, grpc.gzip_compression_level=2, grpc.http2.lookahead_bytes=2097152, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f22c6c389a0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_reconnect_backoff_ms=5000, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f22c6cb50e0, grpc.server_uri=dns:///172.16.16.82:20160, random id=5}: connect

| username: zhanggame1 | Original post link

Take a look at this issue, others have encountered it as well:
TiKV panic: Failed to open log file due to broken entry: Append:32 offset=21277880 · Issue #14304 · tikv/tikv · GitHub

It will likely take scaling in this TiKV node and then scaling it back out to resolve the problem.

| username: Steve阿辉 | Original post link

Thank you, the issue in the post is quite similar, but I don’t want to remove and rebuild everything for every small problem, as it is too time-consuming and labor-intensive. Why can such a small issue cause the TiKV node to crash directly? Isn’t there a way to fix it?

| username: zhanggame1 | Original post link

Judging from the logs, it seems to be log corruption, which is a serious issue in any database.

| username: 有猫万事足 | Original post link

You can try
tikv-ctl --data-dir /path/to/tikv bad-regions

and then mark the corresponding region as tombstone.
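
For reference, the rough sequence (a sketch only; /path/to/tikv, the PD address, and the region IDs are placeholders, and both commands run in local mode with the TiKV process stopped) would be to list the bad regions and then tombstone them, provided their replicas on the two healthy stores are intact:

tikv-ctl --data-dir /path/to/tikv bad-regions
tikv-ctl --data-dir /path/to/tikv tombstone -p <pd_ip>:2379 -r <region_id>,<region_id>

Check tikv-ctl tombstone --help on your version first; a --force flag may be needed to skip the safety check.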

| username: redgame | Original post link

It may be caused by corrupted log files. You can manually delete the corresponding log files, specifically those containing “Append:17000” and “Append:17014”. Make sure to back them up before deleting.
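
If you try this, note that Append:17000 and Append:17014 in the error appear to refer to raft-engine log file sequence numbers. Assuming the default layout (raft-engine.dir not set, so the files sit in the raft-engine subdirectory of this node's data-dir) and that the file names contain the sequence number, you could locate and back them up with something like:

ls /path/to/tikv-data/raft-engine | grep -E '17000|17014'
cp /path/to/tikv-data/raft-engine/<matching_file> /path/to/backup/

Both paths are placeholders; verify raft-engine.dir in this node's TiKV configuration before moving or deleting anything.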

| username: tidb狂热爱好者 | Original post link

You didn’t shut down normally; you had a sudden power outage. It’s no wonder something broke.
The proper way is to stop the service first with service tikv-20160 stop and only then reboot.
How could you just restart directly?
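
If the cluster is managed with tiup, a roughly equivalent approach (a sketch; mycluster and <tikv_ip> are placeholders for the actual cluster name and this node's address) is to stop only this TiKV instance before the OS reboot and start it again afterwards:

tiup cluster stop mycluster -N <tikv_ip>:20160
tiup cluster start mycluster -N <tikv_ip>:20160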

| username: zhanggame1 | Original post link

Generally it’s fine. I’ve forcefully shut down all kinds of databases many times and they all came back up. For the TiDB virtual-machine cluster I use for testing, I always shut it down directly when I’m done. As long as the hardware is fine, it’s usually hard to run into problems.

| username: tidb狂热爱好者 | Original post link

The machine is handling business traffic and is constantly writing data, so forcibly shutting it down can cause data loss. Removing the node and adding it back will solve the issue.

| username: Steve阿辉 | Original post link

I restarted the database from the tiup command line; it stops the components first and then starts them again.
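
For context, that operation is along these lines (a sketch; mycluster is a placeholder for the actual cluster name), and tiup performs it as an orderly stop followed by a start:

tiup cluster restart mycluster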

| username: Steve阿辉 | Original post link

How exactly do I do this? Where are these log files located?

| username: Steve阿辉 | Original post link

We are using cloud servers, and TiDB has already had quite a few issues. It feels very fragile, and there are no systematic, automated recovery and maintenance mechanisms to ensure the reliability of the system.

| username: Steve阿辉 | Original post link

At that time there was barely any write traffic; it was done at 4 AM. This was not a forced shutdown; it was a restart using the tiup command line.