TiKV fails to start during TiDB upgrade

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb升级过程中tikv启动不起来

| username: leo_zxl

[TiDB Usage Environment] Testing
[TiDB Version] v6.0.0 (upgrading from v5.4.3)
[Reproduction Path] Upgraded from v5.4.3 to v6.0.0 using tiup. After the upgrade, TiKV fails to start.
[Encountered Problem: Symptoms and Impact] The TiKV log shows the following errors:
[2023/01/10 10:14:03.782 +08:00] [ERROR] [server.rs:1075] ["failed to init io snooper"] [err_code=KV:Unknown] [err=""IO snooper is not started due to not compiling with BCC""]
[2023/01/10 10:14:06.768 +08:00] [FATAL] [lib.rs:465] ["open raft engine: Other("[components/raft_log_engine/src/engine.rs:373]: Corruption: unrecognized log file version: 2")"] [backtrace=" 0: tikv_util::set_panic_hook::{{closure}}\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tikv_util/src/lib.rs:464:18\n 1: std::panicking::rust_panic_with_hook\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:626:17\n 2: std::panicking::begin_panic_handler::{{closure}}\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:519:13\n 3: std::sys_common::backtrace::__rust_end_short_backtrace\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/sys_common/backtrace.rs:141:18\n 4: rust_begin_unwind\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:515:5\n 5: core::panicking::panic_fmt\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/core/src/panicking.rs:92:14\n 6: core::result::unwrap_failed\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/core/src/result.rs:1599:5\n 7: core::result::Result<T,E>::expect\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/core/src/result.rs:1241:23\n server::raft_engine_switch::check_and_dump_raft_engine\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/server/src/raft_engine_switch.rs:233:23\n server::server::TiKVServer<engine_rocks::engine::RocksEngine>::init_raw_engines\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/server/src/server.rs:1306:9\n server::server::run_tikv\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/server/src/server.rs:156:9\n 8: tikv_server::main\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/cmd/tikv-server/src/main.rs:190:5\n 9: core::ops::function::FnOnce::call_once\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/core/src/ops/function.rs:227:5\n std::sys_common::backtrace::__rust_begin_short_backtrace\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/sys_common/backtrace.rs:125:18\n 10: main\n 11: __libc_start_main\n 12: \n"] [location=components/server/src/raft_engine_switch.rs:234] [thread_name=main]
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]
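
With logs this long, it can help to filter for the ERROR and FATAL entries before reading the backtrace. Below is a minimal Python sketch, assuming every entry starts with `[timestamp] [LEVEL] [source]` as in the two lines quoted above. `severe_entries` is a hypothetical helper name (not part of TiKV or tiup), the INFO line in `sample` is made up for illustration, and the FATAL message is abbreviated.

```python
import re

# Matches the start of an entry: [timestamp] [LEVEL] [source file:line] rest
# The layout is assumed from the two log lines quoted in this post.
ENTRY_RE = re.compile(
    r'^\[(?P<time>[^\]]+)\] \[(?P<level>[A-Z]+)\] \[(?P<source>[^\]]+)\] (?P<rest>.*)$'
)

def severe_entries(lines, levels=("ERROR", "FATAL")):
    """Yield (time, level, source, rest) for entries at the given levels."""
    for line in lines:
        match = ENTRY_RE.match(line.rstrip("\n"))
        if match and match.group("level") in levels:
            yield (match.group("time"), match.group("level"),
                   match.group("source"), match.group("rest"))

sample = [
    '[2023/01/10 10:14:03.782 +08:00] [ERROR] [server.rs:1075] ["failed to init io snooper"] [err_code=KV:Unknown]',
    # Hypothetical INFO line, included only to show that it is skipped.
    '[2023/01/10 10:14:03.900 +08:00] [INFO] [server.rs:100] ["hypothetical info line"]',
    # Message abbreviated from the FATAL entry above.
    '[2023/01/10 10:14:06.768 +08:00] [FATAL] [lib.rs:465] ["open raft engine: ..."]',
]

for when, level, source, rest in severe_entries(sample):
    print(level, source, rest[:60])
```

To use it on a real log, pass an open file object instead of `sample`; the path to the TiKV log depends on the deployment.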

| username: 裤衩儿飞上天 | Original post link

Take a look at the disk usage.

| username: leo_zxl | Original post link

There are still 50G available on the disk.

| username: 裤衩儿飞上天 | Original post link

Has the upgrade been completed? Did this issue occur during the upgrade process, or did it appear after the upgrade was finished?

| username: leo_zxl | Original post link

The upgrade was not completed, and the version of TiKV is v6.0.0.

| username: 裤衩儿飞上天 | Original post link

Were there any other errors during the upgrade process? Are there any error logs? Were there any errors at the OS level for this TiKV node? Did the other nodes upgrade successfully?

| username: leo_zxl | Original post link

The upgrade failed partway through, and the failure was reported while TiKV was being started. At first only one TiKV node couldn’t start; after a restart, none of the three TiKV nodes could start. They all report the same startup error, and there are no errors at the OS level.

| username: 裤衩儿飞上天 | Original post link

  1. Was it an online upgrade or an offline upgrade? If it was online, were any other operations run during the process? Did anyone else log in to the database and run DDL operations?
  2. Please share the upgrade procedure, the cluster topology, the commands used, the error logs, and so on.

| username: leo_zxl | Original post link

It was an online (non-stop) upgrade, and there should have been no DDL operations during it. The original target was v6.1.3, but that upgrade failed. I then changed the target to v6.0.0, and that upgrade failed as well. Cluster topology:

| username: leo_zxl | Original post link

tiup-cluster-debug-2023-01-09-17-07-47.log (611.0 KB)

| username: 裤衩儿飞上天 | Original post link

  1. A TiDB cluster cannot be rolled back partway through an upgrade.
    I don’t know what errors you hit when upgrading from v5.4.3 to v6.1.3, or whether any nodes had already been upgraded successfully.
    You then ran a v6.0.0 upgrade against a cluster that was only partially upgraded, which is bound to cause unpredictable errors.
  2. Judging from the error ("unrecognized log file version: 2" while opening the raft engine), TiKV cannot recognize one of its Raft log files, so it fails to start.
  3. If you still want to continue, my personal suggestion is to try resuming the upgrade to v6.1.3, but please see whether other experts have a better solution.
  4. Upgrading to a DMR version is not recommended. It is fine for personal testing, but otherwise use an LTS version such as 6.1 or 6.5.
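
Points 1 and 4 can be turned into a quick sanity check before choosing an upgrade target. This is only a sketch, assuming version strings of the form `vMAJOR.MINOR.PATCH`. `check_target`, the `LTS_SERIES` set (limited to the 6.1 and 6.5 series named above), and the component versions in `state` are hypothetical illustrations, not tiup behavior or the real state of this cluster.

```python
def parse_version(text):
    """Turn 'v6.1.3' into (6, 1, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.lstrip("v").split("."))

# Only the LTS series named in this reply; not an exhaustive list.
LTS_SERIES = {(6, 1), (6, 5)}

def check_target(component_versions, target):
    """Return a list of warnings for upgrading the cluster to `target`."""
    warnings = []
    target_v = parse_version(target)
    for name, current in component_versions.items():
        # A component newer than the target would have to be downgraded.
        if parse_version(current) > target_v:
            warnings.append(f"{name} is already at {current}, newer than {target}")
    if target_v[:2] not in LTS_SERIES:
        warnings.append(f"{target} is not in an LTS series listed here")
    return warnings

# Hypothetical state after a partially applied upgrade to v6.1.3.
state = {"pd": "v6.1.3", "tikv-1": "v6.1.3", "tikv-2": "v5.4.3"}
for warning in check_target(state, "v6.0.0"):
    print(warning)
```

With this made-up state, targeting v6.0.0 produces warnings, while `check_target(state, "v6.1.3")` returns an empty list.
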

| username: leo_zxl | Original post link

OK.

| username: leo_zxl | Original post link

Manually upgrading to a higher version worked. :+1:

| username: 裤衩儿飞上天 | Original post link

:+1: :+1: :+1:
Once the cluster is up, it is recommended to back up the data first, in case of unforeseen errors.