TiKV Crashed Unexpectedly, Restart Fails: panic_mark_file

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: Tikv 意外崩溃, 重启失败 panic_mark_file

| username: 最强王者

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version]
5.2.3
[Reproduction Path] Operations performed that led to the issue


[2024/01/25 20:46:37.618 +08:00] [FATAL] [server.rs:405] ["panic_mark_file tidb/tidb-data/tikv-20160/panic_mark_file exists, there must be something wrong with the db. Do not remove the panic_mark_file and force the TiKV node to restart. Please contact TiKV maintainers to investigate the issue. If needed, use scale in and scale out to replace the TiKV node. Scale a TiDB Cluster Using TiUP | PingCAP Docs"]
[Encountered Issue: Issue Phenomenon and Impact]
TiKV crashed and keeps restarting on its own; a manual restart also fails.
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots / Logs / Monitoring]
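
The FATAL message above indicates that TiKV refuses to start while panic_mark_file exists in its data directory. A minimal, read-only sketch for checking this before attempting another restart follows. The function name has_panic_mark is hypothetical, and the data directory must be supplied by the caller.

```shell
# Report whether a TiKV data directory contains the panic marker.
# Hypothetical helper: it only inspects the directory and changes nothing.
has_panic_mark() {
    local data_dir="$1"
    if [ -e "${data_dir}/panic_mark_file" ]; then
        echo "panic_mark_file present in ${data_dir}"
        return 0
    fi
    echo "no panic_mark_file in ${data_dir}"
    return 1
}
```

With the path from the log above, this would be called as has_panic_mark tidb/tidb-data/tikv-20160 on the affected TiKV host.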

| username: 像风一样的男子 | Original post link

It looks like a bug was triggered.

| username: tidb狂热爱好者 | Original post link

Replying this fast just to grab points.

| username: WalterWj | Original post link

The blunt approach is to move this panic file out of the way with the mv command.
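
The mv approach could be sketched as follows, with the caveat that the TiKV log message explicitly warns against removing the marker and forcing a restart. The helper name set_aside_panic_mark and the backup naming scheme are assumptions made for illustration. It renames the file rather than deleting it, so the marker can be restored, and it presumes the TiKV process has been stopped first.

```shell
# Rename the panic marker instead of deleting it, so it can be put back.
# Hypothetical helper; the ".bak.<timestamp>" suffix is an arbitrary choice.
set_aside_panic_mark() {
    local data_dir="$1"
    local mark="${data_dir}/panic_mark_file"
    local backup="${mark}.bak.$(date +%Y%m%d%H%M%S)"
    if [ ! -e "$mark" ]; then
        echo "nothing to do: ${mark} not found" >&2
        return 1
    fi
    # Print the backup path on success so the caller can restore it later.
    mv -- "$mark" "$backup" && echo "$backup"
}
```

Keeping the renamed file around preserves the evidence that a panic occurred, which may matter if the node later has to be replaced by scaling in and out.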

| username: 最强王者 | Original post link

The log says it should not be removed, right?

| username: 像风一样的男子 | Original post link

Also, check whether there are any corrupted SST files.

| username: 最强王者 | Original post link

Okay, please wait a moment while I check.

| username: 最强王者 | Original post link

Is this executed on the control machine? tikv-ctl --data-dir </path/to/tikv> bad-ssts --pd

| username: WalterWj | Original post link

That is why I called it the blunt approach.

:smile: Try not to delete it. You can consider scaling in and scaling out to replace the node instead.
Then search the TiKV logs for the keyword "panic" and post the stack trace; it might be a known bug.
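
Searching the logs for panics could look roughly like this. The function name show_panics and the default of 20 context lines are assumptions. The log file path is not given in the thread and has to be supplied by the caller.

```shell
# Print every log line that mentions "panic" (case-insensitive), with line
# numbers and some trailing context so a stack trace after the match is kept.
# Hypothetical helper; the context size defaults to 20 lines.
show_panics() {
    local log_file="$1"
    local context="${2:-20}"
    grep -i -n -A "$context" 'panic' "$log_file"
}
```

Note that this also matches the panic_mark_file startup message itself, so the earliest matches in the log are usually the ones worth posting.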

| username: 像风一样的男子 | Original post link

Run it on the problematic TiKV node.

| username: 像风一样的男子 | Original post link

Be prepared for the worst: you may need to repair data.

| username: 最强王者 | Original post link

The keyword "panic" appears periodically in the logs.

| username: 最强王者 | Original post link

Okay, thank you.

| username: tidb狂热爱好者 | Original post link

The nodes that failed are all HDDs, right?

| username: 最强王者 | Original post link

The command produced no output.

| username: 最强王者 | Original post link

It is SSD.

| username: 最强王者 | Original post link

The command returned nothing, so the disk should be fine.

| username: tidb狂热爱好者 | Original post link

Then try removing it.

| username: 江湖故人 | Original post link

Is there any useful information in the panic_mark_file? Judging from other posts, this file seems to appear when TiKV is about to go through recovery after an abnormal exit.

| username: dba远航 | Original post link

It feels like the original process hasn't completely stopped yet.