Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: Tikv 意外崩溃, 重启失败 panic_mark_file (TiKV crashed unexpectedly, restart failed, panic_mark_file)
[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version]
5.2.3
[Reproduction Path] Operations performed that led to the issue
[2024/01/25 20:46:37.618 +08:00] [FATAL] [server.rs:405] ["panic_mark_file tidb/tidb-data/tikv-20160/panic_mark_file exists, there must be something wrong with the db. Do not remove the panic_mark_file and force the TiKV node to restart. Please contact TiKV maintainers to investigate the issue. If needed, use scale in and scale out to replace the TiKV node. Scale a TiDB Cluster Using TiUP | PingCAP Docs"]
[Encountered Issue: Issue Phenomenon and Impact]
TiKV crashed and keeps restarting on its own; manual restarts also fail.
[Resource Configuration]
Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots / Logs / Monitoring]
It looks like a bug was triggered.
You grabbed the first spot, replying so quickly.
To put it bluntly, you can just move that panic file out of the way with the mv command.
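For reference, a minimal sketch of that workaround, using the data directory from the FATAL log above (the exact path may need adjusting; renaming rather than deleting keeps the file recoverable):

# rename the panic mark file out of the way instead of deleting it (path taken from the log above)
mv tidb/tidb-data/tikv-20160/panic_mark_file tidb/tidb-data/tikv-20160/panic_mark_file.bak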
Doesn't the log say the file must not be removed?
Also, check if there are any corrupted SST files
Okay, please wait a moment while I check.
Is this executed on the control machine? tikv-ctl --data-dir </path/to/tikv> bad-ssts --pd
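For reference, the documented form of the command takes a PD endpoint and should be run on the affected TiKV node with the tikv-server process stopped; the endpoint below is a placeholder:

# check for corrupted SST files; <pd_endpoint> is a placeholder such as http://<pd-ip>:2379
tikv-ctl --data-dir </path/to/tikv> bad-ssts --pd <pd_endpoint>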
So what you're saying is to take the more aggressive approach.
Try not to delete it. You can consider scaling in and out to replace the node.
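If it does come to replacing the node, a rough sketch with tiup (cluster name, IP, and topology file are placeholders):

# bring up a replacement TiKV node first, then remove the broken one
tiup cluster scale-out <cluster-name> scale-out.yaml
tiup cluster scale-in <cluster-name> --node <ip>:20160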
Also, search the TiKV logs for the keyword "panic" and post the stack trace; it might be a known bug.
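Something along these lines should pull out the stack trace; the log path is an assumption and should be adjusted to the actual deploy directory:

# print the lines following each panic, which usually include the backtrace
grep -A 50 'panic' /tidb-deploy/tikv-20160/log/tikv.log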
Run it on the problematic TiKV node.
Be prepared for the worst case: having to repair the data.
The keyword "panic" appears in the logs periodically.
The nodes that failed are all HDDs, right?
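One quick way to confirm whether the data disks are spinning disks (ROTA=1 means a rotational/HDD device):

# list block devices with their rotational flag
lsblk -d -o NAME,ROTA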
The command produced no output.
If it runs and returns nothing, no corrupted SSTs were reported; the disk should be fine.
Is there any useful information in the panic_mark_file itself? From other posts, it seems that the appearance of this file means TiKV is preparing for an abnormal recovery.
It feels like the original process hasn't been completely stopped yet.
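A couple of hedged checks to verify whether the old tikv-server process is really gone; the systemd unit name is an assumption based on the usual tiup naming of tikv-<port>:

# look for a lingering tikv-server process
ps -ef | grep [t]ikv-server
# check the service status (unit name assumed from the tiup convention)
systemctl status tikv-20160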