Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: TiKV node went offline abnormally (tikv节点异常下线)
[TiDB Usage Environment] Production Environment / Testing / Poc
[TiDB Version]
[Reproduction Path] What operations were performed when the issue occurred
This morning we found that a TiKV node had gone offline abnormally around midnight, and the TiDB Dashboard page now just keeps loading.
[Encountered Issue: Problem Phenomenon and Impact]
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]
Is the TiKV server itself functioning properly? Also, is the Dashboard process running on that same server?
No, the Dashboard runs on the PD node, and that node is fine. The TiKV service on the offline node has died. Its log reported the errors above and then timed out connecting to PD.
Can this server no longer be restarted?
The server is fine, but there is an issue with the service. It seems that the problem is caused by a corrupted SST file.
Is it affecting usage now?
It definitely has an impact.
Try to fix it according to this. My only worry is that the storage medium itself may be faulty. If the SST file corruption was not caused by a power outage, the more likely explanation is that part of the storage has suddenly failed.
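If the storage medium is a suspect, the kernel log on the TiKV host is a quick first place to look. The following is a minimal sketch, assuming a Linux host. The function name scan_kernel_log and the message patterns are illustrative choices, not an exhaustive list of what a failing disk prints.

```shell
#!/usr/bin/env bash
# Sketch only: look for storage-related errors in a saved kernel log.
# ASSUMPTION: the patterns below cover common Linux block-layer and
# filesystem messages; a failing device may log something else entirely.

# scan_kernel_log <file>
# Prints every line of <file> that matches one of the patterns.
# Prints nothing (and still exits 0) when there is no match.
scan_kernel_log() {
  grep -iE 'i/o error|medium error|ext4-fs error|xfs.*corrupt' "$1" || true
}
```

Typical use would be to save the log with dmesg -T > dmesg.txt and then run scan_kernel_log dmesg.txt. Where smartmontools is installed, smartctl -H on the data disk reports the drive's own health verdict.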
Yes, I am following this.
Could store_id 1 be the broken one, since every error report mentions it?
Are there many damaged SST files?
No, I only saw one log entry about an SST file, but the failure messages above kept being printed.
No, I mean: when you run tikv-ctl --data-dir </path/to/tikv> bad-ssts --pd <endpoint>, does it print many corrupted SSTs?
This check has been stuck for an hour without any response. Is this normal?
That probably means a lot of them are broken…
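An hour of silence does not by itself distinguish a slow scan from a hung one. Below is a minimal sketch for bounding the run and keeping its output on disk. run_limited and count_sst_errors are hypothetical helper names, timeout is the GNU coreutils tool, and the counting pattern assumes each corrupted file is reported on a line of the form <path>.sst: <message>, which should be checked against the real output.

```shell
#!/usr/bin/env bash
# Sketch only: bound the runtime of a long scan and keep its output.
# ASSUMPTION: the scan command is supplied by the caller; nothing here is
# specific to tikv-ctl, and the log-line pattern below is a guess.

# run_limited <seconds> <logfile> <command...>
# Runs the command under coreutils `timeout`, sending stdout and stderr
# to <logfile>. Exit status 124 means the time limit was reached.
run_limited() {
  local limit="$1" logfile="$2"
  shift 2
  timeout "$limit" "$@" >"$logfile" 2>&1
}

# count_sst_errors <logfile>
# Counts lines that look like "<path>.sst: <message>".
count_sst_errors() {
  grep -c '\.sst: ' "$1" || true
}
```

The bad-ssts invocation quoted in this thread could be passed as the command, for instance with a limit of a few hours. A 124 exit status then means the scan was cut off, not that it finished cleanly.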
This… under what circumstances would a large number of SST files become corrupted?
It looks like only 2 are corrupted. Are these also normally repaired through the SST files?
How do I fix this? I know the file is damaged, but this error looks like an incompatibility problem.
Your output doesn’t look right; normally the command would print a suggested repair command. How many TiKV nodes and replicas does your cluster have?
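The node and replica counts matter because Raft needs a majority of replicas: with 3 replicas spread over at least 3 stores, losing one store still leaves 2 of 3 copies of each Region. Here is a minimal sketch for reading both numbers from saved pd-ctl output. It assumes stores.json holds the pretty-printed JSON from the pd-ctl store command and replication.json holds the output of config show replication. The helper names and the grep/sed patterns are illustrative and depend on that layout.

```shell
#!/usr/bin/env bash
# Sketch only: summarise saved pd-ctl output without extra tools.
# ASSUMPTION: the JSON is pretty-printed with one key per line, written as
# "state_name": "Up" and "max-replicas": 3 (exactly one space after the colon).

# count_stores_in_state <stores.json> <state>
# Counts stores whose state_name equals <state> (for example Up or Down).
count_stores_in_state() {
  grep -c "\"state_name\": \"$2\"" "$1" || true
}

# max_replicas <replication.json>
# Prints the configured replica count.
max_replicas() {
  sed -n 's/.*"max-replicas": *\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}
```

If the number of Up stores is at least the value printed by max_replicas, the remaining nodes can still hold a full set of replicas once the damaged store is repaired or replaced.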