TiKV Node Went Offline Abnormally

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv节点异常下线

| username: Jolyne

[TiDB Usage Environment] Production Environment / Testing / Poc
[TiDB Version]
[Reproduction Path] What operations were performed when the issue occurred
This morning we found that a TiKV node had gone offline abnormally around midnight. The TiDB Dashboard page now keeps loading and never finishes.
[Encountered Issue: Problem Phenomenon and Impact]
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]



| username: 考试没答案 | Original post link

Is the TiKV server itself working properly? Also check whether the Dashboard process is running on the same server.

| username: Jolyne | Original post link

No, the Dashboard runs on the PD node, and that node is fine. The TiKV service on the offline node has died. Its log reported the errors above and then showed timeouts when connecting to PD.

| username: 考试没答案 | Original post link

Can this server no longer be restarted?

| username: Jolyne | Original post link

The server is fine; the problem is with the TiKV service. It looks like it was caused by a corrupted SST file.

| username: 考试没答案 | Original post link

Is it affecting usage now?

| username: Jolyne | Original post link

:joy: It definitely has an impact.

| username: 有猫万事足 | Original post link

Try to fix it according to this. My worry is that the storage medium itself may be faulty. If the SST file corruption wasn’t caused by a power outage, it is more likely that part of the storage has suddenly failed.

| username: Jolyne | Original post link

Yes, I am following it.

Could store_id 1 be the broken one, since all the error reports mention it?

| username: tidb菜鸟一只 | Original post link

Are there many damaged SST files?

| username: Jolyne | Original post link

No, I only saw one log entry about an SST file, but the failure message above kept being printed.

| username: tidb菜鸟一只 | Original post link

No, I mean: when you run the command tikv-ctl --data-dir </path/to/tikv> bad-ssts --pd <endpoint>, does it print many corrupted SST files?
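
Since this check can run for a long time, it may help to wrap it with a time limit. Below is a minimal Python sketch, not an official tool: it only assembles the command quoted above and runs it with a timeout. The function names, the example data directory, and the PD address in the comments are placeholders chosen for illustration.

```python
import shlex
import subprocess


def build_bad_ssts_cmd(data_dir, pd_endpoint):
    """Return the argv list for the bad-ssts check quoted above."""
    return ["tikv-ctl", "--data-dir", data_dir, "bad-ssts", "--pd", pd_endpoint]


def run_bad_ssts(data_dir, pd_endpoint, timeout_s=3600):
    """Run the check and give up after timeout_s seconds instead of waiting forever.

    Returns the command's stdout as text, or None if the time limit was hit.
    """
    cmd = build_bad_ssts_cmd(data_dir, pd_endpoint)
    print("running:", shlex.join(cmd))
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return None
    return result.stdout


# Example call (hypothetical path and PD address, replace with your own):
#   output = run_bad_ssts("/path/to/tikv", "127.0.0.1:2379", timeout_s=3600)
#   print(output if output is not None else "bad-ssts did not finish in time")
```

Whatever the command prints comes back unchanged in the return value, so the number of SST files it lists can be read directly from that text.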

| username: Jolyne | Original post link

This check has been stuck for an hour without any response. Is this normal?

| username: tidb菜鸟一只 | Original post link

That probably means a lot of them are broken…

| username: Jolyne | Original post link

This… under what circumstances would a large number of SST files become corrupted?

| username: Jolyne | Original post link

It looks like there are only 2 of them. Are these also normally repaired through the SST files?

| username: redgame | Original post link

All you can do is try to repair it.

| username: Jolyne | Original post link

How do I fix this? I know the file is damaged, but this error looks like an incompatibility problem.

| username: tidb菜鸟一只 | Original post link

That error doesn’t look right; normally the check would print a repair command. How many TiKV nodes and replicas does your cluster have?
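
The node and replica counts matter because TiKV replicates each Region with Raft, and a Region stays available only while a majority of its replicas survive. Here is a rough sketch of that arithmetic. It assumes each Region's replicas sit on different stores, so one broken store costs each Region at most one copy. The function names are illustrative only.

```python
def quorum(replicas):
    """Smallest number of replicas that forms a Raft majority."""
    return replicas // 2 + 1


def survives_loss(replicas, lost):
    """True if a Region with `replicas` copies keeps a majority after losing `lost`."""
    return replicas - lost >= quorum(replicas)


# With 3 replicas, losing one store leaves 2 of 3 copies, which is still a majority.
print(survives_loss(3, 1))
# Losing two of three copies is not survivable.
print(survives_loss(3, 2))
```

If the cluster has 3 replicas on at least 3 TiKV nodes and only this one store is damaged, the data should still be served by the other replicas. In that case, replacing the broken node may be an alternative to repairing its SST files. With fewer replicas or nodes this reasoning does not hold.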