TiKV Node Abnormal Offline

translator_bot · June 21, 2024, 11:35pm

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv节点异常下线

| username: Jolyne

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version]
[Reproduction Path] Operations performed when the issue occurred
This morning we found that a TiKV node had gone offline abnormally around midnight. The TiDB Dashboard page now keeps loading and never finishes.
[Encountered Issue: Symptoms and Impact]
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]

translator_bot · June 21, 2024, 11:35pm

| username: 考试没答案 | Original post link

Is the TiKV server itself working properly? Also check whether the Dashboard process is running on the same server.

translator_bot · June 21, 2024, 11:35pm

| username: Jolyne | Original post link

No, the Dashboard runs on the PD node, and that node is fine. The TiKV service on the offline node has died. Its logs reported the errors shown above and then a timeout when connecting to PD.

translator_bot · June 21, 2024, 11:35pm

| username: Jolyne | Original post link

[Image attachment, not available in this translation]

translator_bot · June 21, 2024, 11:35pm

| username: 考试没答案 | Original post link

Can this server no longer be restarted?

translator_bot · June 21, 2024, 11:35pm

| username: Jolyne | Original post link

The server is fine, but there is an issue with the service. It seems that the problem is caused by a corrupted SST file.

translator_bot · June 21, 2024, 11:35pm

| username: 考试没答案 | Original post link

Is it affecting usage now?

translator_bot · June 21, 2024, 11:35pm

| username: Jolyne | Original post link

It definitely has an impact.

translator_bot · June 21, 2024, 11:35pm

| username: 有猫万事足 | Original post link

Try to fix it according to this. My worry is that the storage medium itself may be faulty: if the SST file corruption wasn’t caused by a power outage, a sudden failure in part of the storage seems the more likely cause.
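
One way to look for signs of a failing disk on the TiKV host is to scan the kernel log for I/O errors and query the drive's SMART health. This is a minimal sketch, assuming a Linux host with smartmontools installed. The function name disk_error_lines and the error patterns are illustrative assumptions, not an official or complete list, and <device> is a placeholder.

```shell
# Keep only kernel-log lines (read from stdin) that commonly accompany a
# failing disk or a damaged filesystem. The patterns are a non-exhaustive
# assumption chosen for illustration.
disk_error_lines() {
  grep -iE 'I/O error|medium error|EXT4-fs error|XFS.*corrupt'
}

# Usage on the TiKV host (may require root):
#   dmesg -T | disk_error_lines
#   smartctl -H /dev/<device>   # overall SMART health; <device> is a placeholder
```

An empty result does not prove the disk is healthy. It only means that none of these particular messages were logged.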

translator_bot · June 21, 2024, 11:35pm

| username: Jolyne | Original post link

Yes, I am following this.

Could store_id 1 be the broken one, since all the error messages point to it?

translator_bot · June 21, 2024, 11:35pm

| username: tidb菜鸟一只 | Original post link

Are there many damaged SST files?

translator_bot · June 21, 2024, 11:35pm

| username: Jolyne | Original post link

No, I saw only one log message about an SST file, but the failure message above kept being printed.

translator_bot · June 21, 2024, 11:35pm

| username: tidb菜鸟一只 | Original post link

No, I mean: does running the command tikv-ctl --data-dir </path/to/tikv> bad-ssts --pd <endpoint> print many corrupted SST files?
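
Since this scan can run for a long time, it may help to run it under a time limit and save whatever it prints. This is a minimal sketch, assuming GNU coreutils timeout is available and the TiKV instance on that node is stopped. The function name scan_bad_ssts, the DRY_RUN switch, the default limit and the log file name are all illustrative. The data directory and PD endpoint are placeholders.

```shell
# scan_bad_ssts DATA_DIR PD_ENDPOINT [TIME_LIMIT] [LOG_FILE]
# Runs the bad-ssts scan under a time limit and saves its output to LOG_FILE.
# With DRY_RUN=1 the command is only printed, not executed.
scan_bad_ssts() {
  local data_dir="$1" pd_endpoint="$2" limit="${3:-2h}" log="${4:-bad-ssts.log}"
  local status
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "timeout $limit tikv-ctl --data-dir $data_dir bad-ssts --pd $pd_endpoint"
    return 0
  fi
  timeout "$limit" tikv-ctl --data-dir "$data_dir" bad-ssts --pd "$pd_endpoint" >"$log" 2>&1
  status=$?
  if [ "$status" -eq 124 ]; then
    # GNU timeout exits with 124 when the time limit is reached.
    echo "scan did not finish within $limit; partial output is in $log" >&2
  fi
  return "$status"
}

# Example (placeholders, not values from this thread):
#   scan_bad_ssts </path/to/tikv> <endpoint> 2h
```

Keeping the output in a file makes it easy to paste the exact result into the thread, including any suggested repair operations.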

translator_bot · June 21, 2024, 11:35pm

| username: Jolyne | Original post link

This check has been stuck for an hour without any response. Is this normal?

translator_bot · June 21, 2024, 11:35pm

| username: tidb菜鸟一只 | Original post link

That probably means a lot of them are broken…

translator_bot · June 21, 2024, 11:35pm

| username: Jolyne | Original post link

This… under what circumstances would a large number of SST files become corrupted?

translator_bot · June 21, 2024, 11:35pm

| username: Jolyne | Original post link

There seem to be only 2 corrupted files. Is the normal way to recover from this also to repair the SST files?

translator_bot · June 21, 2024, 11:35pm

| username: redgame | Original post link

You can only try to repair it.

translator_bot · June 21, 2024, 11:35pm

| username: Jolyne | Original post link

How do I fix this? I know the file is damaged, but this error seems to be about an incompatibility.

translator_bot · June 21, 2024, 11:35pm

| username: tidb菜鸟一只 | Original post link

Your error doesn’t look right; normally the command would generate a repair command. How many TiKV nodes and how many replicas does your cluster have?
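
The node and replica counts can be read from PD directly. This is a rough sketch, assuming pd-ctl is on the PATH (with TiUP it is typically invoked as tiup ctl:v<cluster-version> pd). It also assumes the store listing is pretty-printed JSON with one "state_name" field per line, which is an assumption about the output layout. The function names and the PD address are placeholders.

```shell
# List every store known to PD, including its id, address and state.
list_stores() {
  pd-ctl -u "$1" store
}

# Show the replication settings, including max-replicas.
show_replication() {
  pd-ctl -u "$1" config show replication
}

# Count stores in a given state (for example Up, Down or Offline)
# from a store listing read on stdin.
stores_in_state() {
  grep -c "\"state_name\": *\"$1\""
}

# Example (the PD address is a placeholder):
#   list_stores http://<pd-address>:2379 | stores_in_state Up
#   show_replication http://<pd-address>:2379
```

With the default of three replicas and at least three TiKV nodes, losing one store still leaves a Raft majority for each Region. That is presumably why the answer affects which recovery options are safe.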