TiKV Encountered a Crash and Cannot Start

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiKV crashed and cannot be started

| username: TiDBer_FXYGSWF8

Yesterday, while importing data with a synchronization tool, I suddenly found that the tool had been interrupted, reporting that the target end was unavailable. I then checked the cluster on the TiDB server with the tiup cluster display command and found three TiKV nodes in the Down state.
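The health check above can be sketched as a quick filter over tiup cluster display output. The cluster name pay_tidb comes from the start command later in the thread; the sample table below is made up for illustration, not the poster's real output.

```shell
# Check overall topology and health (requires a live cluster):
#   tiup cluster display pay_tidb
# Below, a minimal filter over SAMPLE output to list only Down TiKV stores.
cat <<'EOF' > /tmp/display.txt
ID                 Role  Host          Ports  Status
10.26.51.139:20160 tikv  10.26.51.139  20160  Down
10.26.51.140:20160 tikv  10.26.51.140  20160  Down
10.26.51.141:20160 tikv  10.26.51.141  20160  Down
10.26.51.142:20160 tikv  10.26.51.142  20160  Up
EOF
# Print the instance ID of every TiKV store whose status is Down.
awk '$2 == "tikv" && $NF == "Down" {print $1}' /tmp/display.txt
```

With the sample data this prints the three Down instances; against a real cluster you would pipe the live display output through the same filter.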

Subsequently, I checked the TiDB logs and found the following error:

error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.26.51.139:20160: connect: connection refused\"

Then, I logged into the problematic TiKV node and found the following error:

[FATAL] [server.rs:407] ["panic_mark_file /httx/data1/data/tikv-20160/panic_mark_file exists, there must be something wrong with the db. Do not remove the panic_mark_file and force the TiKV node to restart. Please contact TiKV maintainers to investigate the issue. If needed, use scale in and scale out to replace the TiKV node. https://docs.pingcap.com/tidb/stable/scale-tidb-using-tiup"]

After that, I repeatedly tried to start the TiDB cluster, but it was unsuccessful. For example, using the command tiup cluster start pay_tidb --wait-timeout 600 showed:

Error: failed to start tikv: failed to start: 10.26.51.139 tikv-20160.service, please check the instance's log(/httx/data1/deploy/tikv-20160/log) for more detail.: timed out waiting for port 20160 to be started after 10m0s

Could you please help me understand what the specific issue is and if there are any good solutions?

| username: xfworld | Original post link

Apart from the issue with the node instance 10.26.51.139:20160, are there any problems with the other two nodes?

What are the error logs on this node?

You can select node instances and start them one by one, excluding the faulty node, and then prioritize scaling out a replacement.
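The node-by-node start can be sketched with tiup's -N/--node flag, assuming the cluster name pay_tidb from the error message in the original post; the healthy host addresses below are placeholders. The commands are wrapped in echo so the snippet is inert; drop the echo to actually run them.

```shell
# Start only the healthy TiKV instances, one at a time, skipping the
# faulty nodes. Host addresses are placeholders for the surviving nodes.
for node in 10.26.51.142:20160 10.26.51.143:20160; do
  echo "tiup cluster start pay_tidb -N $node --wait-timeout 600"
done
# Re-check status after each start:
echo "tiup cluster display pay_tidb"
```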

If all three TiKV nodes have issues, you can only use unsafe recover to restore the node service, but this will result in data loss.
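For pre-v6.1 clusters such as the poster's v5.4.3, the classic unsafe recovery path is tikv-ctl unsafe-recover. This is a heavily hedged sketch: the store IDs 1,4,5 are placeholders for the real failed store IDs, every TiKV process must be stopped first, and the exact flags should be verified against the tikv-ctl documentation for your version. The commands are wrapped in echo so the snippet is inert.

```shell
# DANGEROUS: only after stopping every TiKV instance, and only when the
# failed stores are truly unrecoverable. Store IDs 1,4,5 are placeholders.
echo "# on EACH surviving TiKV host, with the tikv process stopped:"
echo "tikv-ctl --data-dir /httx/data1/data/tikv-20160 unsafe-recover remove-fail-stores -s 1,4,5 --all-regions"
```

After this, restart the surviving TiKV nodes and expect some data loss for regions whose majority lived on the failed stores.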

You can refer to:

| username: TiDBer_FXYGSWF8 | Original post link

Teacher, the other two machines are in the same state. What should I do to bring the cluster up now? I tried bringing the nodes up one by one, tried starting the entire cluster, and also tried systemctl start tikv-20160.service, but none of these worked.

| username: ffeenn | Original post link

Post the TiKV logs and let’s take a look.

| username: xfworld | Original post link

Then you will lose data. Follow the recovery plan I provided and execute the recovery.

| username: TiDBer_FXYGSWF8 | Original post link

Okay, teacher. These are all the logs from the day of the malfunction.

tikv.rar (4.0 MB)

| username: TiDBer_FXYGSWF8 | Original post link

Teacher, I would like to ask two questions:

  1. What caused this situation? Is it because of my batch data insertion? Understanding this can help avoid the same issue in the future.
  2. Does TiDB version 6.1.5 have significant performance improvements and greater stability compared to version 5.4.3?

| username: TiDBer_FXYGSWF8 | Original post link

Teacher, I have eight TiKV nodes, and three of them have gone down. Will data be lost even if I use the method you provided?

| username: xfworld | Original post link

If all three replicas of a region happened to sit exactly on those three failed nodes, you're out of luck; that data cannot be recovered.

The key thing is that your post didn't mention your configuration; it just said there was a problem…

Version 6.1.5 is an LTS release and implements many new features that alleviate some operational difficulties and significantly optimize performance, especially around OOM issues. As for your situation, it's still hard to determine.
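The replica arithmetic behind this answer can be sketched with a back-of-envelope calculation, assuming the default 3 replicas placed uniformly at random across the 8 stores with no placement rules (an assumption; real PD scheduling is not uniform):

```shell
# For one region: 3 replicas over 8 stores, 3 stores down.
# A region is fully lost if all 3 replicas were on down stores,
# and loses its Raft majority if 2 or more were.
awk '
function C(n, k,   r, i) { r = 1; for (i = 1; i <= k; i++) r = r * (n - i + 1) / i; return r }
BEGIN {
  total = C(8, 3)                                # 56 ways to pick 3 of 8 stores
  all3  = C(3, 3) / total                        # all 3 replicas on down stores
  maj   = (C(3, 2) * C(5, 1) + C(3, 3)) / total  # at least 2 of 3 replicas down
  printf "P(region fully lost)    = %.4f\n", all3   # -> 0.0179
  printf "P(region lost majority) = %.4f\n", maj    # -> 0.2857
}'
```

So under this simplistic model roughly 1 region in 56 would be unrecoverable, and about 29% of regions would lose their Raft majority and be unavailable until recovered.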

| username: TiDBer_FXYGSWF8 | Original post link

Sorry, I didn’t post my configuration. It’s three TiDB servers with PD installed on the TiDB servers, and then eight TiKV servers. All have 24 cores and 64GB of memory. Teacher, if I’m lucky and the data is not on those three servers, what method can I use to start the cluster?
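A sketch of the "am I lucky" check: in a real cluster the region-to-peer-store mapping would come from pd-ctl region output, but here a small made-up sample (and hypothetical failed store IDs 1, 4, 5) stands in for it. A region that has lost 2 of its 3 peers has lost its Raft majority and cannot serve requests.

```shell
# Sample region -> peer-store mapping; all values are made up.
cat <<'EOF' > /tmp/regions.txt
region_id peer_stores
101 1 4 5
102 1 2 3
103 4 5 6
104 2 3 6
EOF
# Count how many peers of each region sat on the failed stores (1, 4, 5)
# and flag any region that lost its majority.
awk 'NR > 1 {
  down = 0
  for (i = 2; i <= NF; i++) if ($i == 1 || $i == 4 || $i == 5) down++
  if (down >= 2) print "region " $1 " lost Raft majority (" down "/3 peers down)"
}' /tmp/regions.txt
```

With the sample data, regions 101 and 103 are flagged; regions 102 and 104 still have a live majority and could serve again once the cluster starts.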

| username: TiDBer_FXYGSWF8 | Original post link

Teacher, this is the failure log of one of the downed TiKV nodes.

| username: ffeenn | Original post link

Could you provide a screenshot of the cluster display? The logs show a large number of disconnections from store 7 starting from yesterday. It’s quite difficult to recover. Is this your production environment?

| username: xfworld | Original post link

I can’t answer without knowing your configuration and cluster situation.

| username: ffeenn | Original post link

I suspect that store 7 is the result of a failed offline attempt.

| username: BraveChen | Original post link

Earlier this year, in a certain environment, the logs clearly reported data corruption. I tried unsafe recover but it was unsuccessful. :innocent:

| username: TiDBer_FXYGSWF8 | Original post link


Teacher, this is it. The cluster can’t be started now.

| username: TiDBer_FXYGSWF8 | Original post link

I just don’t know why this store 7 goes offline or has issues. This is very strange. Is it because I’m initializing data? Should I also look for problems from the hardware side?
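For the hardware angle, a few standard checks on the failed hosts are worth running; this is a sketch with the commands wrapped in echo so it is inert (the device name and data path are assumptions; drop the echo to execute):

```shell
# Look for disk I/O errors, filesystem corruption, or OOM kills in the
# kernel log, check disk health, and make sure the data disk is not full.
echo "dmesg -T | grep -iE 'i/o error|ext4-fs error|xfs.*corrupt|out of memory'"
echo "smartctl -H /dev/sda"   # disk health; device name is a placeholder
echo "df -h /httx/data1"      # data path from the panic_mark_file error
```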

| username: ffeenn | Original post link

Here is the key point for you: what did you execute at this moment? Right after that, this instance restarted.

| username: TiDBer_FXYGSWF8 | Original post link

At that time, I was continuously synchronizing data; the synchronization had been running for several days, and no other operations were performed. An operations colleague had installed Zabbix monitoring, but that was added in the morning. He probably didn't do anything at this time either, because everyone had gone to eat.

| username: TiDBer_FXYGSWF8 | Original post link

Teacher, should I send you the logs from the TiKV machine where store 7 is located?