Yesterday, while I was importing data with a synchronization tool, the job was suddenly interrupted with a message that the target end was unavailable. I then checked the cluster from the TiDB server with the tiup cluster display command and found that three TiKV nodes were in the Down state.
Next, I checked the TiDB logs and found the following error:
Then, I logged into the problematic TiKV node and found the following error:
[FATAL] [server.rs:407] ["panic_mark_file /httx/data1/data/tikv-20160/panic_mark_file exists, there must be something wrong with the db. Do not remove the panic_mark_file and force the TiKV node to restart. Please contact TiKV maintainers to investigate the issue. If needed, use scale in and scale out to replace the TiKV node. https://docs.pingcap.com/tidb/stable/scale-tidb-using-tiup"]
After that, I tried several times to start the TiDB cluster, without success. For example, running tiup cluster start pay_tidb --wait-timeout 600 returned:
Error: failed to start tikv: failed to start: 10.26.51.139 tikv-20160.service, please check the instance's log(/httx/data1/deploy/tikv-20160/log) for more detail.: timed out waiting for port 20160 to be started after 10m0s
Could you please help me understand what the specific issue is and whether there is a good solution?
Teacher, the other two machines show the same error. What should I do to bring the cluster up now? I tried starting the nodes one by one, starting the entire cluster, and running systemctl start tikv-20160.service directly, but none of these worked.
If all three replicas of a Region happen to sit on exactly those three nodes, you’re out of luck.
The original post didn’t mention your configuration; it just said there was a problem…
Version 6.1.5 is an LTS release with many new features that ease some operational difficulties and significantly improve performance, especially around OOM. As for your situation, it’s still hard to judge.
Sorry, I didn’t post my configuration. There are three TiDB servers with PD co-located on them, plus eight TiKV servers. All machines have 24 cores and 64 GB of memory. Teacher, if I’m lucky and the data is not on those three servers, what method can I use to start the cluster?
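To get a rough feel for how "lucky" one would need to be with eight TiKV stores and three of them down, here is a minimal sketch. It assumes the default of three replicas per Region and that replicas are placed uniformly at random across stores, which real PD scheduling only approximates. The function name quorum_loss_fraction is hypothetical and not part of any TiDB tool.

```python
from itertools import combinations


def quorum_loss_fraction(total_stores: int, down_stores: int, replicas: int = 3) -> float:
    """Fraction of possible replica placements that lose their Raft majority.

    Assumption: every combination of `replicas` distinct stores is equally
    likely, so it does not matter which specific stores are down.
    """
    down = set(range(down_stores))      # treat the first N stores as the down ones
    majority = replicas // 2 + 1        # 2 out of 3 for the default setting
    placements = list(combinations(range(total_stores), replicas))
    # A placement loses quorum when fewer than `majority` of its replicas are alive.
    lost = sum(
        1 for placement in placements
        if sum(store not in down for store in placement) < majority
    )
    return lost / len(placements)


if __name__ == "__main__":
    # 8 TiKV stores, 3 of them down, 3 replicas per Region
    print(f"{quorum_loss_fraction(8, 3):.3f}")  # 16 of 56 placements -> 0.286
```

Under these assumptions roughly 29% of Regions would have lost their majority, so with any realistic number of Regions it is very unlikely that none of the data is affected. The Regions that kept two healthy replicas can still elect a leader, while the others would need a recovery procedure.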
Could you provide a screenshot of the tiup cluster display output? The logs show a large number of disconnections from store 7 starting yesterday. Recovery will be quite difficult. Is this your production environment?
I just don’t know why store 7 went offline or ran into issues. This is very strange. Is it because I’m initializing data? Should I also look for problems on the hardware side?
At that time, I was continuously synchronizing data; the synchronization had been running for several days and no other operations were performed. An operations colleague installed Zabbix monitoring, which should have been added that morning. He probably wasn’t doing anything at that time either, because everyone had gone to eat.