TiDB fails to start after a power outage

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 断电后tidb启动不了 (TiDB fails to start after a power outage)

| username: 星空之痕

The issue was caused by a power outage. My skills are limited, and I can’t identify the problem from the logs.

| username: xiaohetao | Original post link

The TiKV instance on 192.168.1.202 cannot start.

| username: xiaohetao | Original post link

Check the log contents under /tidb-deploy/tikv-20161/log on 192.168.1.202.
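For example, something like the following can surface recent errors (the log path is from this post; the file name assumes the default layout):

tail -n 200 /tidb-deploy/tikv-20161/log/tikv.log
grep -E 'ERROR|FATAL' /tidb-deploy/tikv-20161/log/tikv.log | tail -n 50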

| username: tidb狂热爱好者 | Original post link

tiup cluster start sandata -N 192.168.2.84:2379
Normally this brings PD up, and TiKV starts after it.
If you deploy all components on a single machine, it is recommended to increase the machine’s memory to around 64 GB, as OOM can occur during startup.
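If OOM during startup is suspected, standard Linux checks (not TiDB-specific) can confirm it, for example:

free -h
dmesg -T | grep -i 'out of memory'

Any hits in the kernel log would show which process the OOM killer terminated.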

Stopping TiDB Cluster Nodes

First, use the “tiup cluster display” command to view the TiDB cluster information. The IP:PORT shown in the ID column can be used as the node name. The process is as follows:
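With the cluster name used in this thread (sandata), that is:

tiup cluster display sandata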


Next, refer to the help information for the stop command, “tiup cluster stop -h”, to get the syntax for stopping a specific server. From the help, you can see that the -N option is what we need. The process is as follows:
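From the help text, the relevant form is roughly:

tiup cluster stop <cluster-name> -N <id>[,<id>...]

where each <id> is an IP:PORT value taken from the ID column of the display output.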

After understanding the command, proceed to stop the PD node server “192.168.2.84:2379” with the command: “tiup cluster stop sandata -N 192.168.2.84:2379”. The process is as follows:

After the stop command is successfully executed, use “tiup cluster display sandata” again to check the cluster status. You can see that the 192.168.2.84:2379 node is in the Down state. The process is as follows:
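To check just that node instead of reading the whole table, the display output can be filtered, for example:

tiup cluster display sandata | grep 2379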

Next, use the same approach, “tiup cluster stop sandata -N 192.168.2.84:20160”, to stop the TiKV node server 192.168.2.84:20160. The process is as follows:

After the command is successful, the TiKV node 192.168.2.84:20160 changes to Disconnected status. The process is as follows:

After waiting for a while, the TiKV node status will change to Down. This state does not affect the normal use of TiDB. The process is as follows:

At this point, we have completed the task of stopping the TiDB cluster node server.

Starting TiDB Cluster Nodes

Next, we will restore the stopped nodes one by one. First, use the “tiup cluster start sandata -N 192.168.2.84:20160” command to start the TiKV node. The process is as follows:
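For reference, the start command followed by a quick status check (the grep filter is just illustrative):

tiup cluster start sandata -N 192.168.2.84:20160
tiup cluster display sandata | grep 20160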


Then use the “tiup cluster start sandata -N 192.168.2.84:2379” command to start the PD node. The process is as follows:

Finally, use the “tiup cluster display sandata” command to check the TiDB cluster status. All nodes should be in the UP state. The process is as follows:
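A quick way to confirm that nothing is left in a bad state is to filter on the Status column (illustrative):

tiup cluster display sandata | grep -E 'Down|Disconnected'

No output means every node reports Up.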

Summary

Through this walkthrough, we used TiUP to stop and start individual nodes in a TiDB cluster. The process does not affect the normal operation of the other nodes and demonstrates how convenient and easy to use TiUP is.

| username: 星空之痕 | Original post link

Could you take a look at these, please?

| username: xiaohetao | Original post link

What time was the power cut off, and what time was it restarted?

| username: xiaohetao | Original post link

The tikv.log shows a system error: TiKV is unable to connect to the host.

This segment of pd.log indicates that a file cannot be found.

So TiKV cannot connect to the host, and at the same time PD cannot connect to TiKV; the PD log also reports a missing file. This may be related to system resources and needs further investigation.
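To narrow this down, the first errors after the restart in both logs are the most telling. Assuming the default deploy layout (the PD log path here is a guess based on the usual naming):

grep -iE 'error|fatal' /tidb-deploy/tikv-20161/log/tikv.log | head -n 20
grep -iE 'error|fatal' /tidb-deploy/pd-2379/log/pd.log | head -n 20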

| username: xiaohetao | Original post link

Is this a test environment?
In a typical production configuration, TiKV and PD are placed on different hosts, with three or more TiKV instances, so that an anomaly does not cause data loss or prevent the cluster from starting.
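As a rough sketch of that layout, a minimal TiUP topology file with three PD and three TiKV instances on separate hosts might look like this (all host IPs are placeholders):

cat > topology.yaml <<'EOF'
pd_servers:
  - host: 10.0.1.1
  - host: 10.0.1.2
  - host: 10.0.1.3
tikv_servers:
  - host: 10.0.1.4
  - host: 10.0.1.5
  - host: 10.0.1.6
tidb_servers:
  - host: 10.0.1.7
EOF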

| username: 星空之痕 | Original post link

Somewhere between testing and production. The new data center doesn’t have a UPS and frequently loses power. There’s also no air conditioning, so overheating might be causing shutdowns as well.

| username: tidb狂热爱好者 | Original post link

I had a detailed conversation with him in a WeChat group. An unexpected power outage on the ESXi host left writes incomplete, which corrupted the leveldb database of the PD component. They had only one PD server, which has since been rebuilt, and they have now deployed PD nodes on three servers to improve the fault tolerance of PD.
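For anyone in the same situation, adding PD nodes to an existing cluster is done with TiUP’s scale-out. A sketch, using placeholder IPs and the cluster name from this thread:

cat > scale-out-pd.yaml <<'EOF'
pd_servers:
  - host: 192.168.2.85
  - host: 192.168.2.86
EOF
tiup cluster scale-out sandata scale-out-pd.yaml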

| username: Mark | Original post link

The server room has no air conditioning… the servers have overheated and passed out :joy:

| username: xiaohetao | Original post link

:+1::+1::+1:

| username: Kongdom | Original post link

I haven’t encountered a similar situation so far. The closest case was caused by enabling the RAID card cache: the RAID card had no battery, so a power outage lost the cache contents, which in turn caused TiDB to fail to start.

| username: xuexiaogang | Original post link

What a coincidence: I also experienced a power outage recently and wrote an article about it on the forum. However, I couldn’t fix it either and ended up rebuilding the cluster.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.