TiDB server node fails to start?

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb_server 节点启动不起来?

| username: xiaoxiaozuofang

【TiDB Usage Environment】Testing
【TiDB Version】TiDB v6.1.0
【Reproduction Path】Unexpected power outage in the internal network data center
【Encountered Issues: Problem Phenomenon and Impact】
【Resource Configuration】
【Attachments: Screenshots/Logs/Monitoring】

| username: Kongdom | Original post link

It looks like a three-node mixed deployment, so this might simply be a startup timeout. Wait a while, then check the cluster status again with tiup cluster display <cluster-name>.

| username: zhanggame1 | Original post link

Check if port 4000 is occupied on the 3 TiDB server machines using the command: netstat -na | grep 4000
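A plain grep 4000 also matches ports such as 40001 and connections whose remote side is port 4000. As a rough sketch, the filter below keeps only TCP sockets in LISTEN state whose local address ends in the given port. The helper name listening_on_port is made up for illustration, and the column positions assume the usual Linux netstat -na layout (local address in column 4, state in column 6).

```shell
# Hypothetical helper: reads `netstat -na` output on stdin and prints only
# TCP LISTEN sockets whose local address ends in ":<port>".
# Exit status is 0 if at least one listener was found, 1 otherwise.
listening_on_port() {
  port="$1"
  awk -v port="$port" '
    # Assumed Linux layout: $1 = proto, $4 = local address, $6 = state.
    $1 ~ /^tcp/ && $6 == "LISTEN" && $4 ~ (":" port "$") { print; found = 1 }
    END { exit !found }
  '
}
```

Usage on each TiDB machine would look like: netstat -na | listening_on_port 4000. If a line is printed while tidb-server is supposed to be stopped, some other process is holding the port.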

| username: xfworld | Original post link

What is the error above?

| username: 春风十里 | Original post link

I see a lot of “connection refused” and “No route to host” errors for connections to several TiKV instances. Most likely the network ports are not open.
From the TiDB machine, try telnet against port 20160 on each TiKV host.
If the port is not reachable, check on both sides whether the firewall is enabled with systemctl status firewalld. If the firewall is not enabled and the connection still fails, ask the network team to investigate.
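If telnet is not installed, the same reachability test can be sketched with bash's built-in /dev/tcp redirection. This is only an illustration: can_connect and check_hosts are made-up helper names, the 3-second timeout is an arbitrary choice, and it assumes bash and the coreutils timeout command are available on the TiDB machine.

```shell
# Hypothetical helper: succeeds if a TCP connection to host:port can be
# opened within 3 seconds, fails on "connection refused", "No route to host",
# or a silent drop (timeout).
can_connect() {
  host="$1"
  port="$2"
  # Inside the child bash, $0 is the host and $1 is the port.
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# check_hosts PORT HOST...
# Prints one OK/FAIL line per host; returns non-zero if any host failed.
check_hosts() {
  port="$1"
  shift
  failed=0
  for host in "$@"; do
    if can_connect "$host" "$port"; then
      echo "OK   ${host}:${port}"
    else
      echo "FAIL ${host}:${port}"
      failed=1
    fi
  done
  return "$failed"
}
```

Run from each TiDB machine, a call such as check_hosts 20160 <tikv-host-1> <tikv-host-2> <tikv-host-3> (placeholders for your own TiKV addresses) shows which instances are unreachable. For any FAIL line, look at systemctl status firewalld on both ends.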

| username: Jellybean | Original post link

The stateful TiKV and PD in the cluster have started correctly, but only the stateless TiDB has not started. It doesn’t seem to be an internal cluster issue; it looks more like an external network configuration problem.

Check network connectivity and port access for TiDB. After the data center power outage and restart, firewall or network settings may have changed.

| username: 小龙虾爱大龙虾 | Original post link

It seems that it can’t connect to TiKV.

| username: dba远航 | Original post link

Use tiup cluster edit-config <cluster-name> to check if the TiDB configuration path is correct.

| username: tidb菜鸟一只 | Original post link

To be fair, once PD and TiKV are up, the TiDB server rarely fails to start unless 1) resources are insufficient, or 2) the network is down.

| username: Kongdom | Original post link

:yum: I encountered this once. It wasn’t due to insufficient resources; it was just slow to start. I restarted it twice in a row and it did not come up. The third restart also reported an error, but I was busy replying to a message, and when I ran display again a while later, it had started successfully.

| username: tidb菜鸟一只 | Original post link

Oh, indeed, some versions need to load statistics at startup, which can be relatively slow, but that won’t make the startup fail completely. Moreover, a mechanism was later added to abandon the loading if it takes too long and proceed directly with the startup…

| username: Kongdom | Original post link

:thinking: Yes, that’s the reason. The message above indicates a 2-minute timeout. When tiup hits the 2-minute timeout, it stops starting the remaining components but does not shut down the component that timed out. So if you run display again a while later, you may see that the component is up.
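Checking again after a while can be scripted as a small polling loop. The sketch below is an assumption-laden illustration, not tiup behavior: tidb_all_up and wait_for_tidb are made-up names, the retry count and delay are arbitrary defaults, and the parsing assumes the display table has Role in column 2 and Status in column 6 with the value Up for a healthy tidb instance.

```shell
# Hypothetical helper: reads `tiup cluster display` output on stdin and
# succeeds only if at least one tidb row exists and every tidb row is "Up".
tidb_all_up() {
  awk '
    # Assumed layout: $2 = Role, $6 = Status.
    $2 == "tidb" { seen = 1; if ($6 != "Up") down = 1 }
    END { exit !(seen && !down) }
  '
}

# wait_for_tidb CLUSTER [ATTEMPTS] [DELAY_SECONDS]
# Polls the cluster status until all TiDB instances report Up.
wait_for_tidb() {
  cluster="$1"
  attempts="${2:-10}"
  delay="${3:-30}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    if tiup cluster display "$cluster" | tidb_all_up; then
      echo "all TiDB instances are Up"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "TiDB still not Up after ${attempts} checks" >&2
  return 1
}
```

For example, wait_for_tidb <cluster-name> 10 30 would check every 30 seconds for about five minutes. If the instances are still not Up after that, slow startup is unlikely to be the whole story, and the TiDB log is the next place to look.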

| username: andone | Original post link

Try using telnet on the port to see if the port number is already in use.

| username: 像风一样的男子 | Original post link

I guess the firewall had issues after the server reboot, and the ports were blocked.

| username: oceanzhang | Original post link

The startup timed out with no obvious error other than the failed connections. Check whether there are enough resources.

| username: xiaoxiaozuofang | Original post link

Indeed, the firewall was blocking the TiKV node port, which made it unreachable.

| username: zhanggame1 | Original post link

So in the end, it really was a network connectivity problem.

| username: Kongdom | Original post link

Uh… I really didn’t expect this, it turned out to be a firewall issue.

| username: oceanzhang | Original post link

Didn’t you use tiup for deployment?

| username: Kongdom | Original post link

:thinking: It was deployed using tiup, but that should be unrelated to the timeout. If hardware resources are insufficient, any deployment method will run into errors.