After successfully deploying the TiDB cluster, it gets stuck during startup and reports an error after a long time

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb集群部署成功后,启动时卡住,过了很久报错了。

| username: 哈喽沃德

[Test Environment for TiDB] Testing
[TiDB Version] 7.4.0
[Encountered Issue: Phenomenon and Impact] Unable to start the cluster
[Resource Configuration] Single machine deployment with pseudo-cluster
[Attachments: Screenshots/Logs/Monitoring]


tiup-cluster-debug-2023-10-24-12-28-12.log (60.3 KB)

| username: 啦啦啦啦啦 | Original post link

According to the prompt, check the logs of the corresponding TiKV node.

| username: 哈喽沃德 | Original post link

There is no content in the tikv_stderr.log file under the corresponding tikv node. The content of tikv.log is as follows:
tikv.log (206.2 KB)

| username: tidb菜鸟一只 | Original post link

For a single-machine deployment, check whether the host is short on resources. Run tiup cluster display <cluster-name> to see whether PD has already crashed…

| username: zhanggame1 | Original post link

For a single-machine deployment, you need at least 12GB of memory. I’m not sure if you’ve allocated enough. If the memory is insufficient, do not deploy TiFlash, and just deploy one TiKV.
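
As a rough pre-flight check for the advice above, the sketch below compares total RAM against a minimum before you start the cluster. The function name check_mem and its arguments are made up for illustration, and the 12 GB figure is only the number quoted in this thread, not an official requirement.

```shell
#!/bin/sh
# Sketch only: check_mem is a hypothetical helper, not part of tiup.
# check_mem MIN_GB [MEMINFO_FILE]
# Prints the detected total and returns 0 if it meets the minimum, 1 if not,
# 2 if MemTotal could not be read.
check_mem() {
    min_gb="$1"
    meminfo="${2:-/proc/meminfo}"
    # MemTotal is reported in kB, e.g. "MemTotal:  4787200 kB"
    total_kb=$(awk '/^MemTotal:/ {print $2}' "$meminfo")
    if [ -z "$total_kb" ]; then
        echo "could not read MemTotal from $meminfo"
        return 2
    fi
    min_kb=$((min_gb * 1024 * 1024))
    if [ "$total_kb" -ge "$min_kb" ]; then
        echo "ok: ${total_kb} kB total, minimum ${min_kb} kB"
        return 0
    fi
    echo "insufficient: ${total_kb} kB total, minimum ${min_kb} kB"
    return 1
}
```

For example, check_mem 12 || echo "consider leaving out TiFlash and deploying one TiKV" would print the warning on a machine like the one in this thread.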

| username: Kongdom | Original post link

There is a 10-minute timeout period. If there is no response within 10 minutes, it is considered a failure. This situation may occur when there are insufficient resources for a single-machine deployment, but in reality, it has started successfully. You still need to check the logs under the corresponding TiKV node to confirm. If there are no logs, it is simply a slow startup, so please wait a bit longer.

| username: 路在何chu | Original post link

Please post the TiKV logs.

| username: 像风一样的男子 | Original post link

Single-node deployment means each component has one node. A single replica for KV is sufficient.

| username: Fly-bird | Original post link

The KV service didn’t start, try starting it manually.

| username: Kongdom | Original post link

The TiKV log appears to contain the welcome message, which suggests that it has started. Use the tiup cluster display command to check the cluster status.

| username: 哈喽沃德 | Original post link

I re-ran the command and it is no longer stuck at TiKV, but now it is stuck at the following step. Could starting the cluster be affected by the network? I am accessing it from the local machine.

| username: 像风一样的男子 | Original post link

Monitor the server resource status.

| username: 哈喽沃德 | Original post link

So the disk is sufficient, but the memory is not?

| username: 像风一样的男子 | Original post link

Bro, you’re planning to set up a TiDB with 4GB of RAM? It won’t even start. Even laptops now come with 16GB of RAM.

| username: 哈喽沃德 | Original post link

[root@hdty-dmdca log]# tiup cluster start tidb-cluster
tiup is checking updates for component cluster …
Starting component cluster: /data/components/cluster/v1.13.1/tiup-cluster start tidb-cluster
Starting cluster tidb-cluster…

  • [ Serial ] - SSHKeySet: privateKey=/data/storage/cluster/clusters/tidb-cluster/ssh/id_rsa, publicKey=/data/storage/cluster/clusters/tidb-cluster/ssh/id_rsa.pub
  • [Parallel] - UserSSH: user=tidb, host=172.16.60.94
  • [Parallel] - UserSSH: user=tidb, host=172.16.60.94
  • [Parallel] - UserSSH: user=tidb, host=172.16.60.94
  • [Parallel] - UserSSH: user=tidb, host=172.16.60.94
  • [Parallel] - UserSSH: user=tidb, host=172.16.60.94
  • [Parallel] - UserSSH: user=tidb, host=172.16.60.94
  • [Parallel] - UserSSH: user=tidb, host=172.16.60.94
  • [Parallel] - UserSSH: user=tidb, host=172.16.60.94
  • [ Serial ] - StartCluster
    Starting component pd
    Starting instance 172.16.60.94:2379
    Start instance 172.16.60.94:2379 success
    Starting component tikv
    Starting instance 172.16.60.94:20160
    Starting instance 172.16.60.94:20162
    Starting instance 172.16.60.94:20161
    Start instance 172.16.60.94:20160 success
    Start instance 172.16.60.94:20161 success
    Start instance 172.16.60.94:20162 success
    Starting component tidb
    Starting instance 172.16.60.94:4000
    Start instance 172.16.60.94:4000 success
    Starting component tiflash
    Starting instance 172.16.60.94:9000
    Start instance 172.16.60.94:9000 success
    Starting component prometheus
    Starting instance 172.16.60.94:9090
    Start instance 172.16.60.94:9090 success
    Starting component grafana
    Starting instance 172.16.60.94:3000
    Start instance 172.16.60.94:3000 success
    Starting component node_exporter
    Starting instance 172.16.60.94
    Start 172.16.60.94 success
    Starting component blackbox_exporter
    Starting instance 172.16.60.94
    Start 172.16.60.94 success
  • [ Serial ] - UpdateTopology: cluster=tidb-cluster
Started cluster tidb-cluster successfully
[root@hdty-dmdca log]#

| username: 哈喽沃德 | Original post link

After running it a few more times, it actually started up.
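
Running the start command by hand several times can be wrapped in a small retry helper. This is only a sketch: retry and its arguments are hypothetical names, and retrying does not address the underlying memory shortage discussed in this thread.

```shell
#!/bin/sh
# Sketch only: retry is a hypothetical helper, not part of tiup.
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
    max="$1"
    delay="$2"
    shift 2
    attempt=1
    while :; do
        # Stop as soon as the command succeeds.
        if "$@"; then
            return 0
        fi
        # Give up once the attempt limit is reached.
        if [ "$attempt" -ge "$max" ]; then
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

With the cluster name from this thread, a call might look like retry 3 30 tiup cluster start tidb-cluster.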

| username: tidb菜鸟一只 | Original post link

With 4GB of memory, check the memory usage after you start up. I estimate that you will soon be unable to connect to the machine. Once the memory is used up, you won’t even be able to connect remotely.

| username: 哈喽沃德 | Original post link

[root@hdty-dmdca log]# free -m
              total        used        free      shared  buff/cache   available
Mem:           4675        3904         125          50         645         437
Swap:          8191        6571        1620
[root@hdty-dmdca log]#
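
To read output like this at a glance, the sketch below condenses free -m into two numbers: available memory in MB and the percentage of swap in use. The function name mem_pressure is made up for illustration, and it assumes the column layout shown above, where "available" is the seventh field of the Mem: row.

```shell
#!/bin/sh
# Sketch only: mem_pressure is a hypothetical helper.
# Reads `free -m` output on stdin and prints available memory (MB)
# and swap usage as an integer percentage.
mem_pressure() {
    awk '
        $1 == "Mem:"  { avail = $7 }              # "available" column
        $1 == "Swap:" { total = $2; used = $3 }   # swap total and used
        END {
            pct = (total > 0) ? int(used * 100 / total) : 0
            printf "available_mb=%d swap_used_pct=%d\n", avail, pct
        }'
}
```

Typical use is free -m | mem_pressure. Low available memory combined with high swap usage suggests the host is under memory pressure.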

| username: 随缘天空 | Original post link

The machine's specs are too low. I ran into the same problem before. You can refer to this thread: Quick Start with TiDB: simulating a production cluster deployment on a single machine: startup failure - #21, by 有猫万事足 - TiDB Q&A community

| username: 哈喽沃德 | Original post link

There is still 600M available, and 6G of swap has been used.