Error when upgrading TiDB from 5.3 to 6.1, caused by the monitoring module. Can anyone help?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb5.3升级6.1时报错,监控模块引起的错误。大家帮忙看下?

| username: TiDB_New_People

An error occurred while upgrading TiDB from 5.3 to 6.1, and it appears to be caused by the monitoring module. Could anyone help take a look?

Logs are as follows:
2022-08-08T21:28:49.952+0800 INFO SSHCommand {"host": "192.168.1.3", "port": "22", "cmd": "export LANG=C; PATH=$PATH:/bin:/sbin:/usr/bin:/usr/sbin ss -ltn", "stdout": "State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 *:22 *:*
LISTEN 0 100 127.0.0.1:25 *:*
LISTEN 0 100 *:18686 *:*
LISTEN 0 40000 192.168.1.3:9093 *:*
LISTEN 0 40000 192.168.1.3:9094 *:*
LISTEN 0 128 127.0.0.1:1234 *:*
LISTEN 0 40000 [::]:12020 [::]:*
LISTEN 0 128 [::]:22 [::]:*
LISTEN 0 40000 [::]:3000 [::]:*
LISTEN 0 100 [::1]:25 [::]:*
LISTEN 0 40000 [::]:9090 [::]:*
LISTEN 0 128 [::]:40324 [::]:*
LISTEN 0 128 [::]:6123 [::]:*
LISTEN 0 128 [::]:36429 [::]:*
LISTEN 0 128 [::]:8081 [::]:*
", "stderr": ""}
2022-08-08T21:28:49.952+0800 INFO CheckPoint {"host": "192.168.1.3", "port": 22, "user": "tidb", "sudo": false, "cmd": "ss -ltn", "stdout": "State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 *:22 *:*
LISTEN 0 100 127.0.0.1:25 *:*
LISTEN 0 100 *:18686 *:*
LISTEN 0 40000 192.168.1.3:9093 *:*
LISTEN 0 40000 192.168.1.3:9094 *:*
LISTEN 0 128 127.0.0.1:1234 *:*
LISTEN 0 40000 [::]:12020 [::]:*
LISTEN 0 128 [::]:22 [::]:*
LISTEN 0 40000 [::]:3000 [::]:*
LISTEN 0 100 [::1]:25 [::]:*
LISTEN 0 40000 [::]:9090 [::]:*
LISTEN 0 128 [::]:40324 [::]:*
LISTEN 0 128 [::]:6123 [::]:*
LISTEN 0 128 [::]:36429 [::]:*
LISTEN 0 128 [::]:8081 [::]:*
", "stderr": "", "hash": "2de5b500c9fae6d418fa200ca150b8d5264d6b19", "func": "github.com/pingcap/tiup/pkg/cluster/executor.(*CheckPointExecutor).Execute", "hit": false}
2022-08-08T21:28:49.952+0800 DEBUG retry error {"error": "operation timed out after 2m0s"}
2022-08-08T21:28:49.952+0800 DEBUG setting replication config: leader-schedule-limit=4
2022-08-08T21:28:49.958+0800 DEBUG setting replication config: region-schedule-limit=2048
2022-08-08T21:28:49.963+0800 DEBUG TaskFinish {"task": "UpgradeCluster", "error": "failed to start: 192.168.1.3 node_exporter-9100.service, please check the instance's log() for more detail.: timed out waiting for port 9100 to be started after 2m0s", "errorVerbose": "timed out waiting for port 9100 to be started after 2m0s
github.com/pingcap/tiup/pkg/cluster/module.(*WaitFor).Execute
\tgithub.com/pingcap/tiup/pkg/cluster/module/wait_for.go:91
github.com/pingcap/tiup/pkg/cluster/spec.PortStarted
\tgithub.com/pingcap/tiup/pkg/cluster/spec/instance.go:116
github.com/pingcap/tiup/pkg/cluster/operation.systemctlMonitor.func1
\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:335
golang.org/x/sync/errgroup.(*Group).Go.func1
\tgolang.org/x/sync@v0.0.0-20220513210516-0976fa681c29/errgroup/errgroup.go:74
runtime.goexit
\truntime/asm_amd64.s:1571
failed to start: 192.168.1.3 node_exporter-9100.service, please check the instance's log() for more detail."}
2022-08-08T21:28:49.963+0800 INFO Execute command finished {"code": 1, "error": "failed to start: 192.168.1.3 node_exporter-9100.service, please check the instance's log() for more detail.: timed out waiting for port 9100 to be started after 2m0s", "errorVerbose": "timed out waiting for port 9100 to be started after 2m0s
github.com/pingcap/tiup/pkg/cluster/module.(*WaitFor).Execute
\tgithub.com/pingcap/tiup/pkg/cluster/module/wait_for.go:91
github.com/pingcap/tiup/pkg/cluster/spec.PortStarted
\tgithub.com/pingcap/tiup/pkg/cluster/spec/instance.go:116
github.com/pingcap/tiup/pkg/cluster/operation.systemctlMonitor.func1
\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:335
golang.org/x/sync/errgroup.(*Group).Go.func1
\tgolang.org/x/sync@v0.0.0-20220513210516-0976fa681c29/errgroup/errgroup.go:74
runtime.goexit
\truntime/asm_amd64.s:1571
failed to start: 192.168.1.3 node_exporter-9100.service, please check the instance's log() for more detail."}

| username: banana_jian

Check whether port 9100 is already in use, review the firewall policies, and look at the logs on the 192.168.1.3 node.
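
The port check can be scripted, as a minimal sketch: the helper below reads ss -ltn output on stdin and reports whether a given port appears as a listening local address. The function name port_is_listening is made up for illustration and is not part of tiup.

```shell
# Hypothetical helper: exit 0 if TCP port $1 appears as a listening
# local address in `ss -ltn` output read from stdin, exit 1 otherwise.
port_is_listening() {
    port="$1"
    awk -v p=":$port" '
        # Column 4 of `ss -ltn` is "Local Address:Port"; compare its suffix
        # so that port 910 does not match an address ending in :9100.
        $1 == "LISTEN" && substr($4, length($4) - length(p) + 1) == p { found = 1 }
        END { exit(found ? 0 : 1) }
    '
}
```

On the affected node, `ss -ltn | port_is_listening 9100` would answer the port question. If nothing is listening, `systemctl status node_exporter-9100.service` and `journalctl -u node_exporter-9100.service` are the usual places to see why the unit named in the error did not come up.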

| username: TiDB_New_People

The firewall has been turned off.

| username: TiDB_New_People

I copied the missing 9100-related files from a healthy node to the problematic node. This looks like a bug in the tiup upgrade.

| username: TiDB_New_People

  1. Checking the system log with cd /var/log && tail -f -n 1000 messages showed that the problem was with node_exporter-9100.service. The port was not occupied; the service simply had not started.
  2. I copied the monitor-9100 directory from the deployment directory of a healthy node to the problematic node.
  3. I re-ran the upgrade, and it succeeded.
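
To see what exactly is missing before copying anything, a small comparison helper can be used. This is only a sketch: list_missing_files is a hypothetical name, and it assumes a copy of the healthy node's monitor-9100 directory has already been fetched to the problematic node (for example with scp -r or rsync -a).

```shell
# Hypothetical helper: print every file (relative path) that exists under
# the reference directory $1 but not under the target directory $2.
list_missing_files() {
    ref_dir="$1"
    target_dir="$2"
    # List files relative to the reference directory, then test each one
    # against the target directory.
    (cd "$ref_dir" && find . -type f | sort) | while IFS= read -r rel; do
        [ -e "$target_dir/$rel" ] || printf '%s\n' "${rel#./}"
    done
}
```

For example, `list_missing_files /tmp/monitor-9100.healthy /tidb-deploy/monitor-9100` would print the files to copy; both paths are placeholders and depend on the cluster's deploy directory. Once the files are in place, the upgrade can be retried with `tiup cluster upgrade <cluster-name> <version>`.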

| username: jansu-dev

The message "timed out waiting for port 9100 to be started" indicates that the command to start the systemd service was probably issued, but port 9100 did not come up within the wait period. The specific reason for the timeout is unknown. However, it seems the issue has been resolved, and the workaround is correct.

Is the root cause of the problem that the binary and related directories were not generated, or that the process did not start even though they were generated?
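
The two cases in that question can be told apart with a quick check on the node. The sketch below is illustrative only: classify_exporter_state is a hypothetical name, and the binary path has to be supplied by the caller because it depends on the deploy directory.

```shell
# Hypothetical helper: classify the exporter's state.
#   $1: path to the node_exporter binary (depends on the deploy directory)
#   $2: "yes" if port 9100 is currently listening, anything else otherwise
classify_exporter_state() {
    binary="$1"
    listening="$2"
    if [ ! -e "$binary" ]; then
        echo "binary-missing"               # files were never generated
    elif [ ! -x "$binary" ]; then
        echo "binary-not-executable"        # present but cannot be run
    elif [ "$listening" = "yes" ]; then
        echo "listening"                    # something is serving the port
    else
        echo "installed-but-not-listening"  # generated, but did not start
    fi
}
```

A result of binary-missing points to the files not having been generated, while installed-but-not-listening points to a start failure, in which case the systemd journal for node_exporter-9100.service is the next place to look.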

For now, let’s mark this as resolved. If there are any other issues, please leave a message.
