Upgrade from V5 to V7 Interrupted Because node_exporter Failed to Start

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: V5版本升级到V7,因为node_export failed to start而中断upgrade流程

| username: dockerfile

【TiDB Usage Environment】Production Environment
【TiDB Version】Upgraded from V5 to V7.1.2
【Reproduction Path】Both tiup cluster upgrade and tiup cluster restart -R prometheus reproduce the issue
【Encountered Issue: Problem Phenomenon and Impact】

1. During the cluster upgrade, almost all components were upgraded successfully, but the upgrade failed while restarting node_exporter on one of the nodes, causing the upgrade process to exit.
2. In the Dashboard, the cluster version and the tidb, pd, and tikv versions are all shown as V7.
3. tiup cluster display still shows the cluster version as V5.
4. Repeated attempts: tiup cluster restart -R prometheus always ends with a node_exporter timeout.

Error: failed to start: failed to start: a.b.c.x7 node_exporter-9100.service, please check the instance's log() for more detail.: timed out waiting for port 9100 to be started after 2m0s

5. The final error during the upgrade is as follows:

Upgrading component tidb
        Restarting instance a.b.c.x1:4000
        Restart instance a.b.c.x1:4000 success
        Restarting instance a.b.c.x2:4000
        Restart instance a.b.c.x2:4000 success
        Restarting instance a.b.c.x1:4071
        Restart instance a.b.c.x1:4071 success
Upgrading component prometheus
        Restarting instance a.b.c.x7:9090
        Restart instance a.b.c.x7:9090 success
Upgrading component grafana
        Restarting instance a.b.c.x7:3000
        Restart instance a.b.c.x7:3000 success
Upgrading component alertmanager
        Restarting instance a.b.c.x7:9093
        Restart instance a.b.c.x7:9093 success
Stopping component node_exporter
        Stopping instance a.b.c.x1
        Stopping instance a.b.c.x1
        Stopping instance a.b.c.x2
        Stopping instance a.b.c.x4
        Stopping instance a.b.c.x7
        Stopping instance a.b.c.x0
        Stopping instance a.b.c.x6
        Stopping instance a.b.c.x5
        Stopping instance a.b.c.x8
        Stopping instance a.b.c.x3
        Stop a.b.c.x3 success
        Stop a.b.c.x5 success
        Stop a.b.c.x8 success
        Stop a.b.c.x7 success
        Stop a.b.c.x6 success
        Stop a.b.c.x2 success
        Stop a.b.c.x1 success
        Stop a.b.c.x4 success
        Stop a.b.c.x1 success
        Stop a.b.c.x0 success
Stopping component blackbox_exporter
        Stopping instance a.b.c.x1
        Stopping instance a.b.c.x2
        Stopping instance a.b.c.x5
        Stopping instance a.b.c.x0
        Stopping instance a.b.c.x1
        Stopping instance a.b.c.x7
        Stopping instance a.b.c.x3
        Stopping instance a.b.c.x4
        Stopping instance a.b.c.x8
        Stopping instance a.b.c.x6
        Stop a.b.c.x5 success
        Stop a.b.c.x3 success
        Stop a.b.c.x8 success
        Stop a.b.c.x7 success
        Stop a.b.c.x6 success
        Stop a.b.c.x4 success
        Stop a.b.c.x2 success
        Stop a.b.c.x1 success
        Stop a.b.c.x1 success
        Stop a.b.c.x0 success
Starting component node_exporter
        Starting instance a.b.c.x4
        Starting instance a.b.c.x5
        Starting instance a.b.c.x8
        Starting instance a.b.c.x0
        Starting instance a.b.c.x6
        Starting instance a.b.c.x2
        Starting instance a.b.c.x7
        Starting instance a.b.c.x1
        Starting instance a.b.c.x3
        Starting instance a.b.c.x1
        Start a.b.c.x5 success
        Start a.b.c.x3 success
        Start a.b.c.x8 success
        Start a.b.c.x1 success
        Start a.b.c.x6 success
        Start a.b.c.x4 success
        Start a.b.c.x2 success
        Start a.b.c.x1 success
        Start a.b.c.x0 success

Error: failed to start: a.b.c.x7 node_exporter-9100.service, please check the instance's log() for more detail.: timed out waiting for port 9100 to be started after 2m0s

Manually running systemctl restart node_exporter-9100 on node a.b.c.x7 works without issues.
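
For reference, a quick way to confirm that the manually restarted service is actually healthy (a sketch, run on the affected host):

systemctl status node_exporter-9100
ss -lntp | grep 9100    # the port should show up in LISTEN state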

Suspected Cause: The cluster was historically deployed with ansible (V2 through V3), then managed with tiup (V4, V5, and now V7), and the bin and script directories on the failing node (x7) are not standardized.

| username: dba远航 | Original post link

Haven’t encountered it, learning about it.

| username: 江湖故人 | Original post link

If there are no other logs to analyze, you can compare the file directories with those of a freshly deployed v5 cluster. Your upgrade path is fairly complex, and it’s unclear whether any bugs are involved. If it were me, I might start preparing for migration and redeployment.
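
A minimal sketch of such a comparison, with placeholder paths (the real deploy directories depend on the topology):

diff -rq /placeholder/old-cluster/deploy/monitor-9100 /placeholder/fresh-v5/deploy/monitor-9100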

| username: dockerfile | Original post link

I was wondering whether it’s possible to uninstall and redeploy the node_exporter on this node. How would I do that?

The upgrade ultimately failed. Although the cluster status is normal and the core components are already on V7, I still feel the tiup upgrade didn’t run through to the end; I never saw the final success message, which feels quite awkward.

| username: zhanggame1 | Original post link

The monitoring components can be removed and added back later.

| username: 江湖故人 | Original post link

You can scale in Prometheus and then scale it out again.
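
A minimal sketch of that approach with tiup (cluster name and topology file are placeholders; note that, as mentioned below, this would not preserve the existing monitoring data):

tiup cluster scale-in <cluster-name> -N a.b.c.x7:9090
# scale-out.yaml (placeholder) would contain a monitoring_servers entry for the target host
tiup cluster scale-out <cluster-name> scale-out.yaml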

| username: 江湖故人 | Original post link

Didn’t post #3 above mention that tiup display still shows V5?

| username: dockerfile | Original post link

That approach is not feasible for me because I want to retain the Prometheus monitoring data.
Restarting Prometheus itself should be fine; the key issue is that node_exporter is stuck on one of the nodes. Can I just reinstall that one?

| username: WalterWj | Original post link

Either manually fix node_exporter on the affected node, or modify the configuration in the cluster metadata so that node_exporter on that node is skipped.


I looked at the code; try configuring this setting for all components on this node.
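
A rough sketch of the second option, assuming the per-instance switch referred to above is ignore_exporter (my assumption, not confirmed here):

tiup cluster edit-config <cluster-name>
# in the editor, under each instance entry on host a.b.c.x7, add the
# per-instance switch that disables the exporters (assumed field name):
#   ignore_exporter: true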

| username: dockerfile | Original post link

Is it possible to reinstall or migrate Prometheus?

| username: dockerfile | Original post link

I spent a whole day and finally solved it.

Phenomenon: restarting Prometheus with tiup gets stuck at the node_exporter restart step and times out.

Error: no anomalies in the node_exporter logs; the node_exporter process exists, but port 9100 never starts listening.

Troubleshooting: checking the system log /var/log/messages, I found a lot of messages like these:

Jan  8 20:26:57 xxxxx systemd-logind: Failed to start user slice user-0.slice, ignoring: The maximum number of pending replies per connection has been reached (org.freedesktop.DBus.Error.LimitsExceeded)
Jan  8 20:26:57 xxxxx systemd-logind: Failed to start session scope session-c68372803.scope: The maximum number of pending replies per connection has been reached
Jan  8 20:26:58 xxxxx systemd-logind: Failed to start user slice user-0.slice, ignoring: The maximum number of pending replies per connection has been reached (org.freedesktop.DBus.Error.LimitsExceeded)
Jan  8 20:26:58 xxxxx systemd-logind: Failed to start session scope session-c68372804.scope: The maximum number of pending replies per connection has been reached
Jan  8 20:26:59 xxxxx systemd-logind: Failed to start user slice user-0.slice, ignoring: The maximum number of pending replies per connection has been reached (org.freedesktop.DBus.Error.LimitsExceeded)
Jan  8 20:26:59 xxxxx systemd-logind: Failed to start session scope session-c68372805.scope: The maximum number of pending replies per connection has been reached
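
If anyone wants to reproduce this check, the lines above can be pulled out of the system log with something like the following (journalctl works as well on a systemd host):

grep -i 'LimitsExceeded' /var/log/messages | tail -n 20
journalctl -u systemd-logind | grep -i 'pending replies' | tail -n 20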

Searching for this error, I found a Q&A on the Red Hat website.

Finally, I executed the command:

systemctl daemon-reexec

After that, node_exporter started successfully, port 9100 was listening normally, and restarting Prometheus with tiup again went smoothly.

Then I used tiup cluster replay to resume the upgrade from the point where it had been interrupted.

tiup cluster replay xxxx
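
For anyone resuming an interrupted operation this way, the ID to replay can be looked up from the operation history first (a sketch; the audit ID is a placeholder):

tiup cluster audit                  # list recent operations and their IDs
tiup cluster replay <audit-id>      # retry the failed operation from its breakpoint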

During the replay, the process kept getting stuck, so I repeatedly ran daemon-reload and stopped/started the exporters on all nodes:

systemctl daemon-reload
systemctl stop/start blackbox_exporter
systemctl stop/start node_exporter-9100

Eventually, I completed the entire upgrade process:

Upgraded cluster `xxxxxx` successfully

tiup display:

[root@tidbxxx ~]# tiup cluster display xxxxx
tiup is checking updates for component cluster ...
Starting component `cluster`: /root/.tiup/components/cluster/v1.14.0/tiup-cluster display xxxxx 
Cluster type:       tidb
Cluster name:      xxxxx
Cluster version:    v7.1.2

| username: TIDB-Learner | Original post link

The command systemctl daemon-reexec re-executes the systemd daemon. This is related to issues in the Linux system itself, such as socket/D-Bus problems. The issue was resolved here, but different people hitting the same problem might not necessarily resolve it the same way.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.