Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: Scaling out TiKV, node_exporter not installed successfully
[TiDB Usage Environment] Production Environment
[TiDB Version] 4013
[Reproduction Path] What operations were performed when the issue occurred
Scaling out TiKV
[Encountered Issue: Issue Phenomenon and Impact]
The TiKV scale-out succeeded, but the node_exporter component failed to install and reported an error on startup.
Error: failed to start: tikv 10.106.17.71:20160, please check the instance’s log(/data1/data/tidb/tidb-deploy/tikv-20160/log) for more detail.: timed out waiting for port 9115 to be started after 2m0s
Check if the port is occupied. Then check if there are any anomalies in the system logs of 10.106.17.71.
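A quick way to do that, assuming the host runs systemd/journald and uses the default unit names that TiUP creates for the monitoring agents (they also appear later in this thread):

journalctl -p err --since "1 hour ago"
journalctl -u blackbox_exporter-9115 -u node_exporter-9100 --since "1 hour ago"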
Please provide the detailed logs.
The port is probably occupied.
You can try using netstat or ss to check if port 9115 is listening.
Check if port 9115 is in use, and then review the detailed logs.
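For reference, either of these will show whether anything is already listening on the monitoring ports (by default 9115 is blackbox_exporter and 9100 is node_exporter), including the owning process:

ss -lntp | grep -E ':(9115|9100)'
netstat -antpl | grep -E '9115|9100'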
Check the log files in that directory.
There’s nothing in this log.
There is such a process, and it can’t be killed. Is it a leftover from before?
Just delete its binary directly.
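If the leftover exporter was registered as a systemd service (the journal excerpts below show the usual TiUP unit names), killing the process alone won't help because systemd keeps restarting it. A sketch of stopping it cleanly, assuming those unit names; the scale-in approach suggested further down is the more complete fix:

systemctl stop blackbox_exporter-9115 node_exporter-9100
systemctl disable blackbox_exporter-9115 node_exporter-9100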
Still encountering an error during the scale-out:
2024-03-01T02:28:32.944Z DEBUG TaskFinish {“task”: “StartCluster”, “error”: “failed to start: tikv 10.106.17.71:20166, please check the instance’s log(/data1/tidb/tidb-deploy/tikv-20166/log) for more detail.: timed out waiting for port 9100 to be started after 2m0s”, “errorVerbose”: “timed out waiting for port 9100 to be started after 2m0s\ngithub.com/pingcap/tiup/pkg/cluster/module.(*WaitFor).Execute\n\tgithub.com/pingcap/tiup@/pkg/cluster/module/wait_for.go:90\ngithub.com/pingcap/tiup/pkg/cluster/spec.PortStarted\n\tgithub.com/pingcap/tiup@/pkg/cluster/spec/instance.go:103\ngithub.com/pingcap/tiup/pkg/cluster/operation.StartMonitored\n\tgithub.com/pingcap/tiup@/pkg/cluster/operation/action.go:218\ngithub.com/pingcap/tiup/pkg/cluster/operation.Start\n\tgithub.com/pingcap/tiup@/pkg/cluster/operation/action.go:93\ngithub.com/pingcap/tiup/pkg/cluster/manager.buildScaleOutTask.func6\n\tgithub.com/pingcap/tiup@/pkg/cluster/manager/builder.go:305\ngithub.com/pingcap/tiup/pkg/cluster/task.(*Func).Execute\n\tgithub.com/pingcap/tiup@/pkg/cluster/task/func.go:32\ngithub.com/pingcap/tiup/pkg/cluster/task.(*Serial).Execute\n\tgithub.com/pingcap/tiup@/pkg/cluster/task/task.go:196\ngithub.com/pingcap/tiup/pkg/cluster/manager.(*Manager).ScaleOut\n\tgithub.com/pingcap/tiup@/pkg/cluster/manager/scale_out.go:143\ngithub.com/pingcap/tiup/components/cluster/command.newScaleOutCmd.func1\n\tgithub.com/pingcap/tiup@/components/cluster/command/scale_out.go:54\ngithub.com/spf13/cobra.(*Command).execute\n\tgithub.com/spf13/cobra@v1.0.0/command.go:842\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tgithub.com/spf13/cobra@v1.0.0/command.go:950\ngithub.com/spf13/cobra.(*Command).Execute\n\tgithub.com/spf13/cobra@v1.0.0/command.go:887\ngithub.com/pingcap/tiup/components/cluster/command.Execute\n\tgithub.com/pingcap/tiup@/components/cluster/command/root.go:247\nmain.main\n\tgithub.com/pingcap/tiup@/components/cluster/main.go:23\nruntime.main\n\truntime/proc.go:203\nruntime.goexit\n\truntime/asm_amd64.s:1357\nfailed to start: tikv 10.106.17.71:20166, please check the instance’s log(/data1/tidb/tidb-deploy/tikv-20166/log) for more detail.”}
I checked that the ports are not occupied:
# netstat -antpl | grep 9115
# netstat -antpl | grep 9100
The system log still shows the service from the previous deployment directory being invoked:
Stopped Nightingale collector.
Started Nightingale collector.
n9e-collector.service: Failed at step CHDIR spawning /data/services/n9e/n9e-collector: No such file or directory
n9e-collector.service: Main process exited, code=exited, status=200/CHDIR
n9e-collector.service: Unit entered failed state.
n9e-collector.service: Failed with result 'exit-code'.
n9e-collector.service: Service hold-off time over, scheduling restart.
Stopped Nightingale collector.
Started Nightingale collector.
n9e-collector.service: Failed at step CHDIR spawning /data/services/n9e/n9e-collector: No such file or directory
n9e-collector.service: Main process exited, code=exited, status=200/CHDIR
n9e-collector.service: Unit entered failed state.
n9e-collector.service: Failed with result 'exit-code'.
blackbox_exporter-9115.service: Service hold-off time over, scheduling restart.
node_exporter-9100.service: Service hold-off time over, scheduling restart.
Stopped node_exporter service.
If the scaled-out TiKV instance is visible in tiup cluster display, scale it in with --force and then scale it out again. Directly deleting the node_exporter binary is not the right fix: node_exporter is set up as a system service, so if you want to remove it you have to remove it completely, which is why scaling in is the recommended approach. Also, if the new machine already hosts components of another cluster, node_exporter will run into a port conflict; when two clusters are mixed on one machine, you need to change the default node_exporter port for one of them.
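A sketch of that flow, assuming the cluster is named tidb-prod and the scale-out topology file is scale-out.yaml (both names are placeholders):

tiup cluster scale-in tidb-prod --node 10.106.17.71:20160 --force
tiup cluster scale-out tidb-prod scale-out.yaml

If another cluster's exporters already own the default ports on that host, the monitoring ports can be overridden in the topology's monitored section, for example:

monitored:
  node_exporter_port: 9101
  blackbox_exporter_port: 9116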
I plan to scale in the problematic TiKV node completely and then redeploy it to see if that works.
Was the environment not cleaned up properly? Use the tiup cluster prune command to clean up the invalid (Tombstone) nodes.
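For reference (the cluster name is a placeholder):

tiup cluster display tidb-prod   # look for instances left in Tombstone status
tiup cluster prune tidb-prod     # clean up Tombstone instances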
Scaled in and restarted.
I have run it. It turns out the original disk had been unmounted directly, and the exporter binaries were on that disk. During the scale-out, TiUP assumed the exporters were already installed, so it didn't reinstall them and kept reporting errors.
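A way to confirm that situation on the host, assuming the common layout where TiUP places the monitoring agents under a monitor-9100 directory next to the deploy directory (the exact path below is an assumption):

systemctl cat node_exporter-9100          # unit file still registered from the old deployment
ls /data1/tidb/tidb-deploy/monitor-9100   # binaries missing if they lived on the unmounted disk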
I directly scaled out the TiKV on other machines, scaled in all the TiKV instances on this machine, and then scaled out again.
If the issue has been resolved, please select an answer and mark it as the best answer to close the topic.