Scaling out TiKV: node_exporter failed to install

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 扩容tikv,node_exporter未成功安装

| username: 路在何chu

[TiDB Usage Environment] Production Environment
[TiDB Version] 4013
[Reproduction Path] What operations were performed when the issue occurred
Scaling out TiKV
[Encountered Issue: Issue Phenomenon and Impact]
Scaling out TiKV succeeded, but the node_exporter component failed to install and errored on startup:
Error: failed to start: tikv 10.106.17.71:20160, please check the instance's log(/data1/data/tidb/tidb-deploy/tikv-20160/log) for more detail.: timed out waiting for port 9115 to be started after 2m0s
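
Port 9115 is the default blackbox_exporter port, so the timeout points at the monitoring agent rather than TiKV itself. A minimal first check on the target host might look like this (the unit name follows tiup's `<component>-<port>.service` convention, as seen in the journal excerpt later in this thread):

```shell
# On 10.106.17.71: see why the exporter did not come up in time.
systemctl status blackbox_exporter-9115.service
journalctl -u blackbox_exporter-9115.service --since "10 minutes ago"
```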

| username: caiyfc | Original post link

Check whether the port is occupied, and look for anything unusual in the system logs on 10.106.17.71.

| username: 小于同学 | Original post link

Please provide the detailed logs.

| username: 小于同学 | Original post link

The port is probably occupied.

| username: TiDBer_ivan0927 | Original post link

You can try using netstat or ss to check if port 9115 is listening.
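
A minimal check, assuming the default monitoring ports, might be:

```shell
# Is anything already listening on the exporter ports?
# 9100 is node_exporter's default, 9115 is blackbox_exporter's.
ss -lntp | grep -E '9100|9115'

# Or, where only netstat is available:
netstat -lntp | grep -E '9100|9115'
```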

| username: dba远航 | Original post link

Check if port 9115 is in use, and then review the detailed logs.

| username: onlyacat | Original post link

Go check the logs in that directory.

| username: 路在何chu | Original post link

There’s nothing in this log.

| username: 路在何chu | Original post link

There is such a process, and it can’t be killed. Is it a leftover from before?
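
If it is a leftover from an earlier deployment, a plain kill often won't work: when the exporter runs as a systemd unit, systemd restarts it immediately. A sketch of tracking it down (unit names follow the `<component>-<port>.service` pattern that appears in the journal excerpt later in this thread):

```shell
# Find the leftover exporter processes and who spawned them.
ps -ef | grep -E 'node_exporter|blackbox_exporter' | grep -v grep

# If they are managed by systemd, stop and disable the units instead of kill -9.
systemctl stop node_exporter-9100.service blackbox_exporter-9115.service
systemctl disable node_exporter-9100.service blackbox_exporter-9115.service
```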

| username: 路在何chu | Original post link

I just deleted its command (the binary) directly.

| username: 路在何chu | Original post link

Still encountering an error during scale-out:

2024-03-01T02:28:32.944Z DEBUG TaskFinish {"task": "StartCluster", "error": "failed to start: tikv 10.106.17.71:20166, please check the instance's log(/data1/tidb/tidb-deploy/tikv-20166/log) for more detail.: timed out waiting for port 9100 to be started after 2m0s", "errorVerbose": "timed out waiting for port 9100 to be started after 2m0s\ngithub.com/pingcap/tiup/pkg/cluster/module.(*WaitFor).Execute\n\tgithub.com/pingcap/tiup@/pkg/cluster/module/wait_for.go:90\ngithub.com/pingcap/tiup/pkg/cluster/spec.PortStarted\n\tgithub.com/pingcap/tiup@/pkg/cluster/spec/instance.go:103\ngithub.com/pingcap/tiup/pkg/cluster/operation.StartMonitored\n\tgithub.com/pingcap/tiup@/pkg/cluster/operation/action.go:218\ngithub.com/pingcap/tiup/pkg/cluster/operation.Start\n\tgithub.com/pingcap/tiup@/pkg/cluster/operation/action.go:93\ngithub.com/pingcap/tiup/pkg/cluster/manager.buildScaleOutTask.func6\n\tgithub.com/pingcap/tiup@/pkg/cluster/manager/builder.go:305\ngithub.com/pingcap/tiup/pkg/cluster/task.(*Func).Execute\n\tgithub.com/pingcap/tiup@/pkg/cluster/task/func.go:32\ngithub.com/pingcap/tiup/pkg/cluster/task.(*Serial).Execute\n\tgithub.com/pingcap/tiup@/pkg/cluster/task/task.go:196\ngithub.com/pingcap/tiup/pkg/cluster/manager.(*Manager).ScaleOut\n\tgithub.com/pingcap/tiup@/pkg/cluster/manager/scale_out.go:143\ngithub.com/pingcap/tiup/components/cluster/command.newScaleOutCmd.func1\n\tgithub.com/pingcap/tiup@/components/cluster/command/scale_out.go:54\ngithub.com/spf13/cobra.(*Command).execute\n\tgithub.com/spf13/cobra@v1.0.0/command.go:842\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tgithub.com/spf13/cobra@v1.0.0/command.go:950\ngithub.com/spf13/cobra.(*Command).Execute\n\tgithub.com/spf13/cobra@v1.0.0/command.go:887\ngithub.com/pingcap/tiup/components/cluster/command.Execute\n\tgithub.com/pingcap/tiup@/components/cluster/command/root.go:247\nmain.main\n\tgithub.com/pingcap/tiup@/components/cluster/main.go:23\nruntime.main\n\truntime/proc.go:203\nruntime.goexit\n\truntime/asm_amd64.s:1357\nfailed to start: tikv 10.106.17.71:20166, please check the instance's log(/data1/tidb/tidb-deploy/tikv-20166/log) for more detail."}

| username: 路在何chu | Original post link

I checked, and the ports are not occupied:

# netstat -antpl | grep 9115
# netstat -antpl | grep 9100

| username: 路在何chu | Original post link

The system log is still showing errors about the previous directory:
Stopped Nightingale collector.
Started Nightingale collector.
n9e-collector.service: Failed at step CHDIR spawning /data/services/n9e/n9e-collector: No such file or directory
n9e-collector.service: Main process exited, code=exited, status=200/CHDIR
n9e-collector.service: Unit entered failed state.
n9e-collector.service: Failed with result 'exit-code'.
n9e-collector.service: Service hold-off time over, scheduling restart.
Stopped Nightingale collector.
Started Nightingale collector.
n9e-collector.service: Failed at step CHDIR spawning /data/services/n9e/n9e-collector: No such file or directory
n9e-collector.service: Main process exited, code=exited, status=200/CHDIR
n9e-collector.service: Unit entered failed state.
n9e-collector.service: Failed with result 'exit-code'.
blackbox_exporter-9115.service: Service hold-off time over, scheduling restart.
node_exporter-9100.service: Service hold-off time over, scheduling restart.
Stopped node_exporter service.
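
Those n9e-collector lines come from a Nightingale agent whose working directory no longer exists, so systemd keeps restarting it in a loop. Assuming that collector is no longer needed on this host, a sketch of silencing it:

```shell
# Stop the flapping Nightingale collector and keep it from restarting.
systemctl stop n9e-collector.service
systemctl disable n9e-collector.service
```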

| username: caiyfc | Original post link

If the scaled-out TiKV shows up in tiup cluster display, scale it in with the --force option, then scale it out again.

Directly deleting the node_exporter binary is not the right approach. node_exporter is set up as a system service, so removing it means removing it completely, service unit included. That is why scaling in is the recommended route.

If the new machine also hosts components of another cluster, its node_exporter will conflict on the port. So when two clusters share one machine, you need to change the default port of one of the node_exporters. A sketch of this path follows below.
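
A sketch of that recovery path; the cluster name and node ID are placeholders, and the port override is only needed when another cluster's exporters already hold the defaults:

```shell
# Confirm the half-installed instance is visible to tiup, then force-remove it.
tiup cluster display <cluster-name>
tiup cluster scale-in <cluster-name> --node 10.106.17.71:20160 --force

# Scale out again. If 9100/9115 are taken by another cluster on this host,
# override the monitored ports in the topology first, e.g.:
#   monitored:
#     node_exporter_port: 9101
#     blackbox_exporter_port: 9116
tiup cluster scale-out <cluster-name> scale-out.yaml
```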

| username: 路在何chu | Original post link

I plan to scale in the problematic TiKV node completely and then redeploy it to see if that works.

| username: Jolyne | Original post link

Didn’t clean up the environment properly? Use the prune command to clean up the stale nodes.
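
That is tiup's prune subcommand, which destroys instances left in Tombstone state after a scale-in:

```shell
# Clean up nodes that remain in Tombstone state; <cluster-name> is a placeholder.
tiup cluster prune <cluster-name>
```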

| username: WinterLiu | Original post link

Scale it in and start over :grin:

| username: 路在何chu | Original post link

I have executed it. It turned out that the original disk had been unmounted directly, and the monitoring scripts were on that disk. During scale-out, tiup assumed the scripts were already installed, so it didn’t install them again and kept reporting errors.

| username: 路在何chu | Original post link

I directly scaled out the TiKV on other machines, scaled in all the TiKV instances on this machine, and then scaled out again.

| username: IanWong | Original post link

If the issue has been resolved, please select an answer and mark it as the best answer to close the topic.