Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: 手动kill掉prometheus和ng后,ng重启失败 (After manually killing Prometheus and ng, ng fails to restart)
[TiDB Usage Environment] Production Environment
[TiDB Version] 6.5.0
[Reproduction Path] Operations performed that led to the issue
Cause: I needed to modify the Prometheus startup script, so I had to restart.
After manually killing Prometheus and ng, ng failed to restart.
Then I used tiup to restart, but it still failed.
tiup cluster reload ${cluster-name} --role prometheus
[Encountered Issue: Symptoms and Impact]
ng reports a permission error on startup:
`
[FATAL] [document.go:36] ["failed to open a badger storage"] [path=/tidb-data/prometheus-9090/docdb] [error="Cannot write pid file "/tidb-data/prometheus-9090/docdb/LOCK" error: open /tidb-data/prometheus-9090/docdb/LOCK: permission denied"] [stack="github.com/pingcap/ng-monitoring/database/document.Init\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/ng-monitoring/database/document/document.go:36\ngithub.com/pingcap/ng-monitoring/database.Init\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/ng-monitoring/database/database.go:14\nmain.main\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/ng-monitoring/main.go:68\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"]
`
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]
You are running the startup script as the root user, right? Try removing the LOCK file, which should be the lock file that stores the process ID of the started process, and then restart to see if it works.
It’s still not working. When I delete the file, it is recreated immediately. It looks like a daemon process keeps restarting ng.
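One way to test the guess that something keeps restarting the process is to read the PID recorded in the LOCK file and check whether that process is alive. The sketch below is only an illustration: `lock_holder_state` is a hypothetical helper name, and it assumes the LOCK file holds a single numeric PID, as the "Cannot write pid file" message in the log suggests.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether the PID stored in a lock file
# belongs to a live process. Assumes the file holds one numeric PID.
lock_holder_state() {
    local lock_file="$1" pid=""
    if [ ! -r "$lock_file" ]; then
        echo "unreadable"
        return 0
    fi
    # Read the first line only; tolerate a missing trailing newline.
    read -r pid < "$lock_file" || true
    case "$pid" in
        ''|*[!0-9]*) echo "not-a-pid"; return 0 ;;
    esac
    # kill -0 sends no signal, it only checks the process. It fails for
    # processes owned by another user, so fall back to ps in that case.
    if kill -0 "$pid" 2>/dev/null || ps -p "$pid" >/dev/null 2>&1; then
        echo "running $pid"
    else
        echo "stale $pid"
    fi
}

# Example (path taken from the error log in this thread):
# lock_holder_state /tidb-data/prometheus-9090/docdb/LOCK
```

If this prints a different running PID each time the file is deleted, that would support the idea that a supervisor or wrapper script is relaunching the process.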
Here is the error log from restarting Prometheus with tiup. It looks like an SSH error.
{"code": 1, "error": "init config failed: 10.0.0.80:9090: failed to scp /root/.tiup/storage/cluster/clusters/ym-tidb/config-cache/prometheus_10.0.0.80_9090.yml to tidb@10.0.0.80:/tidb-deploy/prometheus-9090/conf/prometheus.yml: Process exited with status 1", "errorVerbose": "Process exited with status 1\nfailed to scp /root/.tiup/storage/cluster/clusters/ym-tidb/config-cache/prometheus_10.0.0.80_9090.yml to tidb@10.0.0.80:/tidb-deploy/prometheus-9090/conf/prometheus.yml\ngithub.com/pingcap/tiup/pkg/cluster/executor.(*EasySSHExecutor).Transfer\n\tgithub.com/pingcap/tiup/pkg/cluster/executor/ssh.go:207\ngithub.com/pingcap/tiup/pkg/cluster/executor.(*CheckPointExecutor).Transfer\n\tgithub.com/pingcap/tiup/pkg/cluster/executor/checkpoint.go:114\ngithub.com/pingcap/tiup/pkg/cluster/spec.(*MonitorInstance).InitConfig\n\tgithub.com/pingcap/tiup/pkg/cluster/spec/monitoring.go:386\ngithub.com/pingcap/tiup/pkg/cluster/task.(*InitConfig).Execute\n\tgithub.com/pingcap/tiup/pkg/cluster/task/init_config.go:50\ngithub.com/pingcap/tiup/pkg/cluster/task.(*Serial).Execute\n\tgithub.com/pingcap/tiup/pkg/cluster/task/task.go:86\ngithub.com/pingcap/tiup/pkg/cluster/task.(*StepDisplay).Execute\n\tgithub.com/pingcap/tiup/pkg/cluster/task/step.go:111\ngithub.com/pingcap/tiup/pkg/cluster/task.(*Parallel).Execute.func1\n\tgithub.com/pingcap/tiup/pkg/cluster/task/task.go:144\nruntime.goexit\n\truntime/asm_amd64.s:1594\ninit config failed: 10.0.0.80:9090"}
Are you using the root user? I see that the user for scp is tidb.
Yes, boss. I used the root user to execute ng-wrapper.sh and run_prometheus.sh.
At line 230 of this method, I found a very useful comment.
Now there are no errors reported, but the process hasn’t started.
I’m really going to lose it.
[2023/01/11 19:46:50.256 +08:00] [INFO] [printer.go:25] ["Welcome to ng-monitoring."] ["Git Commit Hash"=f1c05e221155c2c95d391957971defbcbbf56832] ["Git Branch"=heads/refs/tags/v6.5.0] ["UTC Build Time"="2022-12-16 08:18:47"] [GoVersion=go1.19.3]
[2023/01/11 19:46:50.256 +08:00] [INFO] [main.go:64] [config] [config="{\"address\":\"0.0.0.0:12020\",\"advertise_address\":\"10.0.0.80:12020\",\"pd\":{\"endpoints\":[\"10.0.0.80:2379\",\"10.0.0.76:2379\",\"10.0.0.74:2379\"]},\"log\":{\"path\":\"/tidb-deploy/prometheus-9090/log\",\"level\":\"INFO\"},\"storage\":{\"path\":\"/tidb-data/prometheus-9090\"},\"continuous_profiling\":{\"enable\":false,\"profile_seconds\":10,\"interval_seconds\":60,\"timeout_seconds\":120,\"data_retention_seconds\":259200},\"security\":{\"ca_path\":\"\",\"cert_path\":\"\",\"key_path\":\"\"}}"]
The experts directly locate the issue at the source code level, that’s impressive.
First, because you ran the scripts as the root user, some files in the directory that the service needs to modify at startup are probably now owned by root, so the tidb user no longer has permission to change them. Fix the permission issue first by running chown -R tidb:tidb /xxxx. As for LOCK, it is a process lock that stores the process ID; you do not need to worry about it, as it is not the problem. When starting a service on its own, try to use the systemctl XXX prometheus-9090 method.
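To see which files the root-run scripts left behind before running chown, a quick ownership scan can help. This is a minimal sketch and not part of tiup: `list_foreign_owned` is a hypothetical helper, and the `tidb` user and the two directories in the example are assumptions taken from the paths in the logs earlier in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list everything under a directory that is NOT
# owned by the expected user (user name or numeric UID).
list_foreign_owned() {
    local dir="$1" owner="$2"
    find "$dir" ! -user "$owner" -print
}

# Example, assuming the deploy user is tidb and the paths from the logs:
# list_foreign_owned /tidb-data/prometheus-9090 tidb
# list_foreign_owned /tidb-deploy/prometheus-9090 tidb
# Anything printed is a candidate for: chown -R tidb:tidb <dir>
```

An empty result for both directories would suggest that ownership is no longer the reason the process fails to start.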
Yes, that’s right. Yesterday I was trying to solve it on the assumption that the file was being held by another process.
Later, the error logs stopped appearing. The current situation is that it still hasn’t started up.
It seems that the system startup command needs to be executed manually, but even using tiup, it doesn’t start up.
What is the error message when starting with tiup? Have you tried using systemctl to start it? Please package and upload the logs from the log directory.
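The checks requested here could be gathered roughly as follows. This is a sketch under assumptions: the unit name `prometheus-9090` comes from the earlier reply, the log directory comes from the ng-monitoring config line in the startup log, and `print_diag_commands` is a hypothetical helper that only prints the commands so they can be reviewed before being run.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (not run) diagnostic commands for one
# service, given its systemd unit name and its log directory.
print_diag_commands() {
    local unit="$1" log_dir="$2"
    # Current state of the unit and its most recent exit status.
    printf 'systemctl status %s\n' "$unit"
    # Recent journal entries for the unit.
    printf 'journalctl -u %s -n 200 --no-pager\n' "$unit"
    # Package the log directory for upload.
    printf 'tar czf %s-logs.tar.gz -C %s .\n' "$unit" "$log_dir"
}

# Example with the values seen in this thread:
# print_diag_commands prometheus-9090 /tidb-deploy/prometheus-9090/log
```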