After manually killing Prometheus and NG, NG fails to restart

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 手动kill掉prometheus和ng后,ng重启失败

| username: yinyuncan

[TiDB Usage Environment] Production Environment
[TiDB Version] 6.5.0
[Reproduction Path] Operations performed that led to the issue

Cause: I needed to modify the Prometheus startup script, so I had to restart.

After manually killing Prometheus and ng, ng failed to restart.

Then I used tiup to restart, but it still failed.

tiup cluster reload ${cluster-name} --role prometheus

[Encountered Issue: Symptoms and Impact]
Permission issue reported

`
[FATAL] [document.go:36]
["failed to open a badger storage"]
[path=/tidb-data/prometheus-9090/docdb]
[error="Cannot write pid file "/tidb-data/prometheus-9090/docdb/LOCK" error: open /tidb-data/prometheus-9090/docdb/LOCK: permission denied"]
[stack="github.com/pingcap/ng-monitoring/database/document.Init\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/ng-monitoring/database/document/document.go:36\ngithub.com/pingcap/ng-monitoring/database.Init\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/ng-monitoring/database/database.go:14\nmain.main\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/ng-monitoring/main.go:68\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"]
`
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]

| username: ffeenn | Original post link

You are using the root user to execute the startup script, right? Remove the LOCK file, which should be the lock file where the startup script process ID is stored, and then try restarting.

| username: yinyuncan | Original post link

It’s still not working. I delete it, but it gets created immediately. It seems like there’s a daemon process continuously restarting it.
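
One way to narrow down what keeps recreating the file, assuming (as the reply above describes) that LOCK holds the owning process ID as plain text, is to read that PID and check whether the process is alive. This is only a sketch: `lock_holder_status` is a made-up helper name, and the path in the example is the one from the error log.

```shell
# Hypothetical helper: report whether the PID recorded in a lock file is alive.
# Assumes the file contains just a process ID as text, as described in the thread.
lock_holder_status() {
  lockfile="$1"
  [ -r "$lockfile" ] || { echo "unreadable"; return 1; }
  # Keep digits only, in case the file ends with a newline or other bytes.
  pid=$(tr -cd '0-9' < "$lockfile")
  [ -n "$pid" ] || { echo "no pid"; return 1; }
  # kill -0 sends no signal; it only tests for existence (and permission).
  # The /proc check covers Linux processes owned by another user.
  if kill -0 "$pid" 2>/dev/null || [ -d "/proc/$pid" ]; then
    echo "$pid running"
  else
    echo "$pid not running"
  fi
}

# Example with the path from the error log:
# lock_holder_status /tidb-data/prometheus-9090/docdb/LOCK
```

If a PID is reported as running, `ps -o pid,ppid,user,args -p <pid>` shows which user started it and which parent process it belongs to, which may point at whatever is restarting it.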

| username: yinyuncan | Original post link

Here is the error log from restarting Prometheus with tiup. It looks like an SSH error.

  {"code": 1, "error": "init config failed: 10.0.0.80:9090: failed to scp /root/.tiup/storage/cluster/clusters/ym-tidb/config-cache/prometheus_10.0.0.80_9090.yml to tidb@10.0.0.80:/tidb-deploy/prometheus-9090/conf/prometheus.yml: Process exited with status 1", "errorVerbose": "Process exited with status 1\nfailed to scp /root/.tiup/storage/cluster/clusters/ym-tidb/config-cache/prometheus_10.0.0.80_9090.yml to tidb@10.0.0.80:/tidb-deploy/prometheus-9090/conf/prometheus.yml\ngithub.com/pingcap/tiup/pkg/cluster/executor.(*EasySSHExecutor).Transfer\n\tgithub.com/pingcap/tiup/pkg/cluster/executor/ssh.go:207\ngithub.com/pingcap/tiup/pkg/cluster/executor.(*CheckPointExecutor).Transfer\n\tgithub.com/pingcap/tiup/pkg/cluster/executor/checkpoint.go:114\ngithub.com/pingcap/tiup/pkg/cluster/spec.(*MonitorInstance).InitConfig\n\tgithub.com/pingcap/tiup/pkg/cluster/spec/monitoring.go:386\ngithub.com/pingcap/tiup/pkg/cluster/task.(*InitConfig).Execute\n\tgithub.com/pingcap/tiup/pkg/cluster/task/init_config.go:50\ngithub.com/pingcap/tiup/pkg/cluster/task.(*Serial).Execute\n\tgithub.com/pingcap/tiup/pkg/cluster/task/task.go:86\ngithub.com/pingcap/tiup/pkg/cluster/task.(*StepDisplay).Execute\n\tgithub.com/pingcap/tiup/pkg/cluster/task/step.go:111\ngithub.com/pingcap/tiup/pkg/cluster/task.(*Parallel).Execute.func1\n\tgithub.com/pingcap/tiup/pkg/cluster/task/task.go:144\nruntime.goexit\n\truntime/asm_amd64.s:1594\ninit config failed: 10.0.0.80:9090"}

| username: 裤衩儿飞上天 | Original post link

Are you using the root user? I see that the user for scp is tidb.
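
To check this suspicion, one could list what root owns under the directories that tiup writes to as tidb. This is a minimal sketch: `list_owned_by` is a hypothetical helper name, and the example paths are the ones that appear in the logs above.

```shell
# Hypothetical helper: print every path under a directory owned by the given user.
list_owned_by() {
  dir="$1"
  user="$2"   # user name or numeric UID
  find "$dir" -user "$user" -print
}

# Example: anything printed here is owned by root and may not be writable by tidb.
# list_owned_by /tidb-deploy/prometheus-9090 root
# list_owned_by /tidb-data/prometheus-9090 root
```

An empty result would suggest that ownership is not what blocks the scp step, and that the cause lies elsewhere.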

| username: yinyuncan | Original post link

Yes, boss. I used the root user to execute ng-wrapper.sh and run_prometheus.sh.

| username: yinyuncan | Original post link

Error code: ng-monitoring/database/document/document.go at main · pingcap/ng-monitoring · GitHub

| username: yinyuncan | Original post link

Continue tracking the error
Code address: chai/engine/badgerengine/engine.go at edc47458e29a6847b32aed7ccaa4df4454a50e8d · chaisql/chai · GitHub

| username: yinyuncan | Original post link

Then continue searching

| username: yinyuncan | Original post link

Code location: bbolt/db.go at 685b13a4ef0053a4a38623bcebda621db6f7eaf7 · etcd-io/bbolt · GitHub

| username: yinyuncan | Original post link

At line 230 of this method, I found a very useful comment.

| username: yinyuncan | Original post link

Code location: bbolt/bolt_unix.go at 685b13a4ef0053a4a38623bcebda621db6f7eaf7 · etcd-io/bbolt · GitHub

| username: yinyuncan | Original post link

Now there are no errors reported, but the process hasn’t started.
I’m really going to lose it.

[2023/01/11 19:46:50.256 +08:00] [INFO] [printer.go:25] ["Welcome to ng-monitoring."] ["Git Commit Hash"=f1c05e221155c2c95d391957971defbcbbf56832] ["Git Branch"=heads/refs/tags/v6.5.0] ["UTC Build Time"="2022-12-16 08:18:47"] [GoVersion=go1.19.3]
[2023/01/11 19:46:50.256 +08:00] [INFO] [main.go:64] [config] [config="{\"address\":\"0.0.0.0:12020\",\"advertise_address\":\"10.0.0.80:12020\",\"pd\":{\"endpoints\":[\"10.0.0.80:2379\",\"10.0.0.76:2379\",\"10.0.0.74:2379\"]},\"log\":{\"path\":\"/tidb-deploy/prometheus-9090/log\",\"level\":\"INFO\"},\"storage\":{\"path\":\"/tidb-data/prometheus-9090\"},\"continuous_profiling\":{\"enable\":false,\"profile_seconds\":10,\"interval_seconds\":60,\"timeout_seconds\":120,\"data_retention_seconds\":259200},\"security\":{\"ca_path\":\"\",\"cert_path\":\"\",\"key_path\":\"\"}}"]
| username: ohammer | Original post link

The expert traced the issue straight down to the source code level. That’s impressive.

| username: ffeenn | Original post link

First, because you ran the scripts as the root user, some files in the directory that the service needs to modify at startup can no longer be written with the service's normal permissions. Fix the permission issue first by running chown tidb.tidb -R /xxxx. As for LOCK, it is a process lock that stores the process ID; you don’t need to worry about it, as it is not the problem. When starting a service on its own, try to use the systemctl XXX prometheus-9090 method.
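
The ownership fix suggested here could look like the sketch below. It uses the `user:group` colon form of chown and then verifies the result with find. `reown` is a hypothetical helper name, and the directory stays a placeholder (`/xxxx`) exactly as in the reply, since the thread does not say which directories to fix.

```shell
# Hypothetical helper: recursively hand a directory to owner:group,
# then fail if anything under it is still owned by someone else.
reown() {
  dir="$1"
  owner="$2"
  group="$3"
  chown -R "$owner:$group" "$dir" || return 1
  leftover=$(find "$dir" ! -user "$owner" -print)
  if [ -n "$leftover" ]; then
    echo "still not owned by $owner:" >&2
    echo "$leftover" >&2
    return 1
  fi
  echo "ok: $dir"
}

# Example, run as root (directory is a placeholder, as in the reply):
# reown /xxxx tidb tidb
```

After that, starting the service through systemctl rather than calling the scripts directly as root should avoid creating new root-owned files, assuming the unit is configured to run as the deployment user.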

| username: yinyuncan | Original post link

Yes, that’s right. Yesterday I approached it as a problem of the file being held by another process.

Later, the error logs stopped appearing, but the process still hasn’t started.

It seems that the system startup command needs to be executed manually, but even with tiup it doesn’t start.

| username: ffeenn | Original post link

What is the error message when starting with tiup? Have you tried using systemctl to start it? Please package and upload the logs from the log directory.
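
Packaging the logs could be done along these lines. The log directory in the example is the `log.path` value from the ng-monitoring startup output earlier in the thread; the helper name and the archive name are made up for illustration.

```shell
# Hypothetical helper: bundle a log directory into a gzip-compressed tarball.
pack_logs() {
  logdir="$1"
  archive="$2"
  # -C switches to the parent directory so the archive holds relative paths.
  tar -czf "$archive" -C "$(dirname "$logdir")" "$(basename "$logdir")"
}

# Example:
# pack_logs /tidb-deploy/prometheus-9090/log /tmp/prometheus-9090-logs.tar.gz
```

Running `tar -tzf` on the resulting archive lists its contents, which is a quick check before uploading.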