[TiDB Usage Environment] Testing
[Reproduction Path] None
[Encountered Problem: Phenomenon and Impact]
The node has multiple roles: both Prometheus and PD are deployed on it. The node previously crashed because Prometheus had accumulated a large amount of cached data. After restarting it, tiup cluster display tidb-lab no longer lists this node among the PD nodes, but the Dashboard still shows it in the PD node list.
I would like to know which side is accurate, and how can the two states be brought back in sync?
[Attachment: Screenshot/Log/Monitoring]
I would rather look for another solution first, because this issue appeared after restarting the node. I am worried that a similar problem could occur in production and that we would not be able to resolve it by restarting the cluster.
So tiup is missing a PD node? Have you run any tiup operations against this PD node before? tiup stores its topology information locally on the tiup control machine, so if you have not operated on the node through tiup, the output of tiup cluster display should, in theory, not be missing anything.
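A quick way to compare the two views is to look at tiup's locally stored topology directly; the path below assumes the default tiup home on the control machine, with tidb-lab as the cluster name used in this thread.
# tiup's local record of the cluster topology (control machine)
less ~/.tiup/storage/cluster/clusters/tidb-lab/meta.yaml
If the PD entry is absent there, tiup cluster display is simply reflecting that local metadata rather than what PD itself reports.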
Yes, the tiup display is missing a PD node. We scaled this node out a few months ago, but there should not have been any similar operations recently.
Is the PD process still running on that host? Check the live configuration with tiup cluster edit-config tidb-lab to see whether the entry for this PD is still there.
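As a concrete way to run that check (the systemd unit name pd-2378.service is an assumption based on this cluster's deploy directory /tidb-deploy/pd-2378):
# on the 10.247.168.75 host: is pd-server still up, and does systemd know about it?
ps -ef | grep pd-server
systemctl status pd-2378.service
# on the tiup control machine: is the PD entry still present in the stored topology?
tiup cluster edit-config tidb-lab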
The PD process on the machine is still running; I will try scaling this node out again. The process looks like this:
tidb 1430 1.9 2.0 20237948 334792 ? Ssl Jan04 174:45 bin/pd-server --name=pd-10.247.168.75-2378 --client-urls=http://0.0.0.0:2378 --advertise-client-urls=http://10.247.168.75:2378 --peer-urls=http://0.0.0.0:2380 --advertise-peer-urls=http://10.247.168.75:2380 --data-dir=/tidb-data/pd-2378 --join=http://10.247.168.18:2378,http://10.247.168.77:2378 --config=conf/pd.toml --log-file=/tidb-deploy/pd-2378/log/pd.log
An error occurred during the scale-out, and manually running systemctl enable node_exporter-9100.service on the node produces the same error:
2023-01-10T07:31:22.234Z ERROR CheckPoint {"host": "10.247.168.75", "port": 22, "user": "tidb", "sudo": true, "cmd": "systemctl daemon-reload && systemctl enable node_exporter-9100.service", "stdout": "", "stderr": "Failed to execute operation: No such file or directory\n", "error": "executor.ssh.execute_failed: Failed to execute command over SSH for 'tidb@10.247.168.75:22' {ssh_stderr: Failed to execute operation: No such file or directory\n, ssh_stdout: , ssh_command: export LANG=C; PATH=$PATH:/bin:/sbin:/usr/bin:/usr/sbin /usr/bin/sudo -H bash -c "systemctl daemon-reload && systemctl enable node_exporter-9100.service"}, cause: Process exited with status 1", "errorVerbose": "executor.ssh.execute_failed: Failed to execute command over SSH for 'tidb@10.247.168.75:22' {ssh_stderr: Failed to execute operation: No such file or directory\n, ssh_stdout: , ssh_command: export LANG=C; PATH=$PATH:/bin:/sbin:/usr/bin:/usr/sbin /usr/bin/sudo -H bash -c "systemctl daemon-reload && systemctl enable node_exporter-9100.service"}, cause: Process exited with status 1\n at github.com/pingcap/tiup/pkg/cluster/executor.(*EasySSHExecutor).Execute()\n\tgithub.com/pingcap/tiup/pkg/cluster/executor/ssh.go:174\n at github.com/pingcap/tiup/pkg/cluster/executor.(*CheckPointExecutor).Execute()\n\tgithub.com/pingcap/tiup/pkg/cluster/executor/checkpoint.go:85\n at github.com/pingcap/tiup/pkg/cluster/module.(*SystemdModule).Execute()\n\tgithub.com/pingcap/tiup/pkg/cluster/module/systemd.go:98\n at github.com/pingcap/tiup/pkg/cluster/operation.systemctl()\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:376\n at github.com/pingcap/tiup/pkg/cluster/operation.systemctlMonitor.func1()\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:286\n at golang.org/x/sync/errgroup.(*Group).Go.func1()\n\tgolang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57\n at runtime.goexit()\n\truntime/asm_amd64.s:1581", "hash": "ce8eb0a645cc3ead96a44d67b1ecd5034d112cf0", "func": "github.com/pingcap/tiup/pkg/cluster/executor.(*CheckPointExecutor).Execute", "hit": false}
Manually remove the PD member on the abnormal node, then use tiup to scale out again. There may have been a problem during the earlier tiup scale-out; you can pull the audit log of that operation with tiup cluster audit and upload it for review.
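For reference, roughly how that audit log can be retrieved on the control machine (<audit-id> is a placeholder taken from the listing):
# list past tiup operations; the first column is the audit ID
tiup cluster audit
# show the full record of the failed scale-out
tiup cluster audit <audit-id>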
This PD did not come up cleanly. Use pd-ctl to connect, check the health status, and find the ID of the abnormal node via member. After deleting the member with ID 1319539429105371180, try the scale-out again.
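A minimal sketch of those pd-ctl steps, assuming you go through one of the healthy PD endpoints mentioned in this thread and replace <cluster-version> with the actual cluster version:
# check member health and list members through a healthy PD
tiup ctl:v<cluster-version> pd -u http://10.247.168.18:2378 health
tiup ctl:v<cluster-version> pd -u http://10.247.168.18:2378 member
# remove the abnormal member by the ID identified above, then retry the scale-out
tiup ctl:v<cluster-version> pd -u http://10.247.168.18:2378 member delete id 1319539429105371180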
Earlier I found that the scale-out failed because the node_exporter service unit was not registered on the original machine. After creating the relevant link and registering the service, the scale-out succeeded, and the two sides are now consistent. I suspect the earlier inconsistency was related to this node_exporter issue.
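For anyone hitting the same "No such file or directory" from systemctl enable, a rough checklist along those lines; the source path of the unit file is a placeholder, since where a usable copy lives depends on the deployment:
# does systemd know about the unit at all?
systemctl list-unit-files | grep node_exporter
ls -l /etc/systemd/system/node_exporter-9100.service
# if the unit file is missing, link (or copy) it back from wherever a copy exists, then register it
sudo ln -s /path/to/node_exporter-9100.service /etc/systemd/system/node_exporter-9100.service
sudo systemctl daemon-reload
sudo systemctl enable node_exporter-9100.service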