Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: tiup cluster reload卡着不动 (tiup cluster reload is stuck and not moving)
[TiDB Usage Environment] Production Environment
[TiDB Version]
[Reproduction Path] What operations were performed when the issue occurred
[Encountered Issue: Issue Phenomenon and Impact]
Modified the configuration using tiup cluster edit-config xxx, added prepared-plan-cache.enabled: true, then manually reloaded each node using -N. The other two nodes executed normally, but the last one got stuck, as shown in the image:
What is the reason for this? Where can I check the logs, and how can I troubleshoot?
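For reference, the sequence described above roughly corresponds to the commands below; the cluster name, node address, and the exact place the setting was added are assumptions, not details from the original post:

```shell
# Edit the cluster configuration and add the setting under server_configs -> tidb:
#   server_configs:
#     tidb:
#       prepared-plan-cache.enabled: true
tiup cluster edit-config <cluster-name>

# Reload one node at a time with -N (placeholder address).
tiup cluster reload <cluster-name> -N 10.0.1.1:4000
```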
tiup cluster audit --- find the command that was just executed.
tiup cluster audit 4BLhr0 --- view the execution log of the corresponding task ID.
It seems that your TiDB configuration has been verified, but it got stuck at Prometheus.
It is recommended to run tiup cluster check ./*.yaml before deployment to check for any failed items; if there are any, they need to be fixed first.
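A minimal sketch of that troubleshooting flow; the audit ID 4BLhr0 is the one quoted above, and the topology file path is a placeholder:

```shell
tiup cluster audit                    # list recent tiup operations and their audit IDs
tiup cluster audit 4BLhr0             # show the execution log of that specific operation
tiup cluster check ./topology.yaml    # run pre-deployment checks against a topology file
```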
Because the tiup cluster reload operation regenerates the configuration files, the original Alertmanager configuration was overwritten. After manually correcting it, restarting Alertmanager took tens of minutes, so the method above might also work as long as you don't interrupt it with Ctrl+C; it's just unclear at which step it got stuck.
config_file: This field specifies a local file that will be transferred to the target machine during the cluster configuration initialization phase as the configuration for Alertmanager.
You can configure this parameter under alertmanager_servers and place the modified alertmanager.yml on the control machine. This file will be automatically scp’ed to the alertmanager’s conf directory as alertmanager.yml during each reload. This way, you can avoid modifying the alertmanager configuration every time you reload.
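A minimal topology sketch of that setup, assuming a placeholder host and file path; only the config_file field itself comes from the reply above:

```yaml
alertmanager_servers:
  - host: 10.0.1.11                          # placeholder address
    # Local file on the tiup control machine; it is pushed to the instance's
    # conf/alertmanager.yml on deploy and on every reload, so the Alertmanager
    # configuration is no longer overwritten each time you reload.
    config_file: /home/tidb/alertmanager.yml
```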
Additionally, if you only need to change the alertmanager.yml configuration, you can have Alertmanager hot-reload it without restarting. Just use:
curl -X POST 'http://<alertmanager ip>:9093/-/reload'
and it will be done.
Your method is good. In my scenario, though, I'm not modifying Alertmanager by itself; I'm editing the cluster configuration, and when reloading tidb-server it gets stuck on the Alertmanager and Prometheus instances.
Yes, my reply is unrelated to your main question; it’s just a response to your comment.
If a component is not involved in the edit-config change, I think you can even consider limiting the reload scope with
tiup cluster reload <cluster-name> -N <node> -R <role>
For example, the parameters I am modifying now only involve tidb, so:
tiup cluster reload <cluster-name> -R tidb
After all, the monitoring suite being used is still from around 2020, not the latest version, so encountering some legacy bugs is not surprising.