Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: 多集群promethus的部署方案 (Deployment plan for multi-cluster Prometheus)
[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version]
[Reproduction Path] What operations were performed that led to the issue
[Encountered Issue: Problem Phenomenon and Impact] Deployment plan for Prometheus across multiple clusters: how can Prometheus automatically pick up node changes when a cluster is scaled out or scaled in?
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]
Automatic scaling? Isn’t it all manual scaling?
What I mean is: for example, I have 5 clusters that all need to be monitored. I want to use a single Prometheus setup. How can I automatically ensure that when nodes are scaled up or down, Prometheus can detect this and automatically add or remove the monitored nodes?
Each cluster has its own Prometheus, which is easier to manage. Using a single Prometheus for the entire system means that every time you scale up or down, the configuration file is regenerated, and information from other clusters is lost.
In other words, it is best for each cluster to have its own Prometheus. With 5 clusters there will be 5 Prometheus instances, so that newly added nodes are monitored automatically and scaled-in nodes are dropped from monitoring automatically.
Best not to share one, otherwise you’ll have to handle it manually. One Prometheus per cluster should work better.
Yes, I have discovered this issue. So, are there also 5 alerts in the alert system? And are there 5 in Grafana as well?
Got it.
Wouldn’t it be clearer to look at them separately? Integrating several sets into one Grafana makes it difficult to distinguish issues with individual nodes.
These monitoring nodes don’t consume many resources. At worst, just change the ports and put them all on one machine.
Otherwise, if a large amount of monitoring information from one cluster is reported, causing the monitoring of other clusters to fail, wouldn’t that be even more troublesome?
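On the idea of putting several clusters’ monitoring components on one machine with different ports: below is a minimal sketch of what the monitoring section of a second cluster’s TiUP topology might look like. The host, ports, and directories are made-up examples, not values from this thread; the field names follow the TiUP cluster topology format.

```yaml
# Monitoring section of a hypothetical second cluster's topology file.
# Ports are shifted so they don't clash with the first cluster's
# Prometheus (9090), Grafana (3000), and Alertmanager (9093/9094).
monitoring_servers:
  - host: 10.0.0.100
    port: 9091
    deploy_dir: /data/cluster-b/prometheus-9091
    data_dir: /data/cluster-b/prometheus-9091/data

grafana_servers:
  - host: 10.0.0.100
    port: 3001
    deploy_dir: /data/cluster-b/grafana-3001

alertmanager_servers:
  - host: 10.0.0.100
    web_port: 9094
    cluster_port: 9095
    deploy_dir: /data/cluster-b/alertmanager-9094
```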
I’ll still deploy them separately. I ran into the problem above and tested it; it’s exactly the situation you described.
Why pull them all into one place to view? Wouldn’t that be more inconvenient?
You can use confd to dynamically generate Prometheus configuration files.
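To illustrate the idea (a sketch, not a verified setup): instead of rewriting prometheus.yml itself, confd, or any small script, can keep one target file per cluster up to date, and Prometheus’s file_sd_configs will reload those files without a restart. The paths, job name, ports, and label below are assumptions for illustration only.

```yaml
# prometheus.yml of the shared instance (sketch)
scrape_configs:
  - job_name: "tidb-clusters"
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.yml   # one file per cluster, rewritten by confd or a script
        refresh_interval: 1m                # Prometheus re-reads the files on this interval
```

```yaml
# /etc/prometheus/targets/cluster-a.yml (sketch) -- regenerate on every scale-in/scale-out
- targets: ["10.0.1.11:10080", "10.0.1.12:20180", "10.0.1.13:2379"]
  labels:
    cluster: "cluster-a"
```

This way, scaling one cluster only rewrites that cluster’s target file, so the other clusters’ entries are never touched, which avoids the “config gets regenerated and the other clusters disappear” problem described above.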
Having one Prometheus per TiDB cluster is indeed quite a waste of resources.
That’s right, it is quite resource-intensive, but it makes maintenance easier.
Learned a lot, thanks for sharing.
Resources and convenience are directly proportional.
Take a look at Prometheus federation, which might meet your needs: 联邦集群 (Federation) | prometheus-book
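For context, federation means a central Prometheus scrapes the /federate endpoint of each per-cluster Prometheus. Here is a minimal sketch of the central instance’s scrape config; the hostnames and the match[] selector are placeholders:

```yaml
scrape_configs:
  - job_name: "federate"
    honor_labels: true          # keep the labels set by each cluster's own Prometheus
    metrics_path: /federate
    params:
      "match[]":
        - '{__name__=~"tidb_.*|tikv_.*|pd_.*"}'   # placeholder selector; pull only what you need
    static_configs:
      - targets:
          - "cluster-a-prometheus:9090"
          - "cluster-b-prometheus:9090"
```

Each cluster keeps its own Prometheus, so scale-in/scale-out is still handled per cluster by tiup, and the central instance only aggregates, giving you one place for Grafana and alerting.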
It is more appropriate to monitor each cluster separately.