Note:
This topic has been translated from a Chinese forum by GPT and might contain errors. Original topic: 【SOP Series 33】Multiple TiDB Cluster Monitoring Implementation Solutions
Background
A TiDB cluster deployment comes with a complete monitoring system, which is convenient for many TiDB users. However, it has some drawbacks: there are many monitoring components and the range of collected metrics is very broad, so when a performance issue arises it can be hard to pinpoint the problem from the monitoring data. To address this, TiDB introduced the Performance Overview dashboard in v6.1.0, which extracts the key metrics for quick inspection. This is a good improvement, but it still leaves the question of how to monitor multiple clusters together. How can hundreds or thousands of instances be displayed on one big screen? What if leadership needs statistics urgently? How can routine inspections quickly reveal issues? The following mainly addresses these problems.
Solution 1: Prometheus + Grafana + Consul
1.1 Architecture Diagram
By default, Prometheus and Grafana are already deployed when TiDB is installed, so we won’t deploy them again here.
1.2 Deploy Consul
wget https://releases.hashicorp.com/consul/1.6.1/consul_1.6.1_linux_amd64.zip
unzip consul_1.6.1_linux_amd64.zip && cp consul /sbin/ && mkdir -p /etc/consul.d/ && mkdir -p /data1/consul/
Create the configuration file:
cat /etc/consul.d/server.json
{
  "datacenter": "bjyt",
  "data_dir": "/data1/consul",
  "log_level": "INFO",
  "node_name": "consul-server",
  "server": true,
  "bootstrap_expect": 1,
  "bind_addr": "xx.xx.xx.xx",
  "client_addr": "xx.xx.xx.xx",
  "ui": true,
  "retry_join": ["xx.xx.xx.xx"],
  "retry_interval": "10s",
  "enable_debug": false,
  "rejoin_after_leave": true,
  "start_join": ["xx.xx.xx.xx"],
  "enable_syslog": true,
  "syslog_facility": "local0"
}
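Optionally, you can check the configuration directory for syntax errors before starting the agent:
consul validate /etc/consul.d/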
Start Consul:
nohup consul agent -config-dir=/etc/consul.d > /data1/consul/consul.log 2>&1 &
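If the agent started successfully, it should show up as a server member. A quick sanity check (pointing the CLI at the client_addr configured above):
consul members -http-addr=http://xx.xx.xx.xx:8500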
Access the Consul web management interface at http://xx.xx.xx.xx:8500/ui.
To enhance security, you can also add ACL settings and manage them using the Consul UI token. You can follow the official documentation for setup, which is not covered here.
1.3 Service Registration
Next, we will register the TiDB exporter information to Consul:
First, we need to obtain the TiDB exporter information. There are several ways to do this, such as:
First method:
Call curl http://ip:9090/api/v1/targets against the existing Prometheus to get all of TiDB's exporter host/port/target/label information, and then register it.
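For example, a rough sketch of extracting the scrape targets with curl and jq (the field names follow the Prometheus targets API; adjust the filter as needed):
curl -s http://ip:9090/api/v1/targets | jq -r '.data.activeTargets[] | [.labels.job, .scrapeUrl] | @tsv'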
Second method:
Generally, the metadata about TiDB deployments is stored in database tables. In my case it is stored in a corresponding MySQL table, which makes it easy to query the information required for registration.
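As an illustration only (the table and column names below are hypothetical, not an actual schema), pulling the exporter list from such a metadata table might look like:
mysql -h xx.xx.xx.xx -u monitor -p -e "SELECT ip, role, exporter_port FROM tidb_meta.instances WHERE status = 'online';"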
Service Registration:
The following mainly registers the TiDB/TiKV/PD roles. You can customize which roles to register and add tags.
TiDB:
curl -X PUT -d '{"id": "tidb-exporter","name": "tidb","address": "xx.xx.xx.xx","port": 10080,"tags": ["tidb","shyc2","product","xx.xx.xx.xx","10080"],"checks": [{"http": "http://xx.xx.xx.xx:10080/metrics", "interval": "5s"}]}' http://xx.xx.xx.xx:8500/v1/agent/service/register
TiKV:
curl -X PUT -d '{"id": "tikv-exporter","name": "tidb","address": "xx.xx.xx.xx","port": 20180,"tags": ["tidb","shyc2","product","xx.xx.xx.xx","20180"],"checks": [{"http": "http://xx.xx.xx.xx:20180/metrics", "interval": "5s"}]}' http://xx.xx.xx.xx:8500/v1/agent/service/register
PD:
curl -X PUT -d '{"id": "pd-exporter","name": "tidb","address": "xx.xx.xx.xx","port": 2379,"tags": ["tidb","shyc2","product","xx.xx.xx.xx","2379"],"checks": [{"http": "http://xx.xx.xx.xx:2379/metrics", "interval": "5s"}]}' http://xx.xx.xx.xx:8500/v1/agent/service/register
You can check the Consul web UI to see if the registration was successful.
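Besides the web UI, you can also verify the registration through the Consul HTTP API, for example:
curl http://xx.xx.xx.xx:8500/v1/agent/services
curl http://xx.xx.xx.xx:8500/v1/health/service/tidb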
1.4 Prometheus Integration with Consul:
cat prometheus.yml
# my global config
global:
  scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: 'tidb'
    consul_sd_configs:
      - server: 'xx.xx.xx.xx:8500'
        services: ['tidb']
    relabel_configs:
      - source_labels: ['__meta_consul_tags']
        regex: ',(.*),(.*),(.*),(.*),(.*),'
        action: replace
        target_label: 'instance'
        replacement: '${1}_${4}_${5}'
      - source_labels: ['__meta_consul_tags']
        regex: ',(.*),(.*),(.*),(.*),(.*),'
        action: replace
        target_label: 'dc'
        replacement: '${2}'
      - source_labels: ['__meta_consul_tags']
        regex: ',(.*),(.*),(.*),(.*),(.*),'
        action: replace
        target_label: 'env'
        replacement: '${3}'
      - source_labels: ['__meta_consul_tags']
        regex: ',(.*),(.*),(.*),(.*),(.*),'
        action: replace
        target_label: 'service'
        replacement: '${1}'
      - source_labels: ['__meta_consul_service_address']
        regex: "(.*)"
        action: replace
        target_label: 'ip'
        replacement: '${1}'
      - source_labels: ['__meta_consul_tags']
        regex: ',(.*),(.*),(.*),(.*),(.*),'
        action: replace
        target_label: 'port'
        replacement: '${5}'
You can flexibly use regexes to match and rewrite the Consul tags into whatever labels you want to display in Grafana.
After a simple Grafana configuration, querying a TiDB metric such as tidb_server_connections will return series carrying the custom labels registered in Consul.
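For example, with the labels produced by the relabeling above (dc, env, ip), a query like the following can aggregate connections per data center, either in a Grafana panel or directly against the Prometheus HTTP API; the label values here are just the sample tags used earlier:
curl -G http://xx.xx.xx.xx:9090/api/v1/query --data-urlencode 'query=sum(tidb_server_connections{env="product"}) by (dc, ip)'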
Solution 1 is suitable when there are not many database instances. However, my company runs tens of thousands of instances across various databases, and Consul registration can time out at that scale, so we adopted the second solution below.
Solution 2: VictoriaMetrics + Grafana + API
In short, the idea is to write a backend service discovery API with Tornado, register the DB exporter information with this API, and then have VM pull the exporter targets from the API and display them graphically in Grafana (a possible response format for such an API is sketched below).
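The exact response format depends on the in-house API; one reasonable choice (an assumption here, not necessarily what our API returns) is the target-list JSON that Prometheus-style file-based service discovery understands:
[
  {
    "targets": ["xx.xx.xx.xx:10080"],
    "labels": {"job": "tidb", "role": "tidb", "dc": "shyc2", "env": "product"}
  },
  {
    "targets": ["xx.xx.xx.xx:20180"],
    "labels": {"job": "tidb", "role": "tikv", "dc": "shyc2", "env": "product"}
  }
]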
2.1 Architecture Diagram:
Why use VM and abandon Prometheus? Mainly because VM has better performance and lower memory usage.
2.2 VM vs Prometheus Performance Test
Each test group sends 24 concurrent requests, each querying the top 10 disk usage and MySQL OPS over a 1-hour window.
Each group of requests is run 3 times, and the average value is taken.
Total data volume: 1.3 trillion
Active time series: 11 million
The following screenshot shows the memory usage of VictoriaMetrics and Prometheus processes after executing the above query commands.
You can see that VM's memory usage is much lower. In a large TiDB cluster at the TB storage scale, Prometheus's memory usage can get close to the total system memory, so replacing Prometheus with VictoriaMetrics for collecting monitoring data has brought significant improvements. Below is a brief introduction to using VM in place of Prometheus. Only the single-node setup is covered here; the cluster version is more complex and can be deployed according to the official documentation.
2.3 VM Deployment
tar -xvzf victoria-metrics-amd64-v1.65.0.tar.gz && mv victoria-metrics-prod /usr/local/bin/victoria-metrics-prod
Create the systemd unit file:
cat /etc/systemd/system/victoria-metrics-prod.service
[Unit]
Description=For Victoria-metrics-prod Service
After=network.target
[Service]
ExecStart=/usr/local/bin/victoria-metrics-prod -promscrape.config=/data1/tidb/deploy/conf/prometheus.yml -httpListenAddr=0.0.0.0:8428 -promscrape.config.strictParse=false -storageDataPath=/data1/victoria -retentionPeriod=3
[Install]
WantedBy=multi-user.target
Start VM service:
systemctl daemon-reload && systemctl restart victoria-metrics-prod.service
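Once it is running, VM should respond on the configured -httpListenAddr; a quick check (VM exposes a /health endpoint that returns OK):
curl http://127.0.0.1:8428/health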
2.4 VM Integration with Grafana
Modify Grafana: VM exposes a Prometheus-compatible query API, so you only need to point the existing Prometheus data source URL in Grafana at VM's HTTP address (port 8428 in the unit file above).
Writing the API itself is omitted here. The service registration method is as follows:
curl -sS --connect-timeout 10 -m 20 --retry 3 --retry-max-time 30 -H 'Content-Type: application/json' -XPUT -d '{"ip": "'${LOCAL_LISTEN_IP}'","instance_port": "'${DB_LISTEN_PORT}'","exporter_port": "'${LOCAL_LISTEN_PORT}'","role":"'${ROLE}'","token":"'${TOKEN}'"}' http://127.0.0.1:8888/tidb/${SERVICE_TYPE}
After executing the above command, the TiDB registration data will be inserted into the MySQL database table:
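How VM picks up these registered targets depends on the API implementation. One simple pattern (a sketch, assuming the API can return the target-list JSON shown in the Solution 2 overview; the /tidb/targets path is hypothetical) is to periodically dump the list to a file and reference it from the scrape config via file_sd_configs, which VM's -promscrape.config supports:
# refresh the target file from the discovery API, e.g. from cron
curl -s http://127.0.0.1:8888/tidb/targets > /data1/victoria/tidb_targets.json
# fragment of the scrape config passed to -promscrape.config
scrape_configs:
  - job_name: 'tidb'
    file_sd_configs:
      - files: ['/data1/victoria/tidb_targets.json']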
For the specific monitoring panels in Grafana, you can refer to the performance_overview.json introduced in TiDB v6.1.0. Even if your version is lower than v6, you can import this JSON directly and adjust some panels as needed.
Finally, the combined monitoring view of the TiDB clusters looks as follows. The specific metrics can be customized, mainly by referring to the performance_overview.json introduced in v6.1.
Summary:
This article mainly shares two solutions for unifying multi-cluster TiDB monitoring. There are many other options, such as Prometheus federation; choose the one that suits you. The main goal is to make it easy to inspect and summarize the important metrics of all online clusters, and to visualize them on a big screen so that performance issues can be spotted at a glance.