Abnormal Prometheus Data Size

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: promethues数据大小异常

| username: 像风一样的男子

[TiDB Usage Environment] Production Environment / Testing / PoC
I have a cluster with only 3 TiKV, 3 PD, and 3 TiDB nodes. The average QPS is not high, around 1,000. Prometheus generates more than 3 GB of data daily, so retaining half a year's data requires close to 600 GB of disk space. Is this normal?

| username: wangccsy | Original post link

Have you already started using it in production? I'm still in the learning phase; I only came across it recently, and the chance of us actually adopting it is fairly low.

| username: 像风一样的男子 | Original post link

There are quite a lot of people using it.

| username: caiyfc | Original post link

Following. We have a production cluster with 30GB per day, not sure what’s going on :joy:

| username: 像风一样的男子 | Original post link

This cluster has only 6 machines and very little traffic, so the data volume looks abnormal.

| username: DBRE | Original post link

Is the content in the screenshot from the wal directory?

| username: DBRE | Original post link

What is the usage of each subdirectory in Prometheus? Is there a lot of data in the wal directory?
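
A quick way to check this, sketched with a placeholder path (replace <prometheus-data-dir> with the actual data directory of your tiup-deployed Prometheus):

```bash
# Per-subdirectory usage of the Prometheus data directory; the TSDB blocks are
# the directories whose names start with 01, and the write-ahead log is wal/.
du -sh <prometheus-data-dir>/*
```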

| username: caiyfc | Original post link

I underestimated; it’s around 70GB. It seems that the WAL is also quite significant.

| username: 普罗米修斯 | Original post link

I see that we use at most 3GB in a day. Your disk is quite large. :joy:

| username: DBRE | Original post link

  1. If the directories whose names start with 01 under the Prometheus data directory (the TSDB blocks) take up most of the space, it means a large amount of metric data is being collected. You can curl the metrics interface of each tidb/tikv/pd component and count the returned lines; if a component returns hundreds of thousands or millions of lines, that is indeed abnormal and some metrics need to be dropped (see the command sketch after this list). Generally speaking, tikv's metrics are the most likely to be abnormal. You can add the following configuration at the bottom of the prometheus.yml file under - job_name: "tikv" to reduce data collection, and then restart prometheus through tiup. However, this configuration will be rolled back if there is a topology change.

         metric_relabel_configs:
           - source_labels: [__name__]
             separator: ;
             regex: tikv_thread_nonvoluntary_context_switches|tikv_thread_voluntary_context_switches|tikv_threads_io_bytes_total
             action: drop
           - source_labels: [__name__,name]
             separator: ;
             regex: tikv_thread_cpu_seconds_total;(tokio|rocksdb).+
             action: drop

  2. If there are many wal files, it may be because prometheus has not been checkpointing in time, usually because the amount of collected data is large. You can grep the log/prometheus.log file for the keywords Starting TSDB … and TSDB started to see whether prometheus restarts frequently. Following the steps in point 1 will also reduce this situation.
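
The checks above can be run as plain shell commands. A minimal sketch, assuming default status ports (tikv 20180, tidb 10080, pd 2379) and a standard tiup deployment; the <...> placeholders are illustrative and must be replaced with your own hosts, cluster name, and deploy directory:

```bash
# Count how many metric lines each component exposes; hundreds of thousands
# of lines from a single instance points at runaway metrics.
curl -s http://<tikv-ip>:20180/metrics | wc -l
curl -s http://<tidb-ip>:10080/metrics | wc -l
curl -s http://<pd-ip>:2379/metrics | wc -l

# After adding the drop rules to prometheus.yml, restart only Prometheus.
tiup cluster restart <cluster-name> -R prometheus

# Check whether Prometheus restarts frequently (which leaves extra WAL behind)
# by grepping its log for the TSDB start messages.
grep -E 'Starting TSDB|TSDB started' <prometheus-deploy-dir>/log/prometheus.log
```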

| username: caiyfc | Original post link

Thank you for the explanation :heart:

| username: 像风一样的男子 | Original post link

TiDB collects a lot of metrics. Wouldn't it be too troublesome to configure them all this way?

| username: DBRE | Original post link

It is quite troublesome; operations like reload will overwrite the configuration. I couldn't find a place to set this in tiup cluster edit-config, so otherwise we would have to wrap an extra layer of operations around prometheus.yml outside of tiup.

| username: 小龙虾爱大龙虾 | Original post link

Check the metric statistics in the Prometheus UI to see whether any metric is leaking. If a certain metric takes up too much space, refer to this configuration to stop collecting that metric:
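
One hedged way to do that check from the command line (not from the original post; the host is a placeholder, and /api/v1/query is the standard Prometheus HTTP query API): ask Prometheus which metric names have the most series, so a single runaway metric stands out.

```bash
# Top 10 metric names by series count; replace the host with your Prometheus
# address (tiup deploys Prometheus on port 9090 by default).
curl -s 'http://<prometheus-ip>:9090/api/v1/query' \
  --data-urlencode 'query=topk(10, count by (__name__)({__name__=~".+"}))'
```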

| username: DBRE | Original post link

This is possible, but we can only use up to version 5.2 :joy:

| username: 像风一样的男子 | Original post link

So the way to do it is to use tiup cluster edit-config to edit the cluster configuration and then add metric_relabel_configs rules under monitoring_servers? Will that take effect on every node?
I find the documentation not detailed enough.

| username: 小龙虾爱大龙虾 | Original post link

Just check the final generated configuration in the Prometheus configuration file.

| username: 小龙虾爱大龙虾 | Original post link

It should not be affected; this feature is implemented by tiup, so just upgrading tiup will be enough.
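
For reference, a rough sketch of how that might look in the topology, assuming a tiup version whose monitoring_servers spec supports the additional_scrape_conf field (check the TiDB/tiup monitoring customization docs for your version); the drop rule simply reuses the tikv example from earlier in the thread:

```yaml
monitoring_servers:
  - host: 10.0.1.10             # your Prometheus host
    additional_scrape_conf:     # merged into the generated prometheus.yml
      metric_relabel_configs:
        - source_labels: [__name__]
          separator: ;
          regex: tikv_thread_nonvoluntary_context_switches|tikv_thread_voluntary_context_switches|tikv_threads_io_bytes_total
          action: drop
```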

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.