How to Clean Up TiDB Monitoring Data That Occupies a Large Amount of Disk Space

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb监控数据占用大量磁盘空间,如何清理

| username: TiDBer_7Q5CQdQd

The data in this Prometheus directory is too large. How can I clean it up?

You can sort the files by time and delete the files from the time periods you don’t need.
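
For example, a rough sketch of how to see which block directories are oldest and largest before removing anything (the data path is hypothetical; adjust it to your deployment, and adjusting storage_retention is the safer route):

    # Size of each TSDB block directory, largest last
    du -sh /tidb-data/prometheus-9092/* | sort -h

    # The same directories ordered by modification time, newest first
    ls -lt /tidb-data/prometheus-9092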

| username: Kongdom | Original post link

Which folder under the ./prometheus-9092 directory is taking up the most space?

| username: TiDBer_7Q5CQdQd | Original post link

They are all about the same size, around 800 MB each. I don’t know what these folders are for.

| username: Kongdom | Original post link

Refer to the configuration at https://docs.pingcap.com/zh/tidb/stable/tiup-cluster-topology-reference#monitoring_servers and set storage_retention.

storage_retention: Prometheus monitoring data retention time, default is “30d”
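
For example, a minimal sketch of changing it with TiUP (the cluster name and host are placeholders):

    # Open the cluster topology for editing
    tiup cluster edit-config <cluster-name>

    # In the monitoring_servers section, set a shorter retention, e.g.:
    # monitoring_servers:
    #   - host: 10.0.1.10
    #     storage_retention: "15d"

    # Apply the change to Prometheus only
    tiup cluster reload <cluster-name> -R prometheus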

| username: zhanggame1 | Original post link

If you set storage_retention to a shorter period, older data will be deleted automatically. If the existing data is not needed at all, you can scale in the monitoring node and then scale it out again.
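
If you do decide to discard all existing monitoring data, a rough sketch of the scale-in / scale-out approach (node address and scale-out.yaml are placeholders; 9090 is the default Prometheus port):

    # Remove the current monitoring node together with its data
    tiup cluster scale-in <cluster-name> --node <host>:9090

    # scale-out.yaml (hypothetical), then deploy a fresh monitoring node
    # monitoring_servers:
    #   - host: <host>
    #     storage_retention: "15d"
    tiup cluster scale-out <cluster-name> scale-out.yaml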

| username: dba远航 | Original post link

Set the parameter storage_retention. If you don’t need to retain data for too long, you can set the value to be smaller.

| username: 像风一样的男子 | Original post link

1. If the directories under the Prometheus data directory whose names start with 01 (the TSDB blocks) occupy a lot of space, it means a large amount of data is being collected. You can curl the metrics endpoints of each tidb/tikv/pd component and count the returned lines (see the sketch after this list); if the count is in the hundreds of thousands or millions, it is indeed abnormal and some metrics need to be dropped. Generally it is TiKV’s metrics that are abnormal. You can add the following configuration under the - job_name: “tikv” section near the bottom of prometheus.yml to reduce the amount collected, and then restart Prometheus with tiup. However, this configuration will be rolled back if there is a topology change.

     metric_relabel_configs:
       - source_labels: [__name__]
         separator: ;
         regex: tikv_thread_nonvoluntary_context_switches|tikv_thread_voluntary_context_switches|tikv_threads_io_bytes_total
         action: drop
       - source_labels: [__name__, name]
         separator: ;
         regex: tikv_thread_cpu_seconds_total;(tokio|rocksdb).+
         action: drop

2. If the wal directory is large, Prometheus may not be checkpointing in time, usually because of the large amount of collected data. You can grep for the keywords “Starting TSDB …” and “TSDB started” in log/prometheus.log to see whether Prometheus restarts frequently. Following step 1 will also reduce this.
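
For reference, a rough sketch of the two checks described above; the IPs are placeholders, 20180/10080/2379 are the default TiKV/TiDB/PD status ports, and the log path assumes the prometheus-9092 deploy directory mentioned earlier:

    # Count the metric lines each component exposes (hundreds of thousands is abnormal)
    curl -s http://<tikv-ip>:20180/metrics | wc -l
    curl -s http://<tidb-ip>:10080/metrics | wc -l
    curl -s http://<pd-ip>:2379/metrics | wc -l

    # Check whether Prometheus restarts and replays the WAL frequently
    grep -E "Starting TSDB|TSDB started" ./prometheus-9092/log/prometheus.log
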
| username: TiDBer_7Q5CQdQd | Original post link

Will the changes take effect immediately?

| username: Kongdom | Original post link

It should take effect immediately.

| username: kkpeter | Original post link

We have set it to 14 days online, retaining two weeks of data.

| username: 啦啦啦啦啦 | Original post link

It’s okay, it’s only 83G. By default, it retains data for 30 days. We keep it for 90 days because sometimes we need to check historical monitoring to troubleshoot issues. If you don’t need that much, you can reduce it a bit.

| username: xingzhenxiang | Original post link

Adjust the storage_retention policy.

| username: porpoiselxj | Original post link

If the automatic cleanup is behaving abnormally, you can consider scaling in the monitoring node first and then scaling it out again.

| username: 连连看db | Original post link

The monitoring machine’s disk is too small.

| username: 像风一样的男子 | Original post link

I previously posted about a similar issue, but I got busy with other things and didn’t address it.

| username: Soysauce520 | Original post link

Change the parameters to adjust the time and restart Prometheus.

| username: TiDBer_vfJBUcxl | Original post link

Modify the parameter storage_retention

| username: TiDBer_jYQINSnf | Original post link

Modify the scraping interval from the original 15 seconds to 1 minute.
Modify the retention policy.
As for the data already in the Prometheus directory, I deleted the block directories directly, from oldest to newest. Since it is only monitoring data, even if something goes wrong after deletion, scraping can simply start again from today.
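
As a sketch of the first two points (paths are examples): the scrape interval sits in the global section of the generated prometheus.yml under the Prometheus deploy directory, and the retention is set through storage_retention in the topology. Note that TiUP regenerates prometheus.yml on reload, so a manual edit of the interval may be overwritten.

    # conf/prometheus.yml (excerpt; assumed layout)
    global:
      scrape_interval: 1m        # default generated value is 15s
      evaluation_interval: 15s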

| username: WinterLiu | Original post link

The person above is right: you need to consider both the scrape frequency and the retention time.

| username: yulei7633 | Original post link

How can I check this storage_retention?

I tried querying it, but couldn’t find anything.

I can see where it is configured, but I don’t know how to query the current value.
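
One way to check it (a sketch, not the only way): the retention is passed to Prometheus as a command-line flag, and it is also visible in the cluster topology; <cluster-name> is a placeholder. storage_retention is a TiUP topology setting for the monitoring server, so it will not show up in SQL queries inside TiDB.

    # On the monitoring host, look at the flag the running Prometheus was started with
    ps -ef | grep prometheus | grep -o "storage.tsdb.retention[^ ]*"

    # Or open the topology and look for storage_retention under monitoring_servers
    tiup cluster edit-config <cluster-name>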