Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: tikv状态接口输出metric过多,请问如何优化呢?
There are too many metrics output by the TiKV status interface. How can I optimize it?
Problems caused:
- Prometheus ends up storing a large amount of data
- Prometheus times out when fetching TiKV metrics; out of 8 TiKVs, only 3 have data. For details, see this post: Grafana展示tikv数量不对,如何解决? ("Grafana shows the wrong number of TiKVs, how can I fix it?") - TiDB Q&A community
Dumping one TiKV's metrics to a text file took 22 seconds; the file was 186 MB with a total of 2,534,332 lines (a counting sketch follows the list below).
Among them:
- tikv_thread_nonvoluntary_context_switches has 503,907 lines
- tikv_thread_voluntary_context_switches has 503,907 lines
- tikv_threads_io_bytes_total has 1,007,806 lines
- tikv_thread_cpu_seconds_total has 503,911 lines
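For reference, the per-metric counts above can be reproduced from such a dump with a short script. This is only a sketch; tikv_metrics.txt is a placeholder for whatever file the curl output was saved to.

```python
# Count how many lines (series) each metric family contributes in a
# text-format /metrics dump, and print the biggest offenders.
from collections import Counter

counts = Counter()
with open("tikv_metrics.txt") as f:  # placeholder path for the curl dump
    for raw in f:
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks and # HELP / # TYPE comments
            continue
        # The metric name is everything before the first '{' or whitespace.
        name = line.split("{", 1)[0].split()[0]
        counts[name] += 1

for name, n in counts.most_common(10):
    print(f"{n:>10}  {name}")
```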
Theoretically, it shouldn’t be this much. It feels like there’s a problem.
Yes, other TiDB clusters don’t have this many.
These metrics do indeed have a leakage issue.
Workaround:
metric_relabel_configs:
  - source_labels: [__name__]
    separator: ;
    regex: tikv_thread_nonvoluntary_context_switches|tikv_thread_voluntary_context_switches|tikv_threads_io_bytes_total
    action: drop
  - source_labels: [__name__, name]
    separator: ;
    regex: tikv_thread_cpu_seconds_total;(tokio|rocksdb).+
    action: drop
Modify the Prometheus configuration. The drop action only affects newly scraped data; existing data will be cleaned up over time according to the Prometheus --storage.tsdb.retention parameter.
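To confirm on the storage side which metric names actually dominate, one option is to ask the Prometheus TSDB status API (available in reasonably recent Prometheus versions). A minimal sketch, assuming Prometheus listens on localhost:9090:

```python
# Print the metric names that own the most series in Prometheus' head block.
# localhost:9090 is an assumption; substitute the real Prometheus address.
import json
import urllib.request

url = "http://localhost:9090/api/v1/status/tsdb"
with urllib.request.urlopen(url, timeout=10) as resp:
    data = json.load(resp)

for item in data["data"]["seriesCountByMetricName"]:
    print(f'{item["value"]:>10}  {item["name"]}')
```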
We have already added this, but it only solves the storage issue and does not address the problem of only collecting data from 3 out of 8 TiKV nodes.
Have your three TiKV instances been restarted?
They haven't been restarted; they've been running for a long time.
Could someone please take a look? Thanks 
Do the nodes that aren't being scraped return anything from their status interface?
Are all the node exporter processes normal?
The large output from curl -s "http://${tikv_ip}:${tikv_status_port}/metrics" is likely the reason the Prometheus job takes so long. But why is the output so large? The difference between this cluster and the other clusters is the presence of partitioned tables.
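To pin down which stores are actually slow or oversized, each status endpoint can be fetched and measured directly. A sketch, where the address list is a placeholder for the real ip:status_port pairs (20180 is the default TiKV status port):

```python
# Fetch /metrics from each TiKV status endpoint, timing the request and
# reporting the response size, so oversized or slow stores stand out.
import time
import urllib.request

# Placeholder addresses; fill in the real ip:status_port pairs of all 8 TiKVs.
tikv_status_addrs = ["10.0.0.1:20180", "10.0.0.2:20180"]

for addr in tikv_status_addrs:
    start = time.monotonic()
    with urllib.request.urlopen(f"http://{addr}/metrics", timeout=120) as resp:
        body = resp.read()
    elapsed = time.monotonic() - start
    lines = body.count(b"\n")
    print(f"{addr}: {len(body) / 1024 / 1024:.1f} MiB, {lines} lines, {elapsed:.1f}s")
```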
After restarting TiKV, the metrics were retrieved normally. As expected, restarting works wonders.
I encountered this issue in version 7.3 as well.
If you only find the TiKV metrics slow, it means your Prometheus machine is quite good.
When I had this issue, Prometheus had already started repeatedly restarting.
Later, I scanned the Prometheus targets interface and found that one of the PD’s metrics interfaces had a Scrape Duration of several minutes, and a direct call could return 1GB of data.
In the end, restarting the problematic PD resolved the issue.
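The same check can be scripted rather than eyeballed on the targets page. A sketch, again assuming Prometheus on localhost:9090, that sorts active targets by their last scrape duration:

```python
# List active scrape targets sorted by last scrape duration, so targets
# that are close to (or past) the scrape timeout show up first.
import json
import urllib.request

url = "http://localhost:9090/api/v1/targets?state=active"
with urllib.request.urlopen(url, timeout=10) as resp:
    targets = json.load(resp)["data"]["activeTargets"]

targets.sort(key=lambda t: t["lastScrapeDuration"], reverse=True)
for t in targets:
    print(f'{t["lastScrapeDuration"]:8.2f}s  {t["health"]:<7}  {t["scrapeUrl"]}')
```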
It has nothing to do with the Prometheus machine; ours is just an 8-core, 16 GB virtual machine. The issue is the sheer volume of TiKV metrics: Prometheus times out while scraping them, so no data gets stored, and naturally Prometheus itself feels no storage or performance pressure. The resulting problem is that Grafana cannot display the monitoring data for the affected TiKV nodes.
Restarting cures all problems.