How to Optimize the Excessive Metrics Output from the TiKV Status Interface?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv状态接口输出metric过多,请问如何优化呢?

| username: DBRE

There are too many metrics output by the TiKV status interface. How can I optimize it?

Problems caused:

  1. Prometheus has to store a large amount of data
  2. Prometheus times out when scraping TiKV metrics; out of 8 TiKV nodes, only 3 have data. For details, see this post: Grafana展示tikv数量不对,如何解决? (Grafana shows the wrong number of TiKV nodes; how can this be fixed?) - TiDB Q&A Community

Dumping the metrics of one TiKV instance to a text file took 22 seconds; the file was 186 MB and contained a total of 2,534,332 lines (a shell sketch to reproduce this breakdown follows the list).
Among them:

  • tikv_thread_nonvoluntary_context_switches has 503,907 lines
  • tikv_thread_voluntary_context_switches has 503,907 lines
  • tikv_threads_io_bytes_total has 1,007,806 lines
  • tikv_thread_cpu_seconds_total has 503,911 lines
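For reference, a minimal shell sketch of how such a per-metric breakdown can be produced; the dump file name tikv-metrics.txt is just a placeholder:

    # Dump the status-interface output to a file (took ~22 s / 186 MB in this case).
    curl -s "http://${tikv_ip}:${tikv_status_port}/metrics" > tikv-metrics.txt

    # Tally lines per metric family: drop HELP/TYPE comments, strip the label
    # part after '{', then count how many lines each metric name has.
    grep -v '^#' tikv-metrics.txt \
      | awk '{ sub(/\{.*/, "", $1); count[$1]++ } END { for (m in count) print count[m], m }' \
      | sort -rn | head

In this case the four tikv_thread_* families account for the bulk of the 2.5 million lines.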
| username: WalterWj | Original post link

Theoretically, it shouldn’t be this much. It feels like there’s a problem.

| username: DBRE | Original post link

Yes, other TiDB clusters don’t have this many.

| username: buddyyuan | Original post link

These metrics do indeed have leakage issues.

Workaround: add the following metric_relabel_configs to the Prometheus job that scrapes TiKV:

    metric_relabel_configs:
      - source_labels: [__name__]
        separator: ;
        regex: tikv_thread_nonvoluntary_context_switches|tikv_thread_voluntary_context_switches|tikv_threads_io_bytes_total
        action: drop
      - source_labels: [__name__, name]
        separator: ;
        regex: tikv_thread_cpu_seconds_total;(tokio|rocksdb).+
        action: drop
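Once the rules are in place, one rough way to check how many series each of these families still has (and to confirm the drop rules are matching) is a cardinality query against the Prometheus HTTP API; a sketch, assuming Prometheus listens on localhost:9090:

    # Series count per tikv_thread* metric currently known to Prometheus.
    curl -s 'http://localhost:9090/api/v1/query' \
      --data-urlencode 'query=count by (__name__) ({__name__=~"tikv_thread.*"})'

Already-ingested series will keep showing up until they age out of retention, as the next reply notes.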
| username: buddyyuan | Original post link

Modify the Prometheus configuration. The drop action only affects newly scraped data; the old data that has already been stored will be cleaned up over time according to the Prometheus --storage.tsdb.retention setting.
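A rough sketch of applying the change (the prometheus.yml path is a placeholder; the HTTP reload endpoint only works if Prometheus was started with --web.enable-lifecycle, otherwise a SIGHUP achieves the same thing):

    # Validate the edited configuration before applying it.
    promtool check config /path/to/prometheus.yml

    # Reload without a restart (requires --web.enable-lifecycle) ...
    curl -X POST http://localhost:9090/-/reload

    # ... or send SIGHUP to the Prometheus process instead.
    kill -HUP "$(pgrep -x prometheus)"

Keep in mind that in a TiUP-managed cluster, prometheus.yml may be regenerated by tiup cluster reload, so a manual edit can be overwritten later.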

| username: DBRE | Original post link

We have already added this, but it only solves the storage issue and does not address the problem of only collecting data from 3 out of 8 TiKV nodes.

| username: buddyyuan | Original post link

Have your three TiKV instances been restarted?

| username: DBRE | Original post link

I haven’t restarted them; they’ve been running for a long time.

| username: DBRE | Original post link

Could someone please take a look? Thanks :rose:

| username: zzzzzz | Original post link

Do the nodes that aren’t being collected still return anything from their status interface?
Are all the node_exporter processes normal?

| username: DBRE | Original post link

  1. The command curl -s "http://${tikv_ip}:${tikv_status_port}/metrics" does produce output, but the response time ranges from several tens of seconds to minutes. The description of the output content can be found in the post above.
  2. node_exporter is running normally, but tikv-server metrics are not collected through node_exporter; they are scraped by the job configured in Prometheus.
| username: DBRE | Original post link

The large output of curl -s "http://${tikv_ip}:${tikv_status_port}/metrics" is most likely why the Prometheus scrape takes so long. But why is the output so large? The only difference between this cluster and the other clusters is the presence of partitioned tables.
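One way to see which labels are blowing up the output is to look at the per-thread metric families directly; a rough sketch, reusing the dump from earlier (the file name is a placeholder, and the label layout is assumed from the relabel rules above):

    # How many samples does the thread CPU family expose?
    grep -c '^tikv_thread_cpu_seconds_total' tikv-metrics.txt

    # How many distinct thread names (the "name" label) appear in it?
    grep '^tikv_thread_cpu_seconds_total' tikv-metrics.txt \
      | sed 's/.*name="\([^"]*\)".*/\1/' | sort -u | wc -l

If the number of distinct thread entries keeps climbing while the process runs (for example, one entry for every short-lived thread that has since exited), the families grow until the process is restarted, which would be consistent with the restart fixing it below.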

| username: DBRE | Original post link

After restarting TiKV, the metrics were retrieved normally. As expected, restarting works wonders.

| username: swino | Original post link

Divine power protection

| username: 有猫万事足 | Original post link

I encountered this issue in version 7.3 as well.

If you only find the TiKV metrics slow, it means your Prometheus machine is quite good.
When I had this issue, Prometheus had already started repeatedly restarting.

Later, I scanned the Prometheus targets interface and found that one of the PD’s metrics interfaces had a Scrape Duration of several minutes, and a direct call could return 1GB of data.
In the end, restarting the problematic PD resolved the issue.
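For anyone wanting to do the same scan from the command line, a rough sketch against the Prometheus targets API (the address is a placeholder; jq is assumed to be installed):

    # List active targets with their last scrape duration, slowest first.
    curl -s http://localhost:9090/api/v1/targets \
      | jq -r '.data.activeTargets[] | "\(.lastScrapeDuration)s \(.scrapeUrl) \(.health)"' \
      | sort -rn | head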

| username: DBRE | Original post link

It has nothing to do with the Prometheus machine. Our Prometheus is just an 8-core, 16 GB virtual machine. The issue is that the volume of TiKV metrics is so large that Prometheus times out while scraping them. As a result, no data is stored at all, so Prometheus itself is under no storage or performance pressure; the real problem is that Grafana cannot display monitoring data for the affected TiKV nodes.

| username: andone | Original post link

Restarting cures all problems.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.