Eliminate Alarms, Adjust Parameters

translator_bot · June 21, 2024, 5:23pm

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 消除告警，调整参数

| username: TiDBer_Y2d2kiJh

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] v5.4.0 3tidb 3pd 3tikv
[Reproduction Path] Monitoring reports the alert TiDB_tikvclient_backoff_seconds_count, whose rule is increase(tidb_tikvclient_backoff_seconds_count[10m]) > 10. The alert does not affect the business, but it fires frequently. How can I eliminate it, and how can I raise the threshold for TiDB_tikvclient_backoff_seconds_count?

| username: Fly-bird

Edit the alert rule in the /tidb-deploy/prometheus-8249/conf/tidb.rules.yml file and raise the threshold (the "> 10" at the end of expr) to a value that suits your workload:

labels:
  env: tsp-prod-tidb-cluster
  level: warning
  expr: increase(tidb_tikvclient_backoff_seconds_count[10m]) > 10

Then restart the Prometheus service
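
As a rough sketch of the edit described above, the rule might look like this after raising the threshold. The value 50 is purely illustrative, and the `for: 1m` line is an assumption about what the file contains; keep whatever your own tidb.rules.yml already has for every field other than the threshold.

```yaml
# Sketch only: threshold raised from 10 to 50 (50 is a hypothetical value,
# not a recommendation; pick one based on your normal retry rate).
- alert: TiDB_tikvclient_backoff_seconds_count
  # The top-level expr is what Prometheus actually evaluates.
  expr: increase(tidb_tikvclient_backoff_seconds_count[10m]) > 50
  for: 1m  # assumed; leave your existing value untouched
  labels:
    env: tsp-prod-tidb-cluster
    level: warning
    # This label only echoes the expression in the alert; keep it in sync.
    expr: increase(tidb_tikvclient_backoff_seconds_count[10m]) > 50
```

If the expression appears both as the rule's top-level expr and as a label, change both, since only the top-level expr controls when the alert fires.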

`TiDB_tikvclient_backoff_seconds_count`

Alert rule: increase(tidb_tikvclient_backoff_seconds_count[10m]) > 10
Rule description: The number of retries TiDB initiates after an error accessing TiKV. The alert is triggered when the retry count increases by more than 10 within 10 minutes.