How to Handle the TiDB tikvclient_backoff_count Error Alert? It Hasn't Recovered for Half a Day

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB告警如何处理TiDB tikvclient_backoff_count error,半天了没有恢复

| username: TiDBer_oqrCNpbV

[TiDB Usage Environment] Production Environment
[TiDB Version] v6.1.0
[Reproduction Path] None
[Encountered Problem: Phenomenon and Impact]
The tidb_tikvclient_backoff_seconds_count alert fires constantly and does not recover. The cluster status appears normal. How should this be handled?
Alert Content:
[Metric]: :red_circle: TiDB tikvclient_backoff_count error
[Description]: cluster: tidb-iap, instance: , values: 404.1025641025641
[Start Time]:

[Details]:
alertname: tidb_tikvclient_backoff_seconds_count
cluster: tidb-iap
env: tidb-iap
expr: increase(tidb_tikvclient_backoff_seconds_count[10m]) > 10

| username: Jasper | Original post link

Under normal circumstances, region scheduling causes some backoff. If the amount is not particularly large, there is no need for special handling. You can check the TiDB → KV Errors panel in Grafana monitoring to see the specifics of the backoff.
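The KV Errors panel is driven by a query along these lines; running it directly in Prometheus breaks the backoff down by type, which shows whether the source is region scheduling (e.g. regionMiss) or something like TiKV RPC failures. A sketch, assuming the standard label names from the stock TiDB Grafana dashboards:

```promql
# Backoff over the alert window, split by backoff type and TiDB instance
# (label names assumed from the standard TiDB dashboards)
sum(increase(tidb_tikvclient_backoff_seconds_count[10m])) by (type, instance)
```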

| username: tidb菜鸟一只 | Original post link

Backoff mainly depends on your cluster load: if the cluster is very busy, backoff will indeed be more frequent. Check the historical values and set an appropriate alert threshold.

| username: 哈喽沃德 | Original post link

  1. Check TiKV Status: First, ensure that the TiKV nodes are running normally and there are no obvious signs of high load or other abnormal conditions. You can check the status information of the nodes through the TiKV monitoring interface or logs.
  2. Check Network Connection: Verify that the network connection between TiDB and TiKV nodes is normal and stable, ruling out the possibility of network fluctuations or failures.
  3. Adjust TiKV Configuration: Adjust TiKV configuration parameters based on the actual situation, such as modifying Region-related parameters or Raft parameters, to reduce the frequency of RPC calls.
  4. Upgrade TiDB Version: Consider upgrading TiDB to the latest version, as it may have fixed related bugs or optimized performance.
  5. Increase TiKV Nodes: If high load is causing tikvclient_backoff_count errors, consider increasing the number of TiKV nodes to distribute the load.

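For steps 1 and 2 above, besides the monitoring UI, the TiDB logs themselves record each backoff with its reason. A minimal sketch of scanning a log for backoff lines; the inlined sample log is fabricated for illustration, so point `LOG` at your real tidb.log:

```shell
# Count backoff-related lines in a TiDB log.
# The sample log below is made up for illustration; replace $LOG
# with the path to your actual tidb.log to inspect real backoffs.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
[2024/01/01 10:00:00.000 +08:00] [WARN] [backoff.go:158] ["regionMiss backoffer.maxSleep 40000ms is exceeded"]
[2024/01/01 10:00:01.000 +08:00] [INFO] [session.go:100] ["unrelated log line"]
EOF
grep -c 'backoff' "$LOG"   # number of lines mentioning backoff
```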
| username: dba远航 | Original post link

It should be caused by the metadata in PD not being updated in time.

| username: redgame | Original post link

Please provide the complete log.

| username: 像风一样的男子 | Original post link

This metric counts the retries (backoffs) that TiDB initiates when it encounters an error accessing TiKV; the alert fires when the count increases by more than 10 within 10 minutes.
I think the threshold of 10 is too low and can be raised appropriately.
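If you decide to raise it, the change goes in the Prometheus alert rule rather than in TiDB itself. A sketch of what the adjusted rule might look like; the exact rule-file layout depends on how your monitoring stack was deployed, and the value 100 here is only an example to be chosen from your historical baseline:

```yaml
# Hypothetical adjusted rule; field layout follows standard Prometheus
# alerting rules, and the threshold value is an example only.
- alert: tidb_tikvclient_backoff_seconds_count
  expr: increase(tidb_tikvclient_backoff_seconds_count[10m]) > 100
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "TiDB tikvclient backoff count is high"
```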

| username: TiDBer_aaO4sU46 | Original post link

Is there any information about backoff?

| username: 小于同学 | Original post link

You can consider adding resources and giving it a try.