How to Resolve/Mitigate High Backoff Caused by TiDB RegionMiss

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIDB regionMiss导致 Backoff太高怎么解决/缓解

| username: jaybing926

[TiDB Usage Environment] Production Environment
[TiDB Version] v5.4.3
[Encountered Problem: Phenomenon and Impact]
Recently, we have frequently encountered the tidb_tikvclient_backoff_seconds_count alert. We had previously adjusted the alert threshold to 3000, but today it alerted again, reaching an astonishing 30,000.

Upon checking the monitoring, we found that it was caused by regionMiss.

How should I troubleshoot this to determine whether it is abnormal?
And if it is just normal Region scheduling, how can we mitigate the problem?
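
For reference, the counter that drives this alert can be broken down by backoff type directly from Prometheus to confirm that regionMiss dominates. A minimal sketch, assuming the default Prometheus port 9090, with `<prometheus-addr>` as a placeholder for the monitoring host:

```shell
# Query the backoff counter rate, grouped by backoff type (regionMiss, tikvRPC, ...).
curl -G 'http://<prometheus-addr>:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(tidb_tikvclient_backoff_seconds_count[5m])) by (type)'
```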

| username: 江湖故人 | Original post link

Is the leader-schedule-limit set too high?

| username: yiduoyunQ | Original post link

Is there a TiKV node down?

| username: 连连看db | Original post link

I just received an alert for the same issue, but the TiKV nodes haven’t crashed.

| username: dba远航 | Original post link

It might be caused by outdated metadata in PD.

| username: jaybing926 | Original post link

Are you referring to the region-schedule-limit parameter? My region-schedule-limit is on the high side, but leader-schedule-limit is not:

"region-schedule-limit": 2048,
"leader-schedule-limit": 4,

region-schedule-limit at 2048 feels too high; lowering it a bit should alleviate the problem, right?
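
For reference, a minimal sketch of inspecting and lowering the limit with pd-ctl via tiup; `<pd-addr>` is a placeholder and 1024 is only an illustrative value, not a recommendation:

```shell
# Show the current scheduling limits.
tiup ctl:v5.4.3 pd -u http://<pd-addr>:2379 config show | grep schedule-limit
# Lower the concurrency of region scheduling operators (illustrative value).
tiup ctl:v5.4.3 pd -u http://<pd-addr>:2379 config set region-schedule-limit 1024
```

The change takes effect immediately and persists in PD, so it can be rolled back the same way.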

| username: jaybing926 | Original post link

TiKV didn’t crash. Upon checking the monitoring, I found that the query traffic was a bit high during this period, which coincides with the time of the alert. However, I don’t quite understand why the queries would cause this issue.
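
One plausible mechanism, for what it's worth: heavy reads concentrated on a few hot Regions can trigger load-based splits or rebalancing, and each moved or split Region invalidates TiDB's local region cache, producing regionMiss backoffs until the cache is refreshed from PD. A sketch of checking for hot read Regions via information_schema; `<tidb-addr>` is a placeholder:

```shell
# List the tables whose read traffic concentrates on the hottest Regions.
mysql -h <tidb-addr> -P 4000 -u root -e "
  SELECT db_name, table_name, type, region_count, flow_bytes
  FROM information_schema.tidb_hot_regions
  WHERE type = 'read'
  ORDER BY flow_bytes DESC
  LIMIT 10;"
```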

| username: jaybing926 | Original post link

The default threshold for this alert is fairly low and can safely be raised. In my case, though, the value is far above the threshold, and it is caused by regionMiss.

| username: redgame | Original post link

If there is a region miss, you can adjust the number of region replicas and the leader election strategy.
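
A sketch of inspecting the current replica settings and active schedulers with pd-ctl before changing either; `<pd-addr>` is a placeholder:

```shell
# Replica count (max-replicas) and placement settings live under the replication config.
tiup ctl:v5.4.3 pd -u http://<pd-addr>:2379 config show replication
# List the schedulers currently enabled in PD.
tiup ctl:v5.4.3 pd -u http://<pd-addr>:2379 scheduler show
```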

| username: 江湖故人 | Original post link

The scheduling parameters are not high; those are the default values. It is more likely caused by abnormally heavy TiKV read I/O. From the screenshot, the average is about 500 MB/s. If the QPS is otherwise normal, you can refer to the article compiled by a fellow community member [1] to optimize slow SQL and reduce full table scans.
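
A sketch of pulling the slowest statements from the cluster slow log for the alert window; `<tidb-addr>`, `<start>`, and `<end>` are placeholders, and large total_keys values relative to the rows returned usually point to full table scans:

```shell
# Top 10 slowest statements in the alert window, with key-scan counts.
mysql -h <tidb-addr> -P 4000 -u root -e "
  SELECT time, query_time, total_keys, process_keys, LEFT(query, 80) AS stmt
  FROM information_schema.cluster_slow_query
  WHERE time BETWEEN '<start>' AND '<end>'
  ORDER BY query_time DESC
  LIMIT 10;"
```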

| username: jaybing926 | Original post link

In my cluster, 500 MB/s is quite normal. When br was backing up data, read traffic reached 8 GB/s without triggering this issue.

| username: IanWong | Original post link

Check the types of SQL statements during this period for any issues.

| username: IanWong | Original post link

It is recommended to find the cause of the heavy Region scheduling during this period, combined with the types of statements running, for example:

  1. A large number of queries triggering load-balancing scheduling of Regions
  2. A large number of inserts causing Region splits
  3. GC cleanup leaving empty Regions behind

If the scheduling turns out to be normal, you can limit it to alleviate the issue (see the sketch below).
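
A sketch of checking what PD is actually scheduling and whether empty Regions are piling up; `<pd-addr>` is a placeholder:

```shell
# Operators PD is currently executing (balance-region, merge-region, split, ...).
tiup ctl:v5.4.3 pd -u http://<pd-addr>:2379 operator show
# Empty Regions typically left behind by GC or DROP TABLE.
tiup ctl:v5.4.3 pd -u http://<pd-addr>:2379 region check empty-region
# Region merge cleans these up; confirm it is not disabled.
tiup ctl:v5.4.3 pd -u http://<pd-addr>:2379 config show | grep merge
```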
| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.