sync_diff_inspector data comparison reports "Region is unavailable"

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: sync_diff_inspector对比数据,提示Region is unavailable

| username: TiDBer_OB4kHrS7

【TiDB Usage Environment】Production Environment
【TiDB Version】V5.3.3
【Reproduction Path】sync_diff_inspector comparing disaster recovery environment data
【Encountered Problem: Problem Phenomenon and Impact】
When sync_diff_inspector compares data between the primary database and the disaster recovery environment, it reports “Region is unavailable” and the comparison program exits.
【Resource Configuration】Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
【Attachments: Screenshots/Logs/Monitoring】

| username: Anna | Original post link

If the load pressure on TiFlash is too high, it may cause TiFlash data synchronization to lag, and some queries may return a Region Unavailable error.

In this case, you can add more TiFlash nodes to share the load.
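
If scaling out is the chosen route, adding a TiFlash node with TiUP looks roughly like the following. A minimal sketch; the host IP and the cluster name are placeholders, not values from this thread:

```shell
# Sketch: scale out one TiFlash node via TiUP; host and cluster name are placeholders.
cat > scale-out-tiflash.yml <<'EOF'
tiflash_servers:
  - host: 10.0.1.5
EOF

tiup cluster scale-out <cluster-name> scale-out-tiflash.yml
```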

| username: Anna | Original post link

Please take a look at this TiDB community column: 专栏 - Region is unavailable的排查总结 | TiDB 社区 (“A troubleshooting summary for Region is unavailable”).

| username: TiDBer_OB4kHrS7 | Original post link

There are no TiFlash nodes, only TiKV nodes.

| username: Anna | Original post link

The “Region is unavailable” error in TiDB is reported when the backoff time exceeds its threshold (20 seconds), which various underlying issues can cause. Common causes include:

  1. More than half of the TiKV or TiFlash replicas are unavailable or restart at the same time, causing a multi-replica failure in the Raft group. Note that the number of problematic TiKV instances the cluster can tolerate depends on a majority of each Region’s replicas staying available, not on the number of hosts running TiKV.

  2. No leader is accessible within the backoff time:

    (1) TiKV is very busy, and the Region does not elect a leader within the backoff time;

    (2) The Region has problems and cannot elect a leader (abnormal Regions can be listed with pd-ctl; see the sketch after this list);

    (3) A Region split takes too long.

  3. During a Region split/merge, a follower applies slowly, so after a leader switch the split/merge operation is not synchronized within the backoff time.

  4. Other situations, such as incomplete version upgrades, bugs, and so on. For example: TiDB 5.3.3 Release Notes | PingCAP Docs
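
For cause 2 above, abnormal Regions can be surfaced with pd-ctl’s `region check` subcommands. A sketch, run through `tiup ctl` matching the cluster version; the PD address is a placeholder:

```shell
# Sketch: list abnormal Regions with pd-ctl; the PD address is a placeholder.
tiup ctl:v5.3.3 pd -u http://127.0.0.1:2379 region check down-peer     # Regions with down replicas
tiup ctl:v5.3.3 pd -u http://127.0.0.1:2379 region check pending-peer  # Regions with lagging replicas
tiup ctl:v5.3.3 pd -u http://127.0.0.1:2379 region check miss-peer     # Regions missing replicas
```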

| username: TiDBer_OB4kHrS7 | Original post link

Which specific monitoring metrics should we look at?

| username: 像风一样的男子 | Original post link

Set the TSO in the configuration file.

| username: TiDBer_OB4kHrS7 | Original post link

I don’t understand. Which configuration file should the TSO be set in?

| username: 像风一样的男子 | Original post link

The snapshot option keeps the point-in-time views of the two databases consistent.
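
For reference, in the sync_diff_inspector TOML config the snapshot is set per data source. A minimal sketch, assuming the v2.0-style config layout shipped with TiDB 5.x; the hosts, credentials, table filter, and snapshot values are placeholders:

```toml
# Sketch of a sync_diff_inspector config with per-source snapshots.
# Hosts, credentials, table filter, and snapshot values are placeholders.
check-thread-count = 8

[data-sources]
[data-sources.upstream]
    host = "172.16.0.1"
    port = 4000
    user = "root"
    password = ""
    snapshot = "2023-01-01 00:00:00"   # a time string or a TSO

[data-sources.downstream]
    host = "172.16.0.2"
    port = 4000
    user = "root"
    password = ""
    snapshot = "2023-01-01 00:00:00"   # must map to the same logical point as upstream

[task]
    output-dir = "./output"
    source-instances = ["upstream"]
    target-instance = "downstream"
    target-check-tables = ["mydb.*"]
```

With replication in between, the two snapshots should correspond to the same replicated position, not merely the same wall-clock time.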

| username: xingzhenxiang | Original post link

It feels like there aren’t enough resources.

| username: TiDBer_OB4kHrS7 | Original post link

The times are consistent; if they were inconsistent, an error would be reported immediately, not only partway through the comparison.

| username: TiDBer_OB4kHrS7 | Original post link

From the monitoring, the resources are sufficient, and no related bottlenecks are observed.

| username: jansu-dev | Original post link

  1. “Region is unavailable” should not originate from sync_diff itself; this error is returned by TiDB/TiKV.
  2. The implementation of sync_diff is fairly simple: it is essentially a standalone executable, and if it fails, it fails; most errors are returned by other components. You can check tidb.log and tikv.log for the cause of the unavailability (a grep sketch follows below).
  3. Troubleshooting along the lines of the column Anna shared is also a feasible approach.
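
As a sketch, grepping the component logs around the failure time might look like this; the log paths are typical TiUP defaults and may differ in your deployment:

```shell
# Sketch: search the TiDB log for the error around the failure time.
grep -i "region is unavailable" /tidb-deploy/tidb-4000/log/tidb.log

# Common Region-error keywords on the TiKV side (exact strings vary by version).
grep -iE "not leader|epoch not match|region not found" /tidb-deploy/tikv-20160/log/tikv.log
```
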
| username: TiDBer_OB4kHrS7 | Original post link

No relevant errors were found in tidb.log or tikv.log. It’s just that this database has a relatively large number of tables, nearly 9,000. Every time, whether during busy or idle hours, the comparison runs for about 15 minutes and then this error occurs.

| username: TiDBer_OB4kHrS7 | Original post link

Can this error tell us whether it comes from the source or the target?

| username: jansu-dev | Original post link

It should be an issue triggered during processing on the source end.

From the stack trace, the problem occurred while stepping into this call path:
AnalyzeSplitter -> NewRandomIteratorWithCheckpoint
Although I haven’t looked into it in detail, this step should be sending some SQL to TiDB; it’s strange that there are no clues.
You can also enable sync_diff’s debug-level log to check (a sketch follows below).
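
Assuming the binary supports a log-level option (the older diff binary took `-L`; confirm with `./sync_diff_inspector --help`), a debug run would look roughly like:

```shell
# Sketch: rerun the comparison with debug logging; the -L flag is an
# assumption here, so confirm it with --help before relying on it.
./sync_diff_inspector --config=./config.toml -L debug > sync_diff.log 2>&1 &

# Watch for the SQL issued around the AnalyzeSplitter / random-iterator step.
tail -f sync_diff.log | grep -iE "select|region is unavailable"
```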

| username: TiDBer_OB4kHrS7 | Original post link

The above is the content of the sync_diff log. I’d rather not enable the debug log; if enabled, it would be quite large. Previously, when the comparison ran normally, it used 8 threads and took 9 hours to complete.

| username: jansu-dev | Original post link

  1. Right now this is just investigating the issue; if it reproduces consistently, a run shouldn’t need the full 9 hours.
  2. If it cannot be reproduced consistently, we need to follow up on that function (the metrics, the full tidb.log around the upstream error time point, and the full sync_diff log all need to be reviewed). The general approach is to follow the ideas the posters above provided and eliminate possibilities one by one, or capture what requests (SQL) this function sends and narrow down from there (a sketch follows after this list).
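
To capture that SQL without sync_diff’s debug log, one option is to briefly enable TiDB’s general log on the source instance. A sketch; the host and credentials are placeholders, and `tidb_general_log` is instance-scoped and very verbose, so switch it off right after reproducing:

```shell
# Sketch: toggle TiDB's general log to record every statement sync_diff sends;
# entries land in tidb.log. Host/credentials are placeholders.
mysql -h 172.16.0.1 -P 4000 -u root -e "SET tidb_general_log = ON;"
# ... reproduce the failure, then turn it off again:
mysql -h 172.16.0.1 -P 4000 -u root -e "SET tidb_general_log = OFF;"
```
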
| username: TiDBer_OB4kHrS7 | Original post link

It can now be reproduced consistently: the error always occurs after running for about 15 minutes. This has been going on for several days, and the fault feels quite difficult to eliminate. Also, I don’t dare to run comparison tasks during the day.

| username: jansu-dev | Original post link

Okay, then let’s run it tonight. Running it is really just to gather more data points. Directly following approach 2 should also reveal something (the metrics, the full tidb.log around the upstream error time point, and the full sync_diff log all need to be checked).