Region is unavailable

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: Region is unavailable

| username: TiDBer_BraiRIcV

[TiDB Usage Environment] Production Environment
[TiDB Version] v5.4.2
[Encountered Problem]
We use DM to replicate from an upstream MySQL to TiDB. Yesterday the replication delay suddenly increased.

The TiDB cluster_processlist shows a large number of write statements piling up:
REPLACE INTO weplayzy.dd_recommend_new (

The TiDB logs report Region is unavailable:

[2022/11/02 15:29:58.257 +08:00] [WARN] [backoff.go:152] ["regionMiss backoffer.maxSleep 40000ms is exceeded, errors:\nepoch_not_match:<>  at 2022-11-02T15:29:56.63521457+08:00\nepoch_not_match:<>  at 2022-11-02T15:29:57.528003335+08:00\nepoch_not_match:<>  at 2022-11-02T15:29:58.257138997+08:00"]
[2022/11/02 15:29:58.257 +08:00] [WARN] [session.go:1719] ["run statement failed"] [conn=5429] [schemaVersion=423853] [error="[tikv:9005]Region is unavailable"] [session="{\n  \"currDBName\": \"\",\n  \"id\": 5429,\n  \"status\": 3,\n  \"strictMode\": false,\n  \"txn\": \"437092125073735682\",\n  \"user\": {\n    \"Username\": \"tidb_user\",\n    \"Hostname\": \"10.0.8.183\",\n    \"CurrentUser\": false,\n    \"AuthUsername\": \"tidb_user\",\n    \"AuthHostname\": \"%\"\n  }\n}"]
[2022/11/02 15:29:58.257 +08:00] [INFO] [conn.go:1117] ["command dispatched failed"] [conn=5429] [connInfo="id:5429, addr:10.0.8.183:30432 status:11, collation:utf8mb4_general_ci, user:tidb_user"] [command=Query] [status="inTxn:1, autocommit:1"] [sql="REPLACE INTO `weplayzy`.`dd_recommend_new` (`id`,`god_uid`,`category_id`,`online_score`,`order_score`,`time_score`,`pay_score`,`gift_score`,`price_score`,`game_price_score`,`fun_price_score`,`single_score`,`expired_score`,`speech_score`,`createtime_score`,`star_score`,`god_click_score`,`chat_num_score`,`chat_rate_score`,`room_click_score`,`room_enter_pay_score`,`order_all_score`,`score`,`score_rank`,`final_rank`,`nickname`,`update_time`) VALUES (12714,15082961,56,0,0,0,0,0,0.5599651414767486,0.5599651414767486,0,0,0,0,0.7223192902226421,0,0,0,0,0,0,0,0,3144,3144,_binary'兔啊兔',_binary'2022-11-01 14:00:48')"] [txn_mode=OPTIMISTIC] [err="[tikv:9005]Region is unavailable"]
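
To see which error dominated the retries, the backoff warning above can be summarized with a few lines of Python. This is only a minimal sketch: the regular expressions are assumptions derived from the single log line quoted in this post, and `summarize_backoff` is a hypothetical helper name, not a TiDB tool.

```python
import re
from collections import Counter

# Assumed patterns, derived only from the log line quoted in this thread.
MAX_SLEEP_RE = re.compile(r"(\w+) backoffer\.maxSleep (\d+)ms is exceeded")
# In the log file the separator is a literal backslash followed by "n".
ERROR_RE = re.compile(r"\\n(\w+):<.*?>\s+at ")

def summarize_backoff(line):
    """Return (backoff type, maxSleep in ms, Counter of error names), or None."""
    m = MAX_SLEEP_RE.search(line)
    if m is None:
        return None
    errors = Counter(ERROR_RE.findall(line))
    return m.group(1), int(m.group(2)), errors

# The warning from this post, kept as raw strings so "\n" stays two characters.
sample = (
    r'[2022/11/02 15:29:58.257 +08:00] [WARN] [backoff.go:152] '
    r'["regionMiss backoffer.maxSleep 40000ms is exceeded, errors:\n'
    r'epoch_not_match:<>  at 2022-11-02T15:29:56.63521457+08:00\n'
    r'epoch_not_match:<>  at 2022-11-02T15:29:57.528003335+08:00\n'
    r'epoch_not_match:<>  at 2022-11-02T15:29:58.257138997+08:00"]'
)

print(summarize_backoff(sample))
```

For the line above, every recorded retry failed with epoch_not_match, which points at Region epoch changes (split, merge, or membership change) rather than at a busy or unreachable TiKV.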

At the same time, queries against weplayzy.dd_recommend_new hang:
select * from weplayzy.dd_recommend_new limit 10;

Is the data of weplayzy.dd_recommend_new located in an unavailable Region? How should this situation be handled?

| username: xfworld | Original post link

Has the old Region been split into two new ones? Observe for a while and check first.

| username: TiDBer_BraiRIcV | Original post link

The replication delay has persisted for 24 hours, since yesterday afternoon. Is Region splitting really this slow?

| username: TiDBer_BraiRIcV | Original post link

I checked, and the data of dd_recommend_new sits in a single Region. If that Region is in the middle of a split, will queries on this table stall?

| username: xfworld | Original post link

Something seems off. Region splitting and scheduling should be very fast: at most a single failure triggers a backoff, and once the split and scheduling complete, requests should succeed again.

I suggest checking the status of each TiKV node, as well as the health and replica count of the relevant regions.
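
One rough way to act on this is to dump the Region's metadata from PD (for example with pd-ctl `region <region_id>`) and check that it has a leader and the expected number of peers. The sketch below assumes the JSON carries `peers`, `leader`, `down_peers`, and `pending_peers` fields, as pd-ctl output typically does. The sample document and the `check_region` helper are hypothetical, and the expected replica count of 3 is the common default, not something stated in this thread.

```python
import json

def check_region(region_json, expected_replicas=3):
    """Return a list of human-readable problems found in one Region's metadata."""
    region = json.loads(region_json)
    problems = []
    peers = region.get("peers", [])
    leader = region.get("leader") or {}
    if not leader.get("store_id"):
        problems.append("no leader")
    if len(peers) != expected_replicas:
        problems.append(f"{len(peers)} peers, expected {expected_replicas}")
    if region.get("down_peers"):
        problems.append("has down peers")
    if region.get("pending_peers"):
        problems.append("has pending peers")
    return problems

# Hypothetical sample, not taken from the cluster in this thread.
sample = json.dumps({
    "id": 1001,
    "epoch": {"conf_ver": 5, "version": 42},
    "peers": [{"id": 1002, "store_id": 1}, {"id": 1003, "store_id": 4}],
    "leader": {},
})

print(check_region(sample))
```

An empty list means none of these basic checks failed; it does not prove that the Region is serving requests.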

| username: h5n1 | Original post link

Client Reports Region is Unavailable Error

  • 1.1.1 Region is Unavailable is usually reported because a Region has been unavailable for a period of time: TiKV returns server is busy, a request to TiKV is rejected with not leader or epoch not match, or a request to TiKV times out. TiDB retries internally with backoff. If the accumulated backoff time exceeds a threshold (20s by default), the error is reported to the client; if it stays within the threshold, the client never sees the error.
  • 1.1.2 Multiple TiKV instances run out of memory (OOM) at the same time, leaving the Region without a Leader during the OOM period. See case-991.
  • 1.1.3 TiKV reports server is busy and the backoff time is exceeded. Refer to 4.3 Client Reports server is busy Error. server is busy is part of TiKV's internal flow control mechanism and may not be counted toward the backoff time in the future.
  • 1.1.4 Multiple TiKV instances fail to start, leaving the Region without a Leader. This happens when several TiKV instances are deployed on one physical host with an incorrect label configuration and that host fails. See case-228.
  • 1.1.5 A Follower whose apply lags behind becomes the Leader and rejects incoming requests with epoch not match. See case-958 (TiKV needs to optimize this mechanism internally).
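
The retry behaviour described in 1.1.1 can be illustrated with a toy model. This is not TiDB's backoff code: `run_with_backoff`, the doubling sleep schedule, and the 20000 ms budget are assumptions chosen to mirror the description above (the log in the first post shows a 40000 ms budget for regionMiss). Sleeps are only added up, never actually performed.

```python
class RegionUnavailable(Exception):
    """Raised when the accumulated backoff exceeds the budget."""

def run_with_backoff(request, max_sleep_ms=20000, base_ms=2, cap_ms=500):
    """Retry `request` until it succeeds or total backoff exceeds max_sleep_ms.

    `request` is a callable returning (ok, error_name).
    Returns the total simulated sleep in milliseconds on success.
    """
    total_ms = 0
    sleep_ms = base_ms
    while True:
        ok, err = request()
        if ok:
            return total_ms
        total_ms += sleep_ms  # simulated sleep before the next attempt
        if total_ms > max_sleep_ms:
            raise RegionUnavailable(
                f"maxSleep {max_sleep_ms}ms is exceeded, last error: {err}"
            )
        sleep_ms = min(sleep_ms * 2, cap_ms)  # exponential growth with a cap

def flaky(failures):
    """Build a request that fails `failures` times, then succeeds."""
    state = {"left": failures}
    def request():
        if state["left"] > 0:
            state["left"] -= 1
            return False, "epoch_not_match"
        return True, None
    return request

# A short outage is absorbed by the retries and stays invisible to the client.
print(run_with_backoff(flaky(5)))

# A persistent failure exhausts the budget and surfaces as an error.
try:
    run_with_backoff(lambda: (False, "epoch_not_match"))
except RegionUnavailable as exc:
    print("client sees:", exc)
```

In this model a statement that keeps hitting epoch_not_match blocks for the whole budget before failing, which is consistent with each REPLACE in this thread taking tens of seconds.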

| username: TiDBer_BraiRIcV | Original post link

Do you have a specific solution?

| username: h5n1 | Original post link

Has the volume of upstream data changes changed?

| username: TiDBer_BraiRIcV | Original post link

The data volume of this table hasn't changed much. The problem is that the Region is unavailable error makes each REPLACE into dd_recommend_new very slow (nearly 1 minute per execution), and I don't know how to handle it.

| username: h5n1 | Original post link

This kind of internal mechanism is not easy to intervene in manually. Check whether TiKV was busy during the problem period.

| username: TiDBer_BraiRIcV | Original post link

TiKV shows no load. In that case, can the table's Region be adjusted?

| username: h5n1 | Original post link

You can split the Regions manually.

Or, for a specific Region, use pd-ctl: operator add split-region <region_id> --policy=approximate
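
For reference, the pd-ctl call above can be assembled as follows when pd-ctl is run through `tiup ctl`. This is a sketch under assumptions: the PD address and Region ID are placeholders, `split_region_cmd` is a hypothetical helper, and the function only builds the argument list without contacting the cluster.

```python
import shlex

# Policies accepted by pd-ctl's `operator add split-region`.
POLICIES = ("approximate", "scan")

def split_region_cmd(region_id, pd_addr, version="v5.4.2", policy="approximate"):
    """Build the argv for splitting one Region via `tiup ctl:<version> pd`."""
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    if int(region_id) <= 0:
        raise ValueError("region_id must be a positive integer")
    return [
        "tiup", f"ctl:{version}", "pd", "-u", pd_addr,
        "operator", "add", "split-region", str(int(region_id)),
        f"--policy={policy}",
    ]

# Placeholder values; substitute the real PD endpoint and Region ID.
argv = split_region_cmd(12345, "http://127.0.0.1:2379")
print(shlex.join(argv))
```

Passing `argv` to `subprocess.run(argv, check=True)` would execute the command. Building a list rather than a shell string avoids quoting problems.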

| username: TiDBer_BraiRIcV | Original post link

Thank you, restarting the TiDB cluster and re-syncing with DM resolved the issue.
