TiFlash executing SQL reports region is unavailable

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tiflash 执行sql报region is unavilable

| username: qiuxb

【TiDB Usage Environment】Production Environment or Test Environment or POC
【TiDB Version】
v4.0.8
【Problem Encountered】
Executing certain aggregate SQL queries results in "region is unavailable." The execution plan shows that these queries (sum ... group by) are automatically routed to TiFlash for execution.
【Reproduction Path】What operations were performed to encounter the problem
【Problem Phenomenon and Impact】
Executing a certain aggregation SQL statement, mainly sum ... group by, over a data volume of about 3 million rows reports "region is unavailable."
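
A minimal sketch of how the storage-engine choice can be verified (the table and column names below are hypothetical; the real names are not given in the thread):

-- Confirm where the optimizer sends the aggregation: a TiFlash plan shows
-- cop[tiflash]/batchCop[tiflash] tasks, a TiKV plan shows cop[tikv] tasks.
explain select col1, sum(col2) from db_example.t_example group by col1;

-- For comparison, the same query can be pinned to TiKV with an optimizer hint
-- (available since v4.0). If the TiKV run succeeds, the problem is specific to the TiFlash replica.
explain select /*+ READ_FROM_STORAGE(TIKV[t_example]) */ col1, sum(col2) from db_example.t_example group by col1;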

At the same time, the TiFlash error logs contain the following error messages:

2022.09.11 22:08:38.041389 [329026351] CoprocessorHandler: grpc::Status DB::CoprocessorHandler::execute(): RegionException: region 1683245, message: NOT_FOUND: (while creating InputStreams from storage db_48.t_1858, table_id: 1858)
2022.09.11 22:08:42.969253 [329021362] DAGQueryBlockInterpreter: Check after read from Storage, region 1683240, version 1560, handle range [2799039587, 2799056972), status VERSION_ERROR
2022.09.11 22:08:42.974485 [329021362] DAGQueryBlockInterpreter: RegionException after read from storage, regions [1683240,], message: VERSION_ERROR, retry to read from local
2022.09.11 22:08:43.182245 [329021421] DAGQueryBlockInterpreter: Check after read from Storage, region 1683240, version 1560, handle range [2799039587, 2799056972), status VERSION_ERROR
2022.09.11 22:08:43.206093 [329021421] DAGQueryBlockInterpreter: RegionException after read from storage, regions [1683240,], message: VERSION_ERROR, retry to read from local
2022.09.11 22:08:46.022185 [329033088] DAGQueryBlockInterpreter: Check after read from Storage, region 1683240, version 1560, handle range [2799039587, 2799056972), status VERSION_ERROR
2022.09.11 22:08:46.022305 [329021933] DAGQueryBlockInterpreter: Check after read from Storage, region 1683240, version 1560, handle range [2799039587, 2799056972), status VERSION_ERROR
2022.09.11 22:08:46.040664 [329033088] DAGQueryBlockInterpreter: RegionException after read from storage, regions [1683240,], message: VERSION_ERROR, retry to read from local
2022.09.11 22:08:46.053768 [329021933] DAGQueryBlockInterpreter: RegionException after read from storage, regions [1683240,], message: VERSION_ERROR, retry to read from local
2022.09.11 22:09:02.994684 [329037158] pingcap.tikv: Failed4: Deadline Exceeded
2022.09.11 22:09:03.233701 [329037232] pingcap.tikv: Failed4: Deadline Exceeded

【Attachments】Related logs and monitoring (https://metricstool.pingcap.com/)



| username: qiuxb | Original post link

To add some information: TiFlash has only one node, the host CPU usage often runs at full capacity, and the memory usage is about 50%.

| username: xiaohetao | Original post link

You can check the health status of some tables and regions, specifically check 2799039587 and 2799056972.

The health status of regions can be viewed through Grafana.
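
For the table side, the statistics health score can be checked with SQL (a sketch; the database and table names are placeholders):

-- Healthy is a 0-100 score per table/partition; low values suggest stale statistics.
show stats_healthy where Db_name = 'your_db' and Table_name = 'your_table';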

| username: HACK | Original post link

Check the time when the error occurred, the resource usage of the node, and the concurrency at that time. It is highly likely related to your resource usage.

| username: alfred | Original post link

Are there any error logs on the KV node? Has the old version been garbage collected?
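
Whether old MVCC versions have been garbage collected can be checked from the GC records in the mysql.tidb table (a sketch):

-- tikv_gc_life_time, tikv_gc_last_run_time, tikv_gc_safe_point, etc.
select VARIABLE_NAME, VARIABLE_VALUE from mysql.tidb where VARIABLE_NAME like 'tikv_gc%';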

| username: qiuxb | Original post link

How do I check the health information of a region? I can check the table's health: some partitions are around 50, and most are between 70 and 100.

| username: xiaohetao | Original post link

Monitoring of regions:
Log in to Grafana monitoring: PD —> Region health dashboard
Example: empty-region-count: Number of empty regions. If there are too many empty regions, they need to be merged.
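
Besides Grafana, a rough count can also be obtained through SQL (a sketch; PD's empty-region-count metric remains the authoritative number):

-- Regions reporting zero keys are approximately the empty regions.
select count(*) from information_schema.TIKV_REGION_STATUS where APPROXIMATE_KEYS = 0;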

| username: xiaohetao | Original post link

Check which table table_id 1858 is, see which TiKV nodes the corresponding region is located on, and then inspect the disks as follows:

Are you using SSDs? It is recommended to check the disks of those TiKV nodes for bad sectors.
Has this cluster ever had nodes removed or experienced any TiKV disk anomalies?

| username: qiuxb | Original post link

The image you posted is not visible.

| username: qiuxb | Original post link

If the merge is not performed, will it trigger the "region is unavailable" error? When I execute this SQL manually, the issue only reproduces occasionally; it does not fail every time.

| username: qiuxb | Original post link

I previously took a TiFlash node offline, and its status changed to Tombstone. However, after a night, when the disk usage decreased, it automatically recovered. I’m not very familiar with checking the KV nodes where the regions are located. How do I check this?

| username: qiuxb | Original post link

KV node logs only contain some WARN:
[2022/09/13 16:52:51.508 +08:00] [WARN] [endpoint.rs:527] [error-response] [err="Region error (will back off and retry) message: "peer is not leader for region 1676462, leader may Some(id: 1676465 store_id: 4)" not_leader { region_id: 1676462 leader { id: 1676465 store_id: 4 } }"]
[2022/09/13 16:52:51.509 +08:00] [WARN] [endpoint.rs:527] [error-response] [err="Region error (will back off and retry) message: "peer is not leader for region 1694835, leader may Some(id: 1694838 store_id: 4)" not_leader { region_id: 1694835 leader { id: 1694838 store_id: 4 } }"]
[2022/09/13 16:52:51.940 +08:00] [WARN] [endpoint.rs:527] [error-response] [err="Region error (will back off and retry) message: "peer is not leader for region 1672718, leader may Some(id: 1672721 store_id: 4)" not_leader { region_id: 1672718 leader { id: 1672721 store_id: 4 } }"]
[2022/09/13 16:52:51.943 +08:00] [WARN] [endpoint.rs:527] [error-response] [err="Region error (will back off and retry) message: "peer is not leader for region 1676462, leader may Some(id: 1676465 store_id: 4)" not_leader { region_id: 1676462 leader { id: 1676465 store_id: 4 } }"]
[2022/09/13 16:52:51.943 +08:00] [WARN] [endpoint.rs:527] [error-response] [err="Region error (will back off and retry) message: "peer is not leader for region 1694835, leader may Some(id: 1694838 store_id: 4)" not_leader { region_id: 1694835 leader { id: 1694838 store_id: 4 } }"]

| username: xiaohetao | Original post link

There are nearly 5000 empty regions. If possible, it is recommended to merge them. Refer to this:

How to measure if there are too many empty regions in a TiDB cluster
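
Before merging, the PD parameters that control Region merge can be inspected with SHOW CONFIG (a sketch, available since v4.0):

-- schedule.max-merge-region-size, schedule.max-merge-region-keys, schedule.merge-schedule-limit
show config where type = 'pd' and name like 'schedule.%merge%';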

| username: qiuxb | Original post link

What is the main impact of empty regions on the cluster? Is it directly related to the "region is unavailable" error when executing SQL?

| username: xiaohetao | Original post link

There is some suspicion that the region information was not cleaned up properly when the node was taken offline.

  1. Confirm whether it is left over by TiFlash:
select * from information_schema.TIFLASH_TABLES where TABLE_ID='1858';

If there are results, then this table has a TiFlash replica.

  2. If there are no results, check TiKV:
  • To find the table name:
select * from information_schema.TABLES where TIDB_TABLE_ID='1858';
  • To find the region IDs of the table:
show table table_name regions;
  • To find the nodes where a region is located, based on the region ID:
select a.region_id, a.peer_id, a.store_id, b.address
from information_schema.tikv_region_peers a, information_schema.tikv_store_status b
where a.store_id=b.store_id and a.region_id='the region id found above';

| username: xiaohetao | Original post link

There should be no direct connection between empty regions and region is unavailable.

Impact of empty regions:

| username: qiuxb | Original post link

  1. Confirm whether it is left over by TiFlash:
select * from information_schema.TIFLASH_TABLES where TABLE_ID='1858';

If there are results, then this table has a TiFlash replica.

This query returns results, and the data currently exists in both TiKV and TiFlash. The TiFlash node was previously taken offline only for replacement (due to a physical machine failure, a new node was scaled out and the faulty machine was then reclaimed). The TiFlash node still exists and is providing service.
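
Since the table does have a TiFlash replica, its availability and sync progress can also be checked (a sketch):

-- AVAILABLE = 1 and PROGRESS = 1 mean the replica is fully synced; an unavailable or
-- partially synced replica is one possible source of region errors on TiFlash reads.
select TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
from information_schema.TIFLASH_REPLICA where TABLE_ID = 1858;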

| username: xiaohetao | Original post link

  1. Confirm whether the current SQL execution on this table uses TiFlash or TiKV.
  2. Collect statistics for the table.
  3. Execute the SQL again, check the execution plan, and see whether it still reports an error.

Another possibility is:
There are too many small regions. When PD schedules, the region does exist, but by the time you execute the SQL, that small region has already been merged, so the previous region can no longer be found.

| username: qiuxb | Original post link

The execution plan shows that both TiFlash and TiKV are used. The table is a partitioned table with partitions spanning 3 years, partitioned by day, with around 4 million rows of data per day. Collecting statistics for such a large table would be very slow, right? I haven’t dared to manually collect statistics before.
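
If collecting statistics on the whole table is too costly, one option is to analyze only the partitions the query actually touches (a sketch; the table and partition names are placeholders):

-- Analyze a single daily partition instead of the whole three-year table.
analyze table your_partitioned_table partition p20220911;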

| username: qiuxb | Original post link

It was found that the memory of the TiFlash host is also faulty. The plan is to first expand a TiFlash node and then take the current node offline.