ERROR 9005 (HY000): Region is unavailable

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: ERROR 9005 (HY000): Region is unavailable

| username: Qiao

Cluster topology


Currently testing a three-node cluster. After manually taking one node offline (shutting down the server), the database becomes unusable.
Error: ERROR 9005 (HY000): Region is unavailable
Why isn't high availability achieved when one node fails?

| username: Minorli-PingCAP | Original post link

I see that PD and TiKV are deployed together on the same servers. Shutting down a server takes more than just TiKV offline, which is why you might see the error above. For example, if the PD leader goes offline and that server also hosts TiKV region leaders, it will take some time to recover.

| username: xfworld | Original post link

  1. Is the configured number of replicas 3?

  2. Has the PD leader gone down?

If the replica count is not 3, or the PD leader has gone down, the above errors can occur…


For availability testing, it is recommended to deploy each component as a single instance on its own server.
If this is just to get a feel for the product, the current mixed deployment is fine.

| username: 考试没答案 | Original post link

Hello: Is it always unavailable? Or is it only unavailable for a short period of time?

| username: Qiao | Original post link

The default number of replicas is 3.

| username: Qiao | Original post link

The leader did not crash.

| username: Qiao | Original post link

It has been unavailable the whole time. This is the region distribution.

| username: Qiao | Original post link

If the node that went offline is not the leader, does TiKV need to resynchronize on its own before it can recover from this error?

| username: h5n1 | Original post link

Why does tiup show the 192.x network segment while tikv_store_status shows the 10.x segment? If it keeps reporting “Region is unavailable,” that almost certainly means a majority of the region's replicas have failed. What operations were performed?

| username: Qiao | Original post link

No, the two screenshots are not from the same cluster, but the same tests were conducted, and the same situation occurred.

| username: h5n1 | Original post link

Find a small table that reports this error, use SHOW TABLE xxx REGIONS to find the region_id, then use pd-ctl region xxx to check the output. Let's take the 10.x network segment cluster as an example.
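
A minimal sketch of those two steps, assuming a hypothetical table t1 in database test, with <tidb-host>, <pd-address>, <cluster-version>, and <region-id> as placeholders to fill in:

```shell
# Find the Region(s) backing the problem table (hypothetical table test.t1).
mysql -h <tidb-host> -P 4000 -u root -D test -e "SHOW TABLE t1 REGIONS\G"

# Inspect that Region's peers and leader with pd-ctl (invoked here via tiup ctl).
tiup ctl:<cluster-version> pd -u http://<pd-address>:2379 region <region-id>
```

The pd-ctl output lists the Region's peers with their store IDs, which can be matched against the store on the offline node.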

| username: Qiao | Original post link

It seems the replica is on the disconnected machine. Will this situation recover automatically?

| username: h5n1 | Original post link

Why is there only one replica? Have you adjusted the number of replicas? Use pd-ctl config show to check; max-replicas should be >= 3. Also check the leader and region panels for TiKV on the Grafana Overview dashboard.
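
For reference, a sketch of that check, using the same placeholders as above:

```shell
# Show the replication section of the PD config; max-replicas is 3 by default.
tiup ctl:<cluster-version> pd -u http://<pd-address>:2379 config show replication
```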

| username: xfworld | Original post link

It still looks like an environment issue… We need to check what operations were performed that caused this.

| username: Qiao | Original post link

I found the problem. I changed the replica count to 1. If I change it to 3, this issue should not occur, right?

| username: h5n1 | Original post link

It depends on whether your data has been lost. If not, start the TiKV node and, once you change the setting back to 3 replicas, the cluster will rebalance automatically. If the data is lost, you'll have to redo it.
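
Assuming the data is intact, changing back to 3 replicas might look like this (same placeholders as above; a sketch, not the exact commands used in this thread):

```shell
# Raise the replica count back to 3; PD then schedules the missing replicas
# onto healthy stores and the cluster rebalances on its own.
tiup ctl:<cluster-version> pd -u http://<pd-address>:2379 config set max-replicas 3
```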

| username: Qiao | Original post link

Okay, thank you.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.