TiDB Cluster Suddenly Unavailable

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb集群突然不可用

| username: 苏半生Su

【TiDB Usage Environment】Production Environment
【TiDB Version】v5.4.0
【Reproduction Path】
【Encountered Problem: Phenomenon and Impact】
The business side suddenly reported a fault: the TiDB cluster was unavailable. Checking with tiup cluster display showed that one TiKV node was disconnected. Restarting this node did not resolve the issue, but it recovered after 10 minutes.

KV node log information:
[2023/12/20 12:48:21.858 +08:00] [WARN] [gc_worker.rs:606] ["GcKeys fail"] [err="Error(Other("[src/server/gc_worker/gc_worker.rs:341]: [components/raftstore/src/coprocessor/region_info_accessor.rs:622]: failed to send request to region collector: channel has been closed"))"]
[2023/12/20 12:48:21.858 +08:00] [WARN] [gc_worker.rs:606] ["GcKeys fail"] [err="Error(Other("[src/server/gc_worker/gc_worker.rs:341]: [components/raftstore/src/coprocessor/region_info_accessor.rs:622]: failed to send request to region collector: channel has been closed"))"]
[2023/12/20 12:48:21.858 +08:00] [WARN] [gc_worker.rs:606] ["GcKeys fail"] [err="Error(Other("[src/server/gc_worker/gc_worker.rs:341]: [components/raftstore/src/coprocessor/region_info_accessor.rs:622]: failed to send request to region collector: channel has been closed"))"]
[2023/12/20 12:48:21.858 +08:00] [WARN] [gc_worker.rs:606] ["GcKeys fail"] [err="Error(Other("[src/server/gc_worker/gc_worker.rs:341]: [components/raftstore/src/coprocessor/region_info_accessor.rs:622]: failed to send request to region collector: channel has been closed"))"]
[2023/12/20 12:48:21.858 +08:00] [WARN] [gc_worker.rs:606] ["GcKeys fail"] [err="Error(Other("[src/server/gc_worker/gc_worker.rs:341]: [components/raftstore/src/coprocessor/region_info_accessor.rs:622]: failed to send request to region collector: channel has been closed"))"]
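
When the same warning repeats many times, it can help to collapse the log into counts per message before reading it. The following is a minimal sketch, not an official tool: it assumes each line starts with the bracketed timestamp, level, source location, and quoted message seen in the excerpt above, and the names `LINE_RE` and `summarize` are made up for illustration. It accepts both straight and curly quotes because forum software often rewrites them.

```python
import re
from collections import Counter

# Header of a log line: [timestamp] [LEVEL] [file:line] ["message"] ...
# Straight and curly double quotes are both accepted around the message.
LINE_RE = re.compile(
    r'^\[(?P<ts>[^\]]+)\] \[(?P<level>[A-Z]+)\] \[(?P<src>[^\]]+)\] '
    r'\[["“](?P<msg>[^"”]*)["”]\]'
)


def summarize(lines):
    """Count (level, source, message) triples; lines that do not match are skipped."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            counts[(m["level"], m["src"], m["msg"])] += 1
    return counts


# One line copied from the post above, repeated five times as in the excerpt.
line = r'''[2023/12/20 12:48:21.858 +08:00] [WARN] [gc_worker.rs:606] ["GcKeys fail"] [err="Error(Other("[src/server/gc_worker/gc_worker.rs:341]: [components/raftstore/src/coprocessor/region_info_accessor.rs:622]: failed to send request to region collector: channel has been closed"))"]'''
for key, n in summarize([line] * 5).most_common():
    print(n, *key)
```

To run it over a real file, pass an open file object to `summarize` instead of the sample list; the path to the TiKV log depends on the deployment.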

| username: 像风一样的男子

Is it continuously unavailable or just a momentary connection issue?

| username: 春风十里

Do the logs contain any messages about automatic recovery?

| username: 普罗米修斯

Check the machine's network monitoring for the period when the node was disconnected.

| username: Sunward

Check if there are any abnormal operations on that machine.

| username: 路在何chu

Is there anything relevant in the system logs?

| username: TIDB-Learner

How many TiKV instances are there? How many replicas are configured? This situation seems to be related to PD network connection issues or an imbalanced Region distribution.

| username: 江湖故人

Well, normally a single node issue shouldn’t cause a business outage.
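
The reasoning behind this expectation is the Raft majority rule: TiKV replicates each Region with Raft, and a Region stays available as long as a majority of its replicas is reachable. A small sketch of the arithmetic follows; the function names are made up for illustration, and the replica counts are examples only (3 is the default, but this cluster's setting is not stated in the thread).

```python
def raft_quorum(replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can be lost while a majority still remains."""
    return replicas - raft_quorum(replicas)


for n in (3, 5):
    print(f"replicas={n} quorum={raft_quorum(n)} tolerated={tolerated_failures(n)}")
# replicas=3 quorum=2 tolerated=1
# replicas=5 quorum=3 tolerated=2
```

With 3 replicas, losing a single store still leaves 2 of 3 replicas for every Region, which is why one unreachable TiKV node alone would not normally make the whole cluster unavailable.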

| username: andone

How many TiKV nodes?

| username: Inkjade

  1. Check the network status to see if there are any network issues.
  2. Inspect the logs and gather cluster information.
  3. Check for any specific abnormal information.
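
As a rough illustration of the first step, a TCP reachability probe can be run from another machine in the cluster toward the affected TiKV node. This is only a sketch: `can_connect` is a made-up helper, the address in the usage comment is hypothetical, and 20160 is TiKV's default service port, which a given deployment may have changed. A successful connection shows only that the port is reachable at that moment; it does not rule out packet loss or intermittent drops.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Usage (hypothetical address; take the real host and port from the tiup output):
#   can_connect("10.0.0.1", 20160)
```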

| username: tidb菜鸟一只

With so many nodes, does the cluster really become unusable when just one goes down?

| username: TIDB-Learner

When a TiDB cluster hits a performance bottleneck, should you upgrade the existing machines or add more instances? You need to find a balance based on the situation. Judging only from the screenshot, I personally feel that the original poster has too many TiKV nodes. This might make operations more troublesome and issues harder to pinpoint.

| username: Billmay表妹

According to user feedback, the issue has been identified.

| username: zxgaa

The failure of just one node shouldn’t affect the entire cluster. Is it possible that all the leaders are concentrated on this node?
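
One way to test this hypothesis is to compare leader counts across stores. The sketch below assumes a payload shaped like the response of PD's stores endpoint (`/pd/api/v1/stores`), with a `store.address` and a `status.leader_count` per entry; check the field names against your PD version. The sample data and the `leader_share` helper are made up for illustration and are not taken from this cluster.

```python
def leader_share(stores_payload: dict) -> dict:
    """Map each store address to its fraction of all Region leaders."""
    counts = {
        s["store"]["address"]: s.get("status", {}).get("leader_count", 0)
        for s in stores_payload.get("stores", [])
    }
    total = sum(counts.values())
    return {addr: (n / total if total else 0.0) for addr, n in counts.items()}


# Made-up payload in the assumed layout; not real data from the thread.
sample = {"stores": [
    {"store": {"address": "tikv-a:20160"}, "status": {"leader_count": 900}},
    {"store": {"address": "tikv-b:20160"}, "status": {"leader_count": 50}},
    {"store": {"address": "tikv-c:20160"}, "status": {"leader_count": 50}},
]}

for addr, share in sorted(leader_share(sample).items(), key=lambda kv: -kv[1]):
    print(f"{addr}: {share:.0%}")
```

A store holding a disproportionate share of leaders, as `tikv-a` does in the made-up sample, would make its loss far more disruptive until leaders are re-elected elsewhere.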

| username: 随缘天空

You have dozens of TiKV nodes, so the failure of a single node shouldn’t affect the cluster service. Did you restart the cluster or perform any other special operations?

| username: oceanzhang

Is it a network issue? Or is the I/O response too slow?

| username: 路在何chu

With so many nodes, it seems better to reduce the node count and give each TiKV node a higher hardware configuration.

| username: 苏半生Su

It has recovered. The cause was a network issue.

| username: dba远航

Check the connection status of the host.