Issues with TiDB Cluster Network Disconnection Testing

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIDB集群断网测试问题 (TiDB cluster network disconnection test issue)

| username: lutong

Three servers (201, 202, 203) form a TiDB cluster, with TiDB/PD/TiKV/TiFlash deployed on each server. To test the cluster's high availability, a network disconnection test was conducted (one way to simulate such a disconnection is sketched after the timeline):

  • 2022-08-31 22:36: Disconnected the network of server 202 for testing. After connecting with the database connection tool, the query reported “Region is unavailable” (this error was reported when querying through haproxy’s 3390, single node 201:4000, and 203:4000 ports).
  • 2022-08-31 23:40: Disconnected the network of server 201 for testing. The database connection tool query was normal (queries through haproxy’s 3390, single node 202:4000, and 203:4000 ports were all normal).
  • 2022-08-31 23:53: Disconnected the network of server 203 for testing. The database connection tool query was normal (queries through haproxy’s 3390 and single node 4000 ports were all normal).
  • 2022-09-01 00:21: Reconnected the network of all servers and reviewed the logs as a whole. The database connection tool query was normal (queries through haproxy’s 3390 and single-node 4000 ports were all normal). Attached are the logs of each server: Server Logs 20220901.rar (1.5 MB)
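
For reference, a minimal sketch of one way to simulate this kind of per-node disconnection on Linux, assuming iptables is available (the post does not say how the network was actually cut, and the peer IPs below are placeholders):

# On the node under test (e.g. 202), drop all traffic to and from its peers:
iptables -A INPUT -s 192.168.0.201 -j DROP
iptables -A INPUT -s 192.168.0.203 -j DROP
iptables -A OUTPUT -d 192.168.0.201 -j DROP
iptables -A OUTPUT -d 192.168.0.203 -j DROP
# To restore connectivity, delete the same rules:
iptables -D INPUT -s 192.168.0.201 -j DROP
iptables -D INPUT -s 192.168.0.203 -j DROP
iptables -D OUTPUT -d 192.168.0.201 -j DROP
iptables -D OUTPUT -d 192.168.0.203 -j DROP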

Upgraded to v5.4.2 as suggested by @小王同学Plus, but the issue still persists.

| username: xiaohetao | Original post link

Node 202, the one that was disconnected, should have held the leader role for some Regions. You can check whether there were any abnormal messages before the disconnection.
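
One way to check the leader distribution is with pd-ctl; a minimal sketch, assuming a PD endpoint of 192.168.0.203:2379 (substitute a live PD address and a real store ID):

tiup ctl:v5.4.2 pd -u http://192.168.0.203:2379 store
# "leader_count" in the output shows how many Region leaders each store holds.
tiup ctl:v5.4.2 pd -u http://192.168.0.203:2379 region store 2
# Lists the Regions on a given store; the store ID 2 here is hypothetical.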

| username: xiaohetao | Original post link

[The original reply was an image and could not be translated.]

| username: xiaohetao | Original post link

The TiKV logs of node 202 were already reporting errors on the 28th and the 31st, before 22:36. This node likely had issues before the network was disconnected.

Or did node 202 actually lose its network connection at 22:33?

| username: jansu-dev | Original post link

  1. First, the network was disconnected around 22:33:00, but the leader counts on the other stores did not increase, which triggered the “Region is unavailable” error.

  2. The disappearance of the Region heartbeat report was expected due to the network disconnection.

  3. The absence of any balance-leader operator was unexpected and is what caused point 1 (see the pd-ctl sketch at the end of this post).

  4. During 202’s outage, the PD leader was on 203, so the node that was cut off was not the PD leader.

  5. Looking closely at the logs, after filtering out the noise with regular expressions, the remaining content shows that 203’s RPC failed to connect to itself, even though its network connection to itself should have been fine.

In summary, it should be this bug → Master: two tikv don't report region heartbeat after inject fault to pd leader · Issue #12934 · tikv/tikv · GitHub (this bug affects cluster behavior on v5.3 and later). After a restart, this bug leaves Region heartbeats pending, which matches the symptoms of this issue and coincides with the recovery time point.


You can try this hotfix → Release v5.4.2-20220802: pd-client: pd client should update if the grpc stream sender failed. … · tikv/tikv · GitHub and see whether the issue still reproduces.
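
To check point 3 yourself, pd-ctl can show whether PD is generating balance-leader operators; a sketch, again assuming a PD endpoint of 192.168.0.203:2379:

tiup ctl:v5.4.2 pd -u http://192.168.0.203:2379 scheduler show
# balance-leader-scheduler should appear in the list of enabled schedulers.
tiup ctl:v5.4.2 pd -u http://192.168.0.203:2379 operator show
# An empty result here during the fault window would match point 3 above.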

| username: Kongdom | Original post link

New TiDB Diagnostic Tool

| username: Kongdom | Original post link

:+1::+1::+1: Great job!

| username: xiaohetao | Original post link

:+1::+1::+1:

What is the tool you used to view the logs called?

| username: jansu-dev | Original post link

Clinic. After logging in and clicking the link provided by the original poster, you can see the data they uploaded.
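
For context, the data behind such a link is collected and uploaded with tiup's Diag component; a rough sketch (paths and flags may vary by Diag version):

tiup diag collect test-cluster
# Packages the cluster's logs and monitoring data into a local dataset directory.
tiup diag upload <dataset-dir>
# Uploads the dataset to Clinic and prints a shareable link like the one above.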

| username: xiaohetao | Original post link

:handshake::+1:

| username: lutong | Original post link

I upgraded following your method, but the problem still exists after the upgrade.

| username: jansu-dev | Original post link

Please collect another Clinic data set for the new version; my machine is not powerful enough to reproduce the issue on its own.

  1. Is it always this node that has the problem? If so, kill this node’s PD and then collect the clinic to avoid interference from other information.
  2. If only this node has the problem, you can ask the network team if there is anything special about this node.
  3. The strange part now is that killing the PD leader does not produce “Region is unavailable”, but killing a non-leader PD does. Yet monitoring shows the Regions were never rescheduled.
  4. If you are willing to try, you could also configure Labels for the stores to see whether that sidesteps the issue (this is just an experiment: Region replicas are not scheduled onto the same host by default, so the scheduling seems to be misbehaving somewhere). See the tiup sketch below.
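
For point 4, a hedged sketch of adding host-level Labels with tiup (the cluster name test-cluster is taken from the patch command quoted later in the thread, and the label values are made up):

tiup cluster edit-config test-cluster
# In the editor, give each TiKV instance a distinct host label, e.g.:
#   tikv_servers:
#     - host: 192.168.0.201
#       config:
#         server.labels: { host: "h201" }
# and tell PD to isolate replicas by that label:
#   server_configs:
#     pd:
#       replication.location-labels: ["host"]
tiup cluster reload test-cluster -R tikv,pd
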
| username: lutong | Original post link

Thank you, boss.

  1. This is the latest clinic address Download URL: https://clinic.pingcap.com.cn/portal/#/orgs/64/clusters/6846941113233660477

  2. Last night, when applying the patch, I ran: tiup cluster patch test-cluster /opt/tikv-5.4.2-20220802.tar.gz -R tikv

| username: jansu-dev | Original post link

  1. Is the cluster set to log.level == error? That makes it hard to see the components’ specific behavior :rofl:, for example when the PD leader switch completes. Some details are only visible at the info or even debug log level.

  2. Why was the previous scheduler abnormal?

  3. The patch command looks fine. You can go to the TiKV deploy directory and run ./tikv-server --version to check whether the patch took effect (a sketch follows this list).
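
For point 3, the version check might look like this; the deploy path below is an assumption, so substitute the TiKV deploy directory from your topology:

/tidb-deploy/tikv-20160/bin/tikv-server --version
# The Git hash and UTC build time in the output should match the
# v5.4.2-20220802 hotfix rather than the stock v5.4.2 binary.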

| username: lutong | Original post link

Last night, I changed the log levels of TiKV and TiDB to debug mode and conducted a network disconnection test. I have extracted the new logs. Please help analyze them: https://clinic.pingcap.com.cn/portal/#/orgs/64/clusters/6846941113233660477
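
A hedged sketch of one way such a log-level change can be made through tiup (the cluster name test-cluster is an assumption carried over from the patch command; older TiKV versions use the top-level log-level key instead of log.level):

tiup cluster edit-config test-cluster
# In the editor, set the log level for both components, e.g.:
#   server_configs:
#     tidb:
#       log.level: "debug"
#     tikv:
#       log.level: "debug"
tiup cluster reload test-cluster -R tidb,tikv
# reload rolling-restarts the affected components to pick up the change.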

| username: lutong | Original post link

[The original reply contained only an image.]

| username: jansu-dev | Original post link

  1. This time, the scheduler looks normal. By the way, do you often press Ctrl+C during testing? That can easily leave an evict-leader-scheduler behind; so can running the disconnection tests at too short an interval.

  2. PD’s internal balance-leader mechanism is actually running, but it has not generated a real balance-leader operator.

  3. First, remove the evict-leader scheduler this time (see the pd-ctl sketch at the end of this post), then check each panel to confirm there are no anomalies. Disconnect the network (one machine only) and run the statement below to check whether the issue reproduces. If it does, check again after 15 minutes; if it still reproduces, run the statement again and record the results. Finally, restore the network. After the cluster returns to normal, capture the Clinic data from 10 minutes before the disconnection through 10 minutes after recovery, along with several results of trace select ......

trace select * from XXX;

By the way, this time the clinic did not include information before the disconnection.
Each Clinic data set feels a little different :joy:; it doesn’t seem like the same issue.
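
For the evict-leader cleanup in point 3, a pd-ctl sequence along these lines may work (the PD endpoint is an assumption; check scheduler show first for the exact scheduler name):

tiup ctl:v5.4.2 pd -u http://192.168.0.203:2379 scheduler show
# If an evict-leader-scheduler is listed, remove it:
tiup ctl:v5.4.2 pd -u http://192.168.0.203:2379 scheduler remove evict-leader-scheduler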

| username: Kongdom | Original post link

Is the version information above correct? Last time, a node was patched.

| username: jansu-dev | Original post link

It doesn’t quite match. This PR was fixed on July 27, but the build time of the binary is July 06.

| username: Kongdom | Original post link

Got it, we’ll reapply the patch :handshake: