Handshake error after adding a load balancer in front of TiDB

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb前加了loadbalance 后Handshake error错误

| username: TiDB_C罗

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version]
[Reproduction Path] Added a Huawei Cloud load balancer in front of TiDB
[Encountered Problem: Problem Phenomenon and Impact]
After adding Huawei Cloud’s load balancer, the Handshake error metric increased. What is the reason?

| username: Billmay表妹

A “Handshake error” usually means that the connection handshake between the client and the TiDB server did not complete. When you add a load balancer in front of TiDB, a misconfiguration in the load balancer can cause connections to fail at this stage.

Here are possible causes and solutions for the “Handshake error” message:

  1. Load Balancer Misconfiguration: Ensure the load balancer is correctly configured to forward traffic to the correct TiDB instance and port. You can refer to the TiDB documentation [1] for more information on how to configure the load balancer.
  2. Firewall or Network Issues: Check whether firewall rules or network issues are blocking the connection between the client, the load balancer, and the TiDB server. Pinging the TiDB instance only confirms basic reachability, so also test a TCP connection to the TiDB service port (4000 by default).
  3. Client Misconfiguration: Ensure the client is configured to connect to the load balancer instead of directly connecting to the TiDB instance. You can refer to the TiDB documentation [2] for more information on how to configure the TiDB client.
  4. TiDB Misconfiguration: Ensure the TiDB instance is configured to accept connections from the load balancer. You can refer to the TiDB documentation [3] for more information on how to configure TiDB.
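
To check the first two items from a client machine, one option is to read the greeting that a MySQL-protocol server sends right after the TCP connection opens, once through the load balancer and once directly. The sketch below is an illustration only: `read_greeting` and `_recv_exact` are hypothetical helpers, and the parsing assumes the standard protocol version 10 greeting (a 4-byte packet header, one protocol-version byte, then a NUL-terminated version string).

```python
import socket


def _recv_exact(sock, n):
    """Read exactly n bytes or raise if the peer closes early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("connection closed before the greeting was complete")
        buf += chunk
    return buf


def read_greeting(host, port, timeout=3.0):
    """Return (protocol_version, server_version) from the server's first packet.

    Hypothetical diagnostic helper: it only reads the greeting and then
    disconnects; it never authenticates.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        header = _recv_exact(sock, 4)                   # 3-byte length + 1-byte sequence id
        length = int.from_bytes(header[:3], "little")
        payload = _recv_exact(sock, length)
    if not payload or payload[0] == 0xFF:               # 0xFF marks an error packet
        raise ConnectionError("server did not send a normal greeting")
    end = payload.index(b"\x00", 1)                     # version string is NUL-terminated
    return payload[0], payload[1:end].decode("ascii", errors="replace")
```

If the call succeeds against the TiDB address but fails or times out against the load balancer address, the forwarding configuration is the more likely culprit. Note that this probe also disconnects before authenticating, so it would probably be counted as a handshake error as well.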

If the above solutions do not work, please provide more information about your environment and the error message, including the full error message and any relevant logs.

| username: redgame

Please ensure that the load balancer is configured correctly and routes traffic properly to the TiDB instances.

| username: 像风一样的男子

If you are sure the configuration is correct, try turning off the health check.

| username: TiDB_C罗

The ELB TCP health check works as follows:

  1. The ELB node sends a TCP SYN packet to the backend server (IP + health check port) based on the health check configuration.
  2. If the backend server is listening on that port, it replies with a SYN+ACK packet.
    • If the ELB node does not receive the SYN+ACK packet within the timeout period, the health check is considered failed, and the node sends an RST packet to the backend server to terminate the TCP connection.
    • If the ELB node receives the SYN+ACK packet within the timeout period, it sends an ACK to the backend server, considers the health check successful, and then sends an RST packet to terminate the TCP connection.

So every health check ends with an RST packet that resets the connection. If this metric in TiDB is stable, can it be ignored?
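
The health check sequence above can be reproduced with a few lines of Python, which may help confirm that the probes are what moves the metric. This is a minimal sketch, not Huawei Cloud's implementation: `tcp_probe` is a hypothetical helper, and setting `SO_LINGER` to a zero timeout is a common way to make `close()` send an RST instead of a FIN on Linux and macOS. Because the MySQL protocol has the server send its greeting first and then wait for the client's reply, a probe that resets right after the TCP handshake never completes the MySQL handshake, which is presumably what TiDB counts as a handshake error.

```python
import socket
import struct


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Mimic a TCP health check: complete the TCP handshake, then reset.

    Hypothetical helper for illustration. Returns True if the port accepted
    the connection within the timeout, False otherwise.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, unreachable, or timed out: the check fails.
        return False
    # l_onoff=1, l_linger=0: close() aborts the connection with an RST
    # instead of the normal FIN sequence (assumes Linux/macOS struct layout).
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    sock.close()
    return True
```

Running `tcp_probe("<tidb-host>", 4000)` repeatedly against a test instance (4000 is TiDB's default port; the host is a placeholder) and watching the metric should show whether it rises at the probe rate.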

| username: 像风一样的男子

My Alibaba Cloud SLB frequently gives false alarms, so I turned off the health check.

| username: TiDB_C罗

I don’t dare to turn it off, so I’ll just leave it on. Knowing that this is the cause is enough. So far, I haven’t noticed any impact on other metrics.

| username: TiDB_C罗

A TCP reset (RST) packet is typically sent in the following situations:

  1. If a client tries to establish a TCP connection to a server port that is not listening, the server immediately replies with a reset packet.
  2. If an exception occurs on either the client or the server during the interaction (such as a program crash), the operating system on that side sends a TCP reset packet to the peer, telling it to release the connection.
  3. If a host receives a TCP packet that does not belong to any connection in its established-connection list, it immediately replies with a reset packet.
  4. If one side does not receive an acknowledgment from the peer for a long time, it sends a reset packet to release the connection once a retransmission count or time limit is exceeded.
  5. Some application developers deliberately use reset packets to quickly release TCP connections that have finished exchanging data, which improves the efficiency of business interactions.

The ELB health check belongs to the fifth situation.
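
The difference between a normal close and the deliberate reset in the fifth case can be observed on a loopback connection. The following sketch assumes Linux or macOS socket behavior; `observe_close` is a hypothetical helper, not part of TiDB or ELB. It closes a client socket either gracefully or with `SO_LINGER` set to zero and reports what the accepting side sees.

```python
import socket
import struct


def observe_close(abortive: bool) -> str:
    """Return "FIN" or "RST" depending on how the server saw the client leave."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))        # any free loopback port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname(), timeout=2.0)
    server_side, _ = listener.accept()
    server_side.settimeout(2.0)
    try:
        if abortive:
            # Zero linger time: close() sends an RST instead of a FIN.
            client.setsockopt(
                socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0)
            )
        client.close()
        try:
            data = server_side.recv(1)
        except ConnectionResetError:
            return "RST"                   # the peer aborted the connection
        return "FIN" if data == b"" else "DATA"   # empty read means orderly shutdown
    finally:
        server_side.close()
        listener.close()
```

A server that treats a reset as an error will count each such probe, while an orderly close is usually just logged as a disconnect.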

| username: cassblanca

Session persistence issue, right?

| username: ShawnYan

It could also be an issue of protocol compatibility.