After Upgrading the TiDB Cluster, TiDB Keeps Reporting Error: write: connection reset by peer

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 升级TIDB集群后tidb一直报错 write: connection reset by peer

| username: jaybing926

[TiDB Usage Environment] Production Environment / Testing / Poc
[TiDB Version]
[Reproduction Path] What operations were performed to encounter the issue
[Encountered Issue: Problem Phenomenon and Impact]
I upgraded the cluster from v4.0.9 to v5.4.3. Since the upgrade, the TiDB log has been reporting a large number of errors like these:

[stack="github.com/pingcap/tidb/parser/terror.Log\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/parser/terror/terror.go:307\ngithub.com/pingcap/tidb/server.(*Server).onConn\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/server.go:516"]
[2023/04/21 11:06:12.400 +08:00] [ERROR] [terror.go:307] ["encountered error"] [error="write tcp 192.168.241.72:4000->192.168.241.55:21118: write: connection reset by peer"] [stack="github.com/pingcap/tidb/parser/terror.Log\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/parser/terror/terror.go:307\ngithub.com/pingcap/tidb/server.(*Server).onConn\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/server.go:516"]
[2023/04/21 11:06:12.444 +08:00] [ERROR] [terror.go:307] ["encountered error"] [error="write tcp 192.168.241.72:4000->192.168.241.55:21123: write: connection reset by peer"] [stack="github.com/pingcap/tidb/parser/terror.Log\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/parser/terror/terror.go:307\ngithub.com/pingcap/tidb/server.(*Server).onConn\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/server.go:516"]
[2023/04/21 11:06:12.507 +08:00] [ERROR] [terror.go:307] ["encountered error"] [error="write tcp 192.168.241.72:4000->192.168.241.54:40415: write: connection reset by peer"] [stack="github.com/pingcap/tidb/parser/terror.Log\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/parser/terror/terror.go:307\ngithub.com/pingcap/tidb/server.(*Server).onConn\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/server.go:516"]
[2023/04/21 11:06:12.519 +08:00] [ERROR] [terror.go:307] ["encountered error"] [error="write tcp 192.168.241.72:4000->192.168.241.54:40416: write: connection reset by peer"] [stack="github.com/pingcap/tidb/parser/terror.Log\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/parser/terror/terror.go:307\ngithub.com/pingcap/tidb/server.(*Server).onConn\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/server.go:516"]
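
To see which hosts the resets come from, the peer address can be pulled out of each error line and counted. The following is a minimal sketch, assuming the log format shown in the lines above; the function name `count_reset_peers` is made up for illustration.

```python
import re
from collections import Counter

# Matches the error field of the TiDB log lines shown above and
# captures the remote (peer) IP address after the "->".
RESET_RE = re.compile(
    r'error="write tcp [\d.]+:\d+->([\d.]+):\d+: write: connection reset by peer"'
)


def count_reset_peers(lines):
    """Count "connection reset by peer" errors per remote IP address."""
    peers = Counter()
    for line in lines:
        match = RESET_RE.search(line)
        if match:
            peers[match.group(1)] += 1
    return peers
```

It could be used as `count_reset_peers(open("tidb.log"))`, where `tidb.log` stands for wherever the TiDB log is stored. If every peer turns out to be a load balancer host, the health check is a likely source of the resets.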

I searched AskTUG and found many similar reports. Most people resolved the problem by disabling the load balancer’s health check, but I could not find a definitive solution. Some experts on AskTUG suggested changing the port that HAProxy uses for health checks, but they did not say how to do it, and the official documentation does not cover the corresponding HAProxy configuration either.

I did not see this issue on v4.0.9; it only appeared after upgrading to v5.4.3. Is this a bug, and has it been fixed in v6.5.1? I ask because v6.5.1 is my target version.

Below is my HAProxy configuration file, which is based on the official example configuration:

# cat /etc/haproxy/haproxy.cfg
global                                     
   log         127.0.0.1 local2            
   chroot      /var/lib/haproxy            
   pidfile     /var/run/haproxy.pid        
   maxconn     4000                        
   user        haproxy                     
   group       haproxy                     
   nbproc      10                          
   daemon                                  
   stats socket /var/lib/haproxy/stats     

defaults                                   
   log global                              
   retries 2                               
   timeout connect  2s                     
   timeout client 30000s                   
   timeout server 30000s                   

listen admin_stats                         
   bind 192.168.241.54:18080                       
   mode http                               
   option httplog                          
   maxconn 10                              
   stats refresh 30s                       
   stats uri /haproxy                      
   stats realm HAProxy                     
   stats auth admin:UXnxFu5Mxxxxxxxxxxxx
   stats hide-version                      
   stats  admin if TRUE                    

listen tidb-xxxxx
   bind 0.0.0.0:14000
   mode tcp                                
   balance leastconn                       
   server tidb-71 192.168.241.71:4000 send-proxy  check inter 2000 rise 2 fall 3
   server tidb-72 192.168.241.72:4000 send-proxy  check inter 2000 rise 2 fall 3
   server tidb-73 192.168.241.73:4000 send-proxy  check inter 2000 rise 2 fall 3

| username: xingzhenxiang

You can upgrade directly from v4.0.9 to v6.5.1.

| username: jaybing926

I know it can be upgraded directly. Has this issue been resolved in v6.5.1?

| username: xingzhenxiang

I checked my logs, and there is no such error.

| username: jaybing926

An expert’s reply in this post solved my problem. It quotes the HAProxy documentation and gives an example:
Using the “port” parameter, it becomes possible to use a different port to send health-checks. On some servers, it may be desirable to dedicate a port to a specific component able to perform complex tests which are more suitable to health-checks than the application. It is common to run a simple script in inetd for instance. This parameter is ignored if the “check” parameter is not set. See also the “addr” parameter.

server tidb-1 192.168.0.1:4000 port 10080 check

My configuration change: I added the port 10080 parameter to each server line so that the health check goes to port 10080 instead of port 4000:

   server tidb-71 192.168.241.71:4000 send-proxy  check port 10080 inter 2000 rise 2 fall 3
   server tidb-72 192.168.241.72:4000 send-proxy  check port 10080 inter 2000 rise 2 fall 3
   server tidb-73 192.168.241.73:4000 send-proxy  check port 10080 inter 2000 rise 2 fall 3

The HAProxy documentation describes how to specify the health-check port:
http://docs.haproxy.org/1.7/configuration.html#5.2-port
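
Before pointing the health check at port 10080, it may be worth confirming that the port accepts connections on every TiDB instance. By default, an HAProxy `check` simply tries to establish a TCP connection, and the sketch below does roughly the same from Python. The helper names `tcp_check` and `check_all` are made up for illustration, and port 10080 is assumed to be the status port configured on each TiDB server.

```python
import socket


def tcp_check(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened.

    Roughly what a default HAProxy health check does: connect,
    send nothing, and close again.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat the server as down.
        return False


def check_all(hosts, port, timeout=2.0):
    """Run tcp_check against every host and return {host: bool}."""
    return {host: tcp_check(host, port, timeout) for host in hosts}
```

For the cluster in this thread, a call such as `check_all(["192.168.241.71", "192.168.241.72", "192.168.241.73"], 10080)` should report `True` for every host before the HAProxy configuration is reloaded.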
