tidb-server Issues in TiDB 7.1.0: TSO Timeout, TiKV Connection Timeout, High CPU Load, and More

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb 7.1.0版本tidb-server出现获取tso超时、连接tikv超时、cpu负载过高等问题

| username: TiDBer_yyy

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] 7.1.0
[Reproduction Path] Not reproduced
[Encountered Problem: Problem Phenomenon and Impact]
The TiDB node logged many errors, and a panic also occurred. Please help me troubleshoot the issue.

[2023/12/25 10:41:58.622 +08:00] [WARN] [pd.go:152] ["get timestamp too slow"] ["cost time"=50.144859ms]

[2023/12/25 10:42:20.048 +08:00] [ERROR] [tso_dispatcher.go:178] ["[tso] tso request is canceled due to timeout"] [dc-location=global] [error="[PD:client:ErrClientGetTSOTimeout]get TSO timeout"]
[2023/12/25 10:42:20.049 +08:00] [ERROR] [tso_dispatcher.go:453] ["[tso] getTS error"] [dc-location=global] [stream-addr=http://127.0.1.151:2379] [error="[PD:client:ErrClientGetTSO]rpc error: code = Canceled desc = context canceled: rpc error: code = Canceled desc = context canceled"]

[2023/12/25 10:42:20.152 +08:00] [WARN] [server.go:644] ["Server.onConn handshake"] [conn=1338767557106020929] [error="[server:8052]invalid sequence 0 != 1"] ["remote addr"=127.0.1.64:47920]
[2023/12/25 10:42:20.153 +08:00] [WARN] [server.go:644] ["Server.onConn handshake"] [conn=1338767557106020955] [error="[server:8052]invalid sequence 0 != 1"] ["remote addr"=127.0.1.64:48056]
[2023/12/25 10:42:20.152 +08:00] [WARN] [server.go:644] ["Server.onConn handshake"] [conn=1338767557106020871] [error="[server:8052]invalid sequence 0 != 1"] ["remote addr"=127.0.1.64:47632]
[2023/12/25 10:42:20.153 +08:00] [WARN] [server.go:644] ["Server.onConn handshake"] [conn=1338767557106020941] [error="[server:8052]invalid sequence 0 != 1"] ["remote addr"=127.0.1.64:48034]
[2023/12/25 10:42:20.153 +08:00] [WARN] [server.go:644] ["Server.onConn handshake"] [conn=1338767557106020937] [error="[server:8052]invalid sequence 0 != 1"] ["remote addr"=127.0.1.64:47954]
[2023/12/25 10:42:20.153 +08:00] [WARN] [server.go:644] ["Server.onConn handshake"] [conn=1338767557106020839] [error="[server:8052]invalid sequence 0 != 1"] ["remote addr"=127.0.1.64:47438]
[2023/12/25 10:42:20.153 +08:00] [WARN] [server.go:644] ["Server.onConn handshake"] [conn=1338767557106020949] [error="[server:8052]invalid sequence 0 != 1"] ["remote addr"=127.0.1.82:57506]
[2023/12/25 10:42:20.153 +08:00] [WARN] [server.go:644] ["Server.onConn handshake"] [conn=1338767557106020983] [error="[server:8052]invalid sequence 0 != 1"] ["remote addr"=127.0.1.64:48164]
[2023/12/25 10:42:20.153 +08:00] [WARN] [server.go:644] ["Server.onConn handshake"] [conn=1338767557106020945] [error="[server:8052]invalid sequence 0 != 1"] ["remote addr"=127.0.1.64:48044]
[2023/12/25 10:42:20.152 +08:00] [WARN] [server.go:644] ["Server.onConn handshake"] [conn=1338767557106020923] [error="[server:8052]invalid sequence 0 != 1"] ["remote addr"=127.0.1.64:47898]
[2023/12/25 10:42:20.153 +08:00] [WARN] [server.go:644] ["Server.onConn handshake"] [conn=1338767557106020861] [error="[server:8052]invalid sequence 0 != 1"] ["remote addr"=127.0.1.82:57118]

Subsequently, there was an alert for tidb-server panic.
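
To see how these messages cluster, the excerpt above can be tallied with a short script. This is only a minimal sketch, not an official TiDB tool: the regular expressions assume the `[time] [LEVEL] [file:line] ["message"]` layout seen in the lines above, the names `summarize_log` and `SAMPLE` are made up for illustration, and only `cost time` values printed in `ms` are parsed.

```python
import re
from collections import Counter

# Header layout assumed from the excerpt: [time] [LEVEL] [file:line] ["message"]
LINE_RE = re.compile(
    r'^\[(?P<ts>[^\]]+)\] \[(?P<level>[A-Z]+)\] \[(?P<src>[^\]]+)\] '
    r'\[(?P<msg>"[^"]*"|[^\]]*)\]'
)
# Only durations printed in milliseconds are handled in this sketch.
COST_RE = re.compile(r'\["cost time"=(?P<ms>[0-9.]+)ms\]')

# A few lines copied verbatim from the log excerpt in this post.
SAMPLE = [
    '[2023/12/25 10:41:58.622 +08:00] [WARN] [pd.go:152] ["get timestamp too slow"] ["cost time"=50.144859ms]',
    '[2023/12/25 10:42:20.048 +08:00] [ERROR] [tso_dispatcher.go:178] ["[tso] tso request is canceled due to timeout"] [dc-location=global] [error="[PD:client:ErrClientGetTSOTimeout]get TSO timeout"]',
    '[2023/12/25 10:42:20.152 +08:00] [WARN] [server.go:644] ["Server.onConn handshake"] [conn=1338767557106020929] [error="[server:8052]invalid sequence 0 != 1"] ["remote addr"=127.0.1.64:47920]',
    '[2023/12/25 10:42:20.153 +08:00] [WARN] [server.go:644] ["Server.onConn handshake"] [conn=1338767557106020955] [error="[server:8052]invalid sequence 0 != 1"] ["remote addr"=127.0.1.64:48056]',
]


def summarize_log(lines):
    """Count (level, message) pairs and collect slow-TSO costs in ms."""
    counts = Counter()
    slow_tso_ms = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not follow the assumed layout
        msg = m.group("msg").strip('"')
        counts[(m.group("level"), msg)] += 1
        if msg == "get timestamp too slow":
            cost = COST_RE.search(line)
            if cost:
                slow_tso_ms.append(float(cost.group("ms")))
    return counts, slow_tso_ms


if __name__ == "__main__":
    counts, slow_tso_ms = summarize_log(SAMPLE)
    for (level, msg), n in counts.most_common():
        print(level, n, msg)
    print("slow TSO costs (ms):", slow_tso_ms)
```

Running the same function over the full tidb.log (for example `summarize_log(open(path))`, where `path` is wherever the log is stored) should show whether the slow-TSO warnings started before the handshake errors or at the same moment.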

| username: xfworld

Could you provide more context?

| username: TiDBer_yyy

The problems include a tidb-server panic and timeout errors when connecting to PD. I would like to find the root cause.

During the incident, the ping latency to the PD node was around 4 ms, as shown in the figure:

| username: dba远航

Check whether PD is functioning properly, and check the network between TiDB and PD.

| username: oceanzhang

  1. Check if the network is connected and if there are latency issues.
  2. Check the performance and resource usage of the PD nodes.
  3. Check if there are any anomalies in the PD node logs.
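
The first check can be approximated from the TiDB host with a short script. This is a rough sketch under stated assumptions: it only times plain TCP connects to the PD client port, which is not the same as TSO latency, and the function names `tcp_connect_latency_ms` and `summarize_latency` are hypothetical. The address `127.0.1.151:2379` is the `stream-addr` from the error log in the first post; replace it with your own PD endpoint.

```python
import socket
import statistics
import time


def tcp_connect_latency_ms(host, port, attempts=5, timeout=2.0):
    """Time TCP connects in milliseconds; failed attempts are recorded as None."""
    samples = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                samples.append((time.perf_counter() - start) * 1000.0)
        except OSError:
            # Covers refused connections, timeouts, and unreachable hosts.
            samples.append(None)
    return samples


def summarize_latency(samples):
    """Reduce the samples to attempt/failure counts plus median and max latency."""
    ok = [s for s in samples if s is not None]
    return {
        "attempts": len(samples),
        "failures": len(samples) - len(ok),
        "median_ms": statistics.median(ok) if ok else None,
        "max_ms": max(ok) if ok else None,
    }


if __name__ == "__main__":
    # PD address taken from the stream-addr field in the log above.
    samples = tcp_connect_latency_ms("127.0.1.151", 2379)
    print(summarize_latency(samples))
```

Running it in a loop during a busy period and comparing `max_ms` with the timestamps of the "get timestamp too slow" warnings may help separate network spikes from PD-side delays.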

| username: ffeenn

The logs in the first image show that the connection to PD was interrupted partway through the TSO transmission. Start at the network layer and determine whether the network anomaly has a physical cause or a software cause.

| username: tidb菜鸟一只

Check the resource usage of the PD leader.

| username: TiDBer_yyy

The CPU was relatively idle during the incident.