TiKV Node Unresponsive for a Short Period

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiKV节点短时间内无响应

| username: EricSong

[TiDB Usage Environment] Production Environment
[Encountered Problem: Phenomenon and Impact]
A large number of query and insert errors occurred within a short period of time:
Error querying database. Cause: java.sql.SQLException: TiKV server timeout
Monitoring shows that during this period one TiKV node's leader count dropped to 0 and its scheduler pending commands surged.
It looks as if the TiKV node crashed, but judging from memory and CPU usage, the node did not hit OOM or CPU saturation.
What could be the possible causes?
[Attachments: Screenshots/Logs/Monitoring]
Leader Drop


Scheduler pending command

CPU&Mem
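
For reference, the leader-count drop can also be confirmed straight from Prometheus rather than from the Grafana panel. A minimal sketch, assuming Prometheus listens on prometheus:9090 and using the metric the TiKV Grafana panels are based on; the address and time window below are placeholders:

# Leader count per TiKV instance; the affected store should drop to 0
curl -s 'http://prometheus:9090/api/v1/query_range' \
  --data-urlencode 'query=sum(tikv_raftstore_region_count{type="leader"}) by (instance)' \
  --data-urlencode 'start=2023-07-25T12:00:00Z' \
  --data-urlencode 'end=2023-07-25T12:40:00Z' \
  --data-urlencode 'step=30s'

A similar range query over the scheduler pending-commands metric on the TiKV-Details dashboard shows the surge on the same store.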

| username: h5n1 | Original post link

Take a look at the TiKV logs.
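
For example, something like the following on the affected node narrows it down to errors and warnings around the incident. This is only a sketch: the log path assumes a default TiUP deployment, so adjust it and the time filter to your environment (the log timestamps are in UTC):

# Errors/warnings in the TiKV log around the incident window
grep -E '\[(ERROR|WARN)\]' /tidb-deploy/tikv-20160/log/tikv.log | grep '2023/07/25 12:1'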

| username: EricSong | Original post link

The errors mainly occurred around 20:20 on July 25, similar to the following:

[2023/07/25 12:19:14.837 +00:00] [Error] [kv.rs:603] ["KvService::batch_raft send response fail"] [err=RemoteStopped]

| username: h5n1 | Original post link

What operations were performed on this TiKV beforehand? There should be quite a few error messages in the logs.

| username: EricSong | Original post link

There have been no maintenance operations on this cluster recently; all other operations are either write or query operations.

| username: EricSong | Original post link

I have only captured the monitoring and logs from the problematic time period. If you need logs or monitoring from other periods, just let me know and I'll see if I can find them.

| username: MrSylar | Original post link

How is the network situation between TiDB Server and TiKV Server?

| username: EricSong | Original post link

Judging from the monitoring of the faulty node, the network connection appears normal, but there was a sudden surge in TCP connections and network traffic during the problematic period.
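
The spike can also be cross-checked at the OS level on the TiKV host. A rough sketch (sar comes from the sysstat package; eth0 is a placeholder interface name):

# Per-interface throughput, 1-second samples (watch the rxkB/s / txkB/s columns)
sar -n DEV 1 5
# TCP retransmission counters; a jump here usually points at a saturated or lossy link
sar -n ETCP 1 5
# If sysstat history collection is enabled, the day of the incident can be replayed
sar -n DEV -f /var/log/sa/sa25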

| username: MrSylar | Original post link

Gigabit network card?

| username: h5n1 | Original post link

Take a look at the Blackbox Exporter latency monitoring.

| username: EricSong | Original post link

It looks like the ping during the problem period is also normal.

| username: EricSong | Original post link

I'd need to ask the equipment maintenance staff about that. But do you mean the network card might have been saturated by the high traffic?

| username: MrSylar | Original post link

My suspicion is that the network bandwidth was saturated, which slowed down the interaction between tidb-server and tikv-server. From the monitoring graphs above, the TCP and ping curves at that time are indeed abnormal compared with the periods before and after. Just log in to the server and check with ethtool yourself; it's quicker than waiting on someone else.
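
A rough sketch of what to run (eth0 is a placeholder; on virtual NICs some of these fields may be empty):

# Negotiated link speed, e.g. 1000Mb/s for a gigabit card
ethtool eth0 | grep -i speed
# Fallback that sometimes works when ethtool reports nothing useful
cat /sys/class/net/eth0/speed
# Driver in use (virtio_net, ixgbe, ...), which tells you whether the NIC is virtual
ethtool -i eth0 | grep -i driver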

| username: EricSong | Original post link

I tried it on my side, and because it’s a virtual network card, there is no available information.

[root@sjc-tidb-tikv8 tidb]# ethtool eth0
no stats available

The equipment maintenance team hasn't gotten back to me yet, and it feels like we've hit a dead end. Are there any other troubleshooting directions? For example, how can I tell which operation is actually saturating the bandwidth? I noticed a sharp spike in batch_get commands during the problem period. Could that be the cause?
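
One way to check whether batch_get alone could explain the traffic is to pull the per-command gRPC rate for that store from Prometheus. A sketch, assuming the metric the TiKV Grafana gRPC panels are based on; the Prometheus address, store address, and timestamp are placeholders:

# Top gRPC message types by request rate on the affected store at the incident time
curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=topk(10, sum(rate(tikv_grpc_msg_duration_seconds_count{instance="10.0.0.8:20180"}[1m])) by (type))' \
  --data-urlencode 'time=2023-07-25T12:20:00Z'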

| username: MrSylar | Original post link

"No stats available": you can still trace this down to check whether there is any issue with the underlying physical NIC (the idea being that a problem in the physical NIC underneath would affect the virtual one on top of it).
"batch_get": you could check, on the Grafana TiDB page, whether the tidb-server select operations in that time period differ significantly from the normal state. I don't think this is the right direction, though, since the CPU/memory usage is not high.
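
If you still want to rule it out, a quick way to compare against a normal baseline is to query the per-type statement rate that tidb-server exposes. A sketch, assuming the tidb_executor_statement_total metric and placeholder addresses:

# Statement rate by type at the incident time; rerun with an earlier timestamp for a baseline
curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(tidb_executor_statement_total[1m])) by (type)' \
  --data-urlencode 'time=2023-07-25T12:20:00Z'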

| username: redgame | Original post link

It looks like there is a network connection failure or delay between TiKV nodes, causing request timeouts. Please check the network configuration and connections to ensure network stability.