Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: A TiKV node frequently reports the error "failed to send extra message"
[TiDB Usage Environment] Production Environment
[TiDB Version] 5.1.2
[Reproduction Path]
The cluster started showing slow queries. Resource usage looked normal and every node was running. The logs showed that one TiKV node was frequently reporting the error "failed to send extra message", which appeared to be causing the slow queries. After restarting that TiKV node, the error still occurs, but far less often, and queries have returned to normal. What could be causing this?
Specifically, the Transport(Full) error indicates that TiKV’s internal message queue or network connection has reached its capacity limit for sending data. This could be due to several reasons, such as:
- Network Bottleneck: If the network bandwidth between TiKV nodes is limited or the network latency is high, it may cause the message queue to fill up.
- System Resource Limitation: Insufficient CPU, memory, or disk I/O resources on the server may prevent TiKV from processing or sending messages in a timely manner.
- Configuration Issues: Some TiKV configuration parameters may be set inappropriately for the workload, such as the size of the message queue or send rate limits.
- High Load: The TiKV cluster may be handling a large number of read and write requests, leading to internal message backlog.
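To make the "queue is full" idea concrete, here is a minimal Python sketch of a sender with a fixed-capacity outgoing queue. It is only a toy model, not TiKV's code (TiKV is written in Rust); the `BoundedSender` class, the `simulate` helper, and all capacities and rates are hypothetical. It shows that when messages are produced faster than they are drained, for any of the reasons listed above, non-blocking sends start to fail.

```python
import queue


class BoundedSender:
    """Toy sender with a fixed-capacity outgoing queue (hypothetical, not TiKV code)."""

    def __init__(self, capacity: int) -> None:
        self._queue: "queue.Queue[str]" = queue.Queue(maxsize=capacity)
        self.dropped = 0  # number of sends rejected because the queue was full

    def send(self, msg: str) -> bool:
        """Try to enqueue without blocking; return False if the queue is full."""
        try:
            self._queue.put_nowait(msg)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, n: int) -> int:
        """Simulate the network flushing up to n messages; return how many were flushed."""
        flushed = 0
        while flushed < n and not self._queue.empty():
            self._queue.get_nowait()
            flushed += 1
        return flushed


def simulate(capacity: int, produced_per_tick: int, drained_per_tick: int, ticks: int) -> int:
    """Run a producer and a consumer in lockstep and return the number of rejected sends."""
    sender = BoundedSender(capacity)
    for _ in range(ticks):
        for i in range(produced_per_tick):
            sender.send(f"msg-{i}")
        sender.drain(drained_per_tick)
    return sender.dropped
```

With a balanced rate, for example `simulate(4, 3, 3, 10)`, nothing is rejected. Once the drain rate falls below the produce rate, rejections begin as soon as the queue fills, which is the same pattern as an error that appears in bursts under load or a slow network.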
To resolve this issue, you can try the following steps:
- Check Network Connection: Ensure that the network connection between TiKV nodes is stable and has sufficient bandwidth.
- Monitor Server Resources: Use monitoring tools (such as Prometheus, Grafana, etc.) to check the CPU, memory, and disk usage of TiKV servers to ensure that server resources are not exhausted.
- Adjust Configuration: Adjust TiKV’s configuration parameters according to the actual situation of the cluster, such as increasing the size of the message queue, adjusting the send rate limits, etc.
- Horizontal Scaling: If the cluster load is very high, consider adding more TiKV nodes to share the load.
- Check Logs: Examine TiKV’s log files in more detail, as they may contain more information about the cause and context of the error.
- Upgrade Version: If you are using an older version of TiKV, there may be known performance issues or defects. Upgrading to the latest version may resolve these issues.
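For the "Check Logs" step, a small script can show whether the error comes in bursts or at a steady rate, which helps when comparing it against the network and resource graphs. The following is a rough sketch, assuming each log line starts with a timestamp of the form `[YYYY/MM/DD HH:MM:SS.mmm +ZZ:ZZ]`; the `errors_per_minute` function name and the regular expression are illustrative assumptions and should be adjusted to the actual log format.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed timestamp prefix: "[YYYY/MM/DD HH:MM:SS.mmm +ZZ:ZZ]". Adjust if your logs differ.
TS_MINUTE = re.compile(r"^\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}):\d{2}")


def errors_per_minute(
    lines: Iterable[str],
    needle: str = "failed to send extra message",
) -> Counter:
    """Count lines containing `needle`, grouped by the minute in their timestamp."""
    counts: Counter = Counter()
    for line in lines:
        if needle not in line:
            continue
        match = TS_MINUTE.match(line)
        # Lines without a recognizable timestamp are grouped under "unknown".
        key = match.group(1) if match else "unknown"
        counts[key] += 1
    return counts
```

The function accepts any iterable of lines, so an open file handle for the affected node's TiKV log can be passed directly; sorting the resulting items by key gives a per-minute timeline of the error.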
Is PD functioning normally?