Requirements of TiDB for Network Latency or Packet Loss

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIDB 对网络延迟or丢包的要求

| username: residentevil

【TiDB Usage Environment】Production Environment
【TiDB Version】V6.1.7
【Encountered Issue: Problem Phenomenon and Impact】TiDB uses a storage-compute separation architecture, so network latency and packet loss can significantly impact the performance of the entire cluster. It would help to have some recommended, empirically-backed values, for example: ping latency between the physical machines hosting the three components (TiDB, PD, TiKV) should be within x ms, packet loss should be below x, and so on.

| username: hey-hoho | Original post link

Reference values from experience:
Latency within a single DC should be under 0.5ms, between two DCs in the same city under 2ms, and between remote locations under 5ms. The less packet loss, the better.

| username: zhanggame1 | Original post link

It would be best to keep it within 1ms.

| username: Fly-bird | Original post link

Suggest 1ms

| username: xfworld | Original post link

Still experiencing packet loss… If a 10 Gigabit network is losing packets, it’s time to call it a day… :rofl:

| username: residentevil | Original post link

Overall it can be guaranteed within 1ms, but given that TiDB is a distributed architecture, and especially that PD and TiKV rely on the Raft protocol, the concern is that one machine with network issues could drag down the performance of the entire cluster.

| username: residentevil | Original post link

It seems that network monitoring here has to be done by having each physical machine periodically ping the others, checking latency and flagging anomalies, as in the rough sketch below.
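A minimal sketch of that mesh-ping idea (assuming Linux `ping` output; the peer list and the 1ms threshold are illustrative, not from this thread):

```python
# Run on each host with the full peer list; shells out to the system
# `ping` and flags any link whose average RTT exceeds the threshold.
import subprocess

PEERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # placeholder host list
THRESHOLD_MS = 1.0                            # illustrative threshold

def ping_avg_ms(host: str, count: int = 5) -> float:
    """Average RTT in ms, parsed from Linux `ping -q` summary output."""
    out = subprocess.run(
        ["ping", "-c", str(count), "-q", host],
        capture_output=True, text=True, check=True,
    ).stdout
    # Final line looks like: rtt min/avg/max/mdev = 0.051/0.069/0.093/0.015 ms
    stats = out.strip().splitlines()[-1].split("=")[1].strip()
    return float(stats.split("/")[1])

for peer in PEERS:
    try:
        avg = ping_avg_ms(peer)
        flag = "  <-- check this link" if avg > THRESHOLD_MS else ""
        print(f"{peer}: avg {avg:.3f} ms{flag}")
    except subprocess.CalledProcessError:
        print(f"{peer}: total packet loss or unreachable")
```

In practice you would run something like this from cron on every node, so both directions of every link get covered.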

| username: 人如其名 | Original post link

  1. It is best not to deploy a cluster across data centers. If you must deploy across data centers within the same city, try to keep the distance within 50 kilometers. For multi-center deployments within the same city, it is recommended to use TiCDC to synchronize multiple clusters. For cross-region deployments, it is strongly advised to use enterprise services; after all, if you are running multiple centers, cost should not be a significant issue. Otherwise you may run into numerous problems handling it yourself.

  2. Within a single data center, network latency is generally below 1ms (in practice it usually averages between 50 and 200 microseconds). As long as you are not passing through a firewall and the network bandwidth is 10Gbps, you do not need to worry about this; it should be sufficient.

  3. In a storage-compute separated setup, the network is one aspect of data interaction, and the database’s RPC handling capability is another. Even with ample network bandwidth, processing can still be slow because of the RPC handling mechanism. For frequent data requests that cannot be pushed down to TiKV for filtering (such as full table scans, or large hash joins that must move data to the compute nodes), it is strongly recommended to deploy dedicated TiDB servers so they do not interfere with online transactions.

| username: 有猫万事足 | Original post link

The main issue is that you aren’t using Grafana; otherwise this graph would already be available. The Grafana deployed with the cluster includes a Blackbox_exporter dashboard, which covers the latency between any server and the other servers in the cluster.
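A minimal sketch of pulling the same latency data directly from Prometheus (assuming a TiUP-style deployment where blackbox_exporter's ICMP probes feed Prometheus; the Prometheus address below is a placeholder, and `probe_duration_seconds` is the exporter's standard probe-latency metric):

```python
# Query Prometheus's HTTP API for blackbox probe latency per target.
import json
import urllib.parse
import urllib.request

PROM = "http://10.0.0.10:9090"    # placeholder Prometheus address
QUERY = "probe_duration_seconds"  # blackbox_exporter probe latency

url = f"{PROM}/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
with urllib.request.urlopen(url, timeout=5) as resp:
    data = json.load(resp)

for series in data["data"]["result"]:
    target = series["metric"].get("instance", "?")
    seconds = float(series["value"][1])
    print(f"{target}: {seconds * 1000:.2f} ms")
```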

| username: Jellybean | Original post link

Best practice experience from the expert

| username: chenhanneu | Original post link

Is this the server pinging others or other servers pinging this one for latency?

| username: 随缘天空 | Original post link

Typically it is within a few milliseconds. As for network packet loss, you can increase network bandwidth; use 10 Gigabit Ethernet to minimize it as much as possible.

| username: 像风一样的男子 | Original post link

Latency within the same data center is less than 0.5ms, within the same city across multiple data centers is less than 2ms, and across different cities is less than 5ms, for reference only.

| username: residentevil | Original post link

This reply is very professional :+1:

| username: 有猫万事足 | Original post link

Both.

You can select the host to ping from above.
The chart shows the latency to each machine within the cluster.
So it is a full mesh and bidirectional: you can observe the ping values from any host to the other machines in the cluster.

My TiFlash and TiKV are not in the same subnet. You can see there is some latency.

| username: residentevil | Original post link

Which monitoring metric is this in? Is it available in version v6.1.7?

| username: 有猫万事足 | Original post link

First, you need to have the blackbox_exporter process running on each machine.

Then, look for it in the targets.

Once you find it, click on it to open.
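If you want a quick command-line check that the exporter is actually up on every machine, here is a small sketch (9115 is blackbox_exporter's default port; the host list is illustrative):

```python
# Probe each host's blackbox_exporter metrics endpoint.
import urllib.request

HOSTS = ["10.0.0.1", "10.0.0.2"]  # placeholder host list

for host in HOSTS:
    url = f"http://{host}:9115/metrics"
    try:
        with urllib.request.urlopen(url, timeout=3) as resp:
            print(f"{host}: blackbox_exporter up (HTTP {resp.status})")
    except OSError as exc:
        print(f"{host}: not reachable ({exc})")
```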

| username: residentevil | Original post link

Found it. With this monitoring in place, we can pinpoint the network latency issue.

| username: 人如其名 | Original post link

It should be considered together with network traffic. Using ping to measure network latency only gives a rough estimate, because ICMP has a relatively low priority and may not be accurate. A better approach is a database-level ping (db.ping), which runs over TCP, gets higher priority, and is more accurate, although it is not available on the monitoring panel. Under heavy network traffic, ping latency may look high while the actual delay is not that significant, so a comprehensive analysis is needed. A rough TCP-level sketch follows.
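A minimal sketch of the TCP-level idea (the peer IPs and TiDB's default port 4000 are placeholders; timing a TCP handshake approximates RTT at TCP priority rather than ICMP):

```python
# Time a TCP handshake to a peer's service port as a proxy for RTT.
import socket
import time

def tcp_rtt_ms(host: str, port: int, timeout: float = 1.0) -> float:
    """The TCP handshake costs one round trip, so connect time ~ RTT."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000

for peer in ["10.0.0.1", "10.0.0.2"]:  # placeholder peer IPs
    try:
        print(f"{peer}: {tcp_rtt_ms(peer, 4000):.2f} ms")
    except OSError as exc:
        print(f"{peer}: unreachable ({exc})")
```

A real db.ping would go one step further and issue a MySQL-protocol ping on an established connection (e.g. a driver's `ping()` method), which also exercises the server's handling path.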

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.