Abnormally High Network Traffic Between TiKV Nodes

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv 节点间网络流量异常的大

| username: starsky

[TiDB Usage Environment] Production Environment
[TiDB Version] V6.1.5
[Reproduction Path] This phenomenon has always been present
[Encountered Problem: Problem Phenomenon and Impact]
From the monitoring, the inbound network traffic of the TiKV node reaches a maximum of 20.91 MB/s and a minimum of 4.74 MB/s, totaling 300 GB+ in a day. However, the actual storage on this TiKV only increased by 16 GB, so the traffic looks abnormally high. I couldn’t find any anomalies in the scheduling tasks. Does anyone know how to investigate this further?
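
For reference, the per-host network panels in Grafana come from node_exporter, so the same numbers can be pulled straight from Prometheus to compare inbound and outbound bytes/s per instance. This is only a sketch: the Prometheus address is a placeholder, and the metric names assume a reasonably recent node_exporter.

curl -sG 'http://<prometheus-host>:9090/api/v1/query' --data-urlencode 'query=rate(node_network_receive_bytes_total{device!="lo"}[5m])'
curl -sG 'http://<prometheus-host>:9090/api/v1/query' --data-urlencode 'query=rate(node_network_transmit_bytes_total{device!="lo"}[5m])'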



[dmall_rdp_voucher-PD - Grafana (2023_11_15 16_03_34).html|attachment]

| username: tidb菜鸟一只 | Original post link

Check the scheduling status of the regions to see if there are any large-scale region migrations.
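
A quick way to check that is pd-ctl (the cluster version and PD address below are placeholders; adjust them to your environment): operator show lists the operators currently running, scheduler show lists the enabled schedulers, and store shows per-store leader/region counts and scores.

tiup ctl:v6.1.5 pd -u http://<pd-address>:2379 operator show
tiup ctl:v6.1.5 pd -u http://<pd-address>:2379 scheduler show
tiup ctl:v6.1.5 pd -u http://<pd-address>:2379 store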

| username: h5n1 | Original post link

Check the coprocessor executor count on the Overview or TiKV-Details page. If there is a corresponding increase at that time, it means some SQL is executing that transmits a large amount of data.
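
Those panels are backed by TiKV metrics in Prometheus, so the same check can be scripted. A minimal sketch, assuming the metric and label names used by the TiKV-Details dashboards (verify against your own panels, as they can differ by version):

curl -sG 'http://<prometheus-host>:9090/api/v1/query' --data-urlencode 'query=sum(rate(tikv_coprocessor_executor_count[1m])) by (instance, type)'
curl -sG 'http://<prometheus-host>:9090/api/v1/query' --data-urlencode 'query=sum(rate(tikv_coprocessor_request_duration_seconds_count[1m])) by (instance, req)'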

| username: Jellybean | Original post link

How is the QPS for business access? What about large SQL and large queries?

If there are large queries, a large amount of data will also be read from TiKV. In this case, it is necessary to look at the business read and write access together.
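
One way to spot such queries on the TiDB side is the slow query table, sorted by how many keys were processed in TiKV. A rough sketch, with host and port as placeholders and column names taken from information_schema.cluster_slow_query:

mysql -h <tidb-host> -P 4000 -u root -p -e "SELECT time, query_time, total_keys, process_keys, LEFT(query, 100) AS query_text FROM information_schema.cluster_slow_query WHERE time > DATE_SUB(NOW(), INTERVAL 1 DAY) ORDER BY process_keys DESC LIMIT 20;"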

| username: Fly-bird | Original post link

Troubleshoot based on the business traffic. If there is no business traffic and it is still this large, then that is what needs to be investigated.

| username: starsky | Original post link

The attached HTML file is the monitoring page related to region scheduling. It doesn’t seem high. Could you please take another look?

| username: starsky | Original post link

If it were traffic from business access, it should be between TiDB and TiKV, and TiKV’s outbound traffic should also be high. But the monitoring shows high inbound traffic on TiKV, which is a bit confusing.

| username: starsky | Original post link

I checked, and the business volume was also quite low before 8 o’clock, but the traffic at that time was not small either.

| username: starsky | Original post link

There are business operations, so it’s not easy to stop everything to check. Otherwise, I would really want to stop and take a look. :grinning:

| username: h5n1 | Original post link

Check pd.log on the PD leader and count the add-peer operators per store over time, roughly as follows:

grep 'operator finish' pd.log | grep 'add peer' | awk '{print $1, $2, $14}' | sed -e 's/:[0-9][0-9].[0-9][0-9][0-9]//g' | sort | uniq -c
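
In other words: the two greps keep only the completed add-peer operators in pd.log; awk '{print $1, $2, $14}' keeps the date/time fields plus the column identifying the operator/store; the sed strips seconds and milliseconds so entries group by minute; and sort | uniq -c counts how many add-peer operators finished in each minute.
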
| username: starsky | Original post link

I checked, there are no add peer related operations. On 11-06, there are only these:

At the same time, I checked all operator=xxx types on 11-06, and there are only these:

grep '2023\/11\/06' pd-2023-11-08T01-15-31.188.log | egrep 'operator="\\"[a-z-]{1,}' | awk '{print $6,$7,$9,$10}' | sed 's/\[takes=.*s\]//g' | sort | uniq

| username: h5n1 | Original post link

Is your high traffic referring to a single day being higher than usual? If that’s the case, it might be caused by Leader balancing.

Your 4-20 MB/s of traffic isn’t considered large, right?

| username: 舞动梦灵 | Original post link

You can refer to my previous experience below. However, the impact of this kind of issue seems minimal; you can compare against the usual daily usage over the past 7 days or a month.

Alibaba Cloud monitoring alerts showed that the usual traffic was very low, only a few megabytes, yet alerts kept firing for 400 MB to 700 MB. After the system operations team investigated, they found that certain machines were transmitting data to each other: PD was sending a few kilobytes to megabytes to KV, while KV to PD was in the hundreds of megabytes. I suspected an SQL issue, so I checked the slow queries and SQL traffic analysis in the dashboard, and one SQL statement kept showing up. Using show processlist and grepping for it, I found that this SQL was always running. I asked who was responsible for it and had them stop it. It turned out the program was stuck in an infinite loop because of an extra piece of data in a particular table, which caused it to issue the same query over and over.
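
For a similar check, the processlist of the whole cluster can be pulled with a single SQL statement instead of running show processlist on each TiDB node; a minimal sketch with host and port as placeholders:

mysql -h <tidb-host> -P 4000 -u root -p -e "SELECT instance, id, user, db, time, LEFT(info, 120) AS running_sql FROM information_schema.cluster_processlist WHERE info IS NOT NULL ORDER BY time DESC;"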

| username: starsky | Original post link

It’s always been this large. This TiDB cluster only has replace into/select operations, with no other business operations. The TiKV storage nodes grow by about 15 GB per day, but the traffic is this large, which doesn’t seem quite normal.

| username: heiwandou | Original post link

Check whether the QPS has increased along with the load.

| username: starsky | Original post link

For SQL queries, they normally go through the TiKV node holding the leader, right? Is there really that much data exchanged between TiKV nodes?
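
One way to see what each TiKV is actually receiving is to break the gRPC message rate down by request type per instance; a sketch assuming the metric name used by the TiKV-Details gRPC panels (treat it as an assumption and verify against your dashboard):

curl -sG 'http://<prometheus-host>:9090/api/v1/query' --data-urlencode 'query=sum(rate(tikv_grpc_msg_duration_seconds_count[1m])) by (instance, type)'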

| username: h5n1 | Original post link

Analyzing network anomalies through storage growth seems a bit unreasonable. TiKV has compaction that releases space.

The traffic that TiKV itself generates mostly comes from region/leader transfers, when region snapshots are sent.
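
The Snapshot row of the TiKV-Details dashboard shows this directly. As a rough command-line equivalent, the snapshot message rate can be queried from Prometheus; the metric and label names below are from memory of that panel and may differ by version, so treat them as assumptions:

curl -sG 'http://<prometheus-host>:9090/api/v1/query' --data-urlencode 'query=sum(rate(tikv_raftstore_raft_sent_message_total{type="snapshot"}[5m])) by (instance)'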

| username: starsky | Original post link

From the monitoring, there were very few operations before 8 o’clock, but the traffic was still around 5 MB/s.

| username: starsky | Original post link

Analyzing network traffic based on storage growth may not be entirely rigorous, but it can serve as a reference. Right now the difference between the two is around 20x, which is a bit outrageous. Is there any other way to analyze this more reasonably?

| username: h5n1 | Original post link

This means that after 8 o’clock the business volume starts to increase, and the regions perform load balancing.