A sudden spike in read traffic on a TiKV node

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIKV的某个节点读流量瞬间飙升

| username: return_sl

[TiDB Usage Environment] Production Environment
[TiDB Version] 5.7.25-TiDB-v5.2.4
[Reproduction Path] Operations performed that led to the issue
[Encountered Issue: Symptoms and Impact]
The TiDB cluster is deployed on Alibaba Cloud servers, with 2 TiDB nodes, 3 TiKV nodes, and 1 TiFlash node. The read traffic on one of the TiKV nodes suddenly spiked to 300 MB/s. This has happened several times in the past few days, and the traffic stays at 300 MB/s unless the machine is restarted. However, we have not identified any related large SQL queries on the business side, and the health of the table statistics is above 80%. Please advise on troubleshooting steps.

[Resource Configuration] Navigate to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]

| username: 像风一样的男子 | Original post link

This is a typical read hotspot issue. You can check which SQL statements consumed the most resources during that time window in the Top SQL section of the Dashboard.
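If the Dashboard is hard to reach, a rough equivalent of Top SQL can be pulled from the statements summary tables. A minimal sketch (column names per the TiDB `information_schema` documentation; the ranking expression is just one reasonable proxy for read volume):

```sql
-- Statements that scanned the most keys recently, per TiDB instance.
-- AVG_PROCESSED_KEYS * EXEC_COUNT approximates total read pressure on TiKV.
SELECT INSTANCE, DIGEST_TEXT, EXEC_COUNT,
       AVG_PROCESSED_KEYS, SUM_LATENCY
FROM information_schema.CLUSTER_STATEMENTS_SUMMARY
ORDER BY AVG_PROCESSED_KEYS * EXEC_COUNT DESC
LIMIT 10;
```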

| username: return_sl | Original post link

The key issue is that there are no obviously problematic SQL queries: all queries finish within 10 seconds, and memory usage does not exceed 1 GB.

| username: madcoder | Original post link

Then just run iotop on the machine to see which process is doing the most I/O.

| username: zhaokede | Original post link

Check if the business is concentrated on a node, forming a hotspot.
The hotspot needs to be dispersed.
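If reads do turn out to be concentrated on one store, one option for spreading them (available since v4.0, so applicable to this v5.2.4 cluster) is follower read, which lets TiDB serve reads from follower replicas as well as leaders:

```sql
-- Allow reads from follower replicas to spread a read hotspot
-- across all TiKV nodes. Follower reads stay strongly consistent
-- (via ReadIndex) but may add a small amount of tail latency;
-- revert to 'leader' (the default) if that becomes a problem.
SET GLOBAL tidb_replica_read = 'leader-and-follower';
```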

| username: tidb菜鸟一只 | Original post link

Let’s see a screenshot of the Dashboard’s traffic visualization (Key Visualizer) page.

| username: DBAER | Original post link

Check the traffic visualization on the dashboard.

| username: DBAER | Original post link

You can also take a look at this system view:

select * from information_schema.TIDB_HOT_REGIONS

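The same view can be narrowed to read hotspots and sorted by traffic, which is usually more readable than the full dump (column names per the `TIDB_HOT_REGIONS` documentation):

```sql
-- Top read-hot regions by traffic, with the table/index they belong to.
SELECT DB_NAME, TABLE_NAME, INDEX_NAME, REGION_ID, FLOW_BYTES
FROM information_schema.TIDB_HOT_REGIONS
WHERE TYPE = 'read'
ORDER BY FLOW_BYTES DESC
LIMIT 10;
```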
| username: Jellybean | Original post link

The spike in TiKV’s CPU and cloud disk read/write BPS indicates that a large amount of data is being accessed. To solve the problem, you need to first figure out who is accessing this data. The troubleshooting ideas can be as follows:

  1. First, check the Dashboard panel for topSQL, slow queries, SQL statement analysis, heatmaps, etc., to get an overall understanding of the cluster’s running status.
  2. During the period when the TiKV node’s resource usage is high, confirm the business access situation, check QPS and latency, and verify if there are changes in business access volume, number of access links, load, etc.
  3. If TiKV maintains high access for a long time and it is due to SQL access, there will definitely be slow SQL queries. You can try to find the corresponding slow SQL through the Dashboard or slow query logs and then analyze them.
  4. If there is no change in business load and no slow SQL, start focusing on the TiKV Grafana monitoring panel to analyze whether the issue is triggered by the cluster’s internal scheduling mechanism. Focus on analyzing TiKV’s GC operations, RocksDB Compaction, etc.
  5. Investigate the underlying cloud host’s basic environment issues.

You can try these troubleshooting ideas first.
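For step 3 above, the slow query log can also be queried through SQL. A sketch that surfaces the statements scanning the most keys during the spike (the time range is an illustrative placeholder to adjust):

```sql
-- Slow queries that touched the most keys during the spike window.
-- Replace the BETWEEN bounds with the actual incident window.
SELECT time, query_time, process_keys, total_keys, query
FROM information_schema.cluster_slow_query
WHERE time BETWEEN '2024-01-01 10:00:00' AND '2024-01-01 11:00:00'
ORDER BY process_keys DESC
LIMIT 10;
```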

| username: return_sl | Original post link

Traffic Visualization View

| username: DBAER | Original post link

Adjust the brightness setting to look at it; the areas with brighter colors and larger size are the hotspots.

| username: return_sl | Original post link

In the image above, the two yellow blocks at the top and bottom should be the hottest spots. Let me check those tables first.

| username: 路在何chu | Original post link

This is likely because most of the regions of the hotspot table are on that TiKV node. You can check the region distribution yourself.
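Whether the hot table's region leaders are piled up on one store can be checked from the region system tables. A sketch (the database and table names are placeholders for the table identified in the heatmap):

```sql
-- Count leader regions of the suspect table per TiKV store.
-- A heavy skew toward one STORE_ID confirms a leader hotspot.
SELECT p.STORE_ID, COUNT(*) AS leader_regions
FROM information_schema.TIKV_REGION_STATUS s
JOIN information_schema.TIKV_REGION_PEERS p
  ON s.REGION_ID = p.REGION_ID
WHERE s.DB_NAME = 'mydb'         -- placeholder database name
  AND s.TABLE_NAME = 'hot_table' -- placeholder table name
  AND p.IS_LEADER = 1
GROUP BY p.STORE_ID;
```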

| username: zhanggame1 | Original post link

The brighter the color, the hotter it is. You can click on a bright block for details.

| username: 健康的腰间盘 | Original post link

Scale up!

| username: kkpeter | Original post link

Check the hotspots in the dashboard.

| username: yytest | Original post link

You need to check the underlying logs.

| username: zhh_912 | Original post link

Take a look at GC
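GC progress and the current safe point can be inspected directly from the status table TiDB maintains; if `tikv_gc_last_run_time` lags far behind, a GC backlog could contribute to background read amplification:

```sql
-- GC-related status and settings stored by TiDB in mysql.tidb.
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM mysql.tidb
WHERE VARIABLE_NAME LIKE 'tikv_gc%';
```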