Cluster Monitoring and Analysis

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 集群监控分析

| username: 随缘天空

[TiDB Usage Environment] Production Environment
[TiDB Version] V7.1.1
[Reproduction Path] On the Overview page of the dashboard, the IO monitoring chart shows significantly different read performance across the three TiKV nodes. Is this caused by a read hotspot? Judging from the picture, is load balancing not working? One machine appears to be doing no work, and the load difference between the other two is quite large.
[Encountered Problem: Problem Phenomenon and Impact]
[Resource Configuration]
[Attachment: Screenshot/Log/Monitoring]

| username: 裤衩儿飞上天 | Original post link

  1. Is the data volume of the entire cluster large or small? Check whether the leader distribution is balanced. By default, TiDB sends both reads and writes to the Region leader.
  2. If you suspect a hotspot issue, post the heatmap. However, at your volume (the average read is only 20K), a hotspot might not show up on the heatmap.

| username: 随缘天空 | Original post link

The data volume is not large, with only about 360,000 rows across all tables.


| username: 裤衩儿飞上天 | Original post link

  1. The entire cluster has only 11 Regions, with the leaders distributed 5, 5, and 1. Once you exclude the Regions used by the system, there is very little data in your cluster. Add more tables and data; the current amount is too small for the monitoring to be a useful reference (and balance is never absolute anyway).
  2. The yellow dots on the heatmap indicate hot reads, but with so little data they do not mean much. Run a test for a while, such as ten minutes, half an hour, or an hour, and the pattern will become more apparent.
    For specific handling methods, you can refer to: TiDB Hot Spot Issue Handling | PingCAP Documentation Center
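
To make the point about the 5, 5, 1 leader distribution concrete, here is a minimal Python sketch that computes each store's share of leaders and the max-to-min ratio. The store names (`tikv-1` and so on) and the helper functions are hypothetical and only for illustration; the counts are the ones quoted above. This is not how PD itself scores balance.

```python
def leader_shares(leader_counts):
    """Return each store's fraction of all Region leaders."""
    total = sum(leader_counts.values())
    if total == 0:
        return {store: 0.0 for store in leader_counts}
    return {store: count / total for store, count in leader_counts.items()}


def imbalance_ratio(leader_counts):
    """Return max leader count divided by min leader count.

    Returns infinity when at least one store holds no leaders.
    """
    counts = list(leader_counts.values())
    lowest = min(counts)
    if lowest == 0:
        return float("inf")
    return max(counts) / lowest


# Leader counts from the thread; store names are hypothetical.
counts = {"tikv-1": 5, "tikv-2": 5, "tikv-3": 1}
shares = leader_shares(counts)

for store, share in shares.items():
    print(f"{store}: {counts[store]} leaders ({share:.0%} of total)")
print(f"max/min leader ratio: {imbalance_ratio(counts):.1f}")
```

With only 11 Regions in total, moving a single leader changes a store's share by about 9 percentage points, which is why numbers at this scale say little about whether balancing is working.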

| username: zhanggame1 | Original post link

With only KB-level data, the monitoring is not worth reading. Run a stress test and watch the monitoring while it runs.

| username: tidb菜鸟一只 | Original post link

That TiKV node holds only one leader, so it will not show much IO usage; there is effectively no data for it to serve. Generate more data, query it, and you will see IO usage on that node.
