High CPU Load on Single TiKV Node

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiKV单结点CPU负载高

| username: TiDBer_zarFUlCo

[TiDB Usage Environment] Production Environment
[TiDB Version] 6.1.6
[Reproduction Path]
[Situation]
a. One of the three TiKV nodes has had a high CPU load since the project started. The gap became several times larger on June 26.
b. The project is doing large-scale data writes. The three main tables currently hold 2.2 billion, 1 billion, and 700 million rows respectively. Shard/scatter optimization has been applied, but write hotspots still occur and the write speed fluctuates.
c. Regions and leaders are evenly distributed across the three TiKV nodes.
d. The three PD nodes have not been rebalancing the load. An attempt to trigger load balancing further reduced the write speed.
e. All nodes are on the same physical machine.
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page


[Attachments: Screenshots/Logs/Monitoring]


| username: 有猫万事足 | Original post link

I suspect a write hotspot.
Go straight to the TiDB Dashboard, open Traffic Visualization (Key Visualizer), and look at the write-bytes heatmap for the corresponding time period.

The leader and region counts being balanced across the three TiKV nodes does not mean that the regions and leaders of a single table are balanced across them, so those graphs alone cannot rule out a write hotspot.

Could you also describe how the scatter optimization was performed?
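Besides the heatmap, a quick SQL check of PD's hot-region statistics can also show whether the writes concentrate on one table or index. This is just a sketch; the exact columns of this system table can differ slightly between versions.

```sql
-- Hottest write flows as currently seen by PD, grouped by table/index
SELECT db_name, table_name, index_name, region_count, flow_bytes
FROM information_schema.tidb_hot_regions
WHERE type = 'write'
ORDER BY flow_bytes DESC
LIMIT 10;
```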

| username: tidb菜鸟一只 | Original post link

Take a look at the Statistics - hot write monitoring on Grafana.

| username: TiDBer_zarFUlCo | Original post link

The write hotspot has probably always been there, but starting from around 4 PM on June 26, the CPU load on the 192.168.10.9 machine went from being 10-20% higher than the other two nodes to 50-100% higher.

| username: TiDBer_zarFUlCo | Original post link

The write hotspot has existed since the project launched, but for the first two weeks after launch the CPU load on the busiest node was only 10-20% higher than the other two nodes. Since June 26 it has been up to 100% higher.

The table's ID is generated with UUID.
The table has indexes on a timestamp column and an int column.
The CREATE TABLE statement includes SHARD_ROW_ID_BITS=4 and PRE_SPLIT_REGIONS=4.
After creating the table, we executed ALTER TABLE {sheet} ATTRIBUTES 'merge_option=deny'.

In April this year we ran the same workload without these optimizations, with a write volume comparable to now, and the write efficiency was less than half of what it is now.
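For reference, a minimal sketch of a table definition matching that description. The table and column names are made up, and the NONCLUSTERED primary key is an assumption: SHARD_ROW_ID_BITS only applies to tables that use the hidden _tidb_rowid, i.e. tables without a clustered primary key.

```sql
CREATE TABLE sensor_data (            -- hypothetical name
    id     VARCHAR(36) NOT NULL,      -- UUID generated by the application
    ts     TIMESTAMP   NOT NULL,
    metric INT         NOT NULL,
    PRIMARY KEY (id) NONCLUSTERED,    -- keeps the hidden _tidb_rowid, so SHARD_ROW_ID_BITS takes effect
    KEY idx_ts (ts),                  -- note: written in time order, so this index can itself stay hot
    KEY idx_metric (metric)
) SHARD_ROW_ID_BITS = 4 PRE_SPLIT_REGIONS = 4;

-- Keep the pre-split regions from being merged back together
ALTER TABLE sensor_data ATTRIBUTES 'merge_option=deny';
```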

| username: tidb菜鸟一只 | Original post link

Extend the time range on the Statistics - balance panels and check whether the balance of 10.9 starts to look abnormal from around June 27.

| username: zhanggame1 | Original post link

With three replicas, the write cost on the three TiKV nodes should be similar, so I think there might be a read hotspot. Have you enabled Follower Read? If not, you can give it a try.
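A minimal sketch of turning it on cluster-wide; use the session scope instead if you only want to test it first:

```sql
-- New connections will be allowed to read from follower replicas
SET GLOBAL tidb_replica_read = 'follower';
-- or 'leader-and-follower' to spread reads across all replicas
```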

| username: 有猫万事足 | Original post link

SHARD_ROW_ID_BITS=4 PRE_SPLIT_REGIONS=4

That number of pre-splits may still be a bit conservative for the amount of data you are importing.
For my single table with over 200 million rows, I used
SHARD_ROW_ID_BITS=5 PRE_SPLIT_REGIONS=5
and when importing that table with DM, the 4 TiKV nodes were balanced and their CPUs were almost fully utilized.

I haven't experimented with the
SHARD_ROW_ID_BITS=4 PRE_SPLIT_REGIONS=4
combination. However, I can confirm that with
SHARD_ROW_ID_BITS=3 PRE_SPLIT_REGIONS=2
one of the 4 TiKV nodes sat at only 10-20% CPU, completely just watching the other 3 TiKV nodes work.
Your single-table data volume is much larger than mine and your performance is better than mine, so I think you can keep increasing the number of pre-splits and observe whether the TiKV CPUs can be fully utilized together.
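For an existing table, a sketch of what increasing the splits can look like. The table name is the hypothetical one from the earlier sketch, and the bit count and region count are example values, not recommendations:

```sql
-- Only affects the _tidb_rowid of newly inserted rows
ALTER TABLE sensor_data SHARD_ROW_ID_BITS = 6;

-- PRE_SPLIT_REGIONS only takes effect at CREATE TABLE time; for an existing
-- table you can split its row-id range manually instead, e.g. into 32 regions
SPLIT TABLE sensor_data BETWEEN (0) AND (9223372036854775807) REGIONS 32;
```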

| username: liuis | Original post link

Is there a hot node?

| username: h5n1 | Original post link

Use pd-ctl store weight <store_id> <leader_weight> <region_weight> to slightly lower the leader weight of the TiKV store with high CPU usage, then observe the balance situation.
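A sketch of what that looks like with tiup's bundled pd-ctl; the PD address, store id, and weights are placeholders to replace with your own values:

```shell
# Find the store id of 192.168.10.9, then lower only its leader weight a little
tiup ctl:v6.1.6 pd -u http://<pd-address>:2379 store
tiup ctl:v6.1.6 pd -u http://<pd-address>:2379 store weight <store-id> 0.8 1   # leader weight 0.8, region weight 1
```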

| username: TiDBer_zarFUlCo | Original post link

[The original reply contained only a screenshot, which is not available in this translation.]

| username: TiDBer_zarFUlCo | Original post link

Would that still have an impact even though no related business reads are hitting the database right now?

| username: TiDBer_zarFUlCo | Original post link

I just stopped all external operations on the database, but this node is still under high load, and the slow query log shows a large number of internal SQL statements.
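One way to see which internal statements those are, grouped by statement digest; a sketch only, and the exact column set of the slow-query table can vary between versions:

```sql
-- Heaviest internal slow queries of the last hour, e.g. auto-analyze jobs
SELECT MIN(query) AS sample_query,
       COUNT(*)   AS cnt,
       ROUND(SUM(query_time), 1) AS total_time_s
FROM information_schema.cluster_slow_query
WHERE is_internal = 1
  AND time > NOW() - INTERVAL 1 HOUR
GROUP BY digest
ORDER BY total_time_s DESC
LIMIT 10;
```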

| username: tidb菜鸟一只 | Original post link

Use SHOW ANALYZE STATUS to check if statistics are being collected. You can pause the automatic statistics collection tasks for the corresponding three large tables.
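A sketch of those checks, using the hypothetical table name big_table_1:

```sql
-- Is an (auto-)analyze job currently running or recently finished?
SHOW ANALYZE STATUS;

-- How healthy TiDB considers the table's statistics (0-100); a low value
-- will keep triggering auto-analyze on it
SHOW STATS_HEALTHY WHERE Table_name = 'big_table_1';
```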

| username: redgame | Original post link

Consider using partitioned tables or adjusting the data distribution strategy.

| username: TiDBer_zarFUlCo | Original post link

I tried increasing SHARD_ROW_ID_BITS and left it overnight. Using SHOW TABLE {sheet} REGIONS, I observed that the number of regions the table has on each node changed, but the issue still persists.
Change in region count for the 2.2-billion-row table:
192.168.10.9 (high load): 14681 → 11158
192.168.10.10: 12100 → 13737
192.168.10.11: 11976 → 13996
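To double-check table-level balance, the per-store region and leader counts of a single table can also be pulled from the system tables. A sketch, with hypothetical schema/table names; the counts are approximate because a region spanning both row and index data is listed more than once:

```sql
SELECT p.store_id,
       COUNT(*)             AS region_peers,
       SUM(p.is_leader = 1) AS leaders
FROM   information_schema.tikv_region_status s
JOIN   information_schema.tikv_region_peers  p ON s.region_id = p.region_id
WHERE  s.db_name = 'mydb' AND s.table_name = 'big_table_1'
GROUP  BY p.store_id;
```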

| username: TiDBer_zarFUlCo | Original post link

It doesn’t seem like it.

| username: tidb菜鸟一只 | Original post link

At 4 PM yesterday, wasn't the 2.2-billion-row table running automatic statistics collection? There should have been a lot of internal, statistics-related SQL at that time. Did the three large tables you mentioned (2.2 billion, 1 billion, and 700 million rows) receive a lot of data over the past few days? You could consider pausing automatic statistics collection for these three tables first and collecting statistics in one pass after the data has been fully inserted (see the sketch below).
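A sketch of the knobs available in v6.1 for this; as far as I know there is no per-table switch in this version, so these are global settings, and the table names are hypothetical:

```sql
-- Either confine auto-analyze to an off-peak window ...
SET GLOBAL tidb_auto_analyze_start_time = '01:00 +0800';
SET GLOBAL tidb_auto_analyze_end_time   = '05:00 +0800';

-- ... or switch auto-analyze off entirely while the bulk load runs
SET GLOBAL tidb_enable_auto_analyze = OFF;

-- Then collect statistics manually once the data is fully inserted
ANALYZE TABLE big_table_1, big_table_2, big_table_3;
```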

| username: zhouzeru | Original post link

  1. Insufficient hardware resources: since all nodes are on the same physical machine, CPU, memory, or disk may be running short, which can drive up TiKV's CPU usage and make the write speed fluctuate. Check the physical machine's resource usage to see whether there is a shortage.
  2. Write hotspot: you mentioned write hotspots; they can overload individual TiKV nodes, raising CPU usage and making the write speed fluctuate. Try TiDB's automatic or manual region splitting and scattering to distribute the hot data evenly across the TiKV nodes and relieve the load.
  3. Network latency: since all nodes are on the same physical machine, network latency may become an issue and affect TiKV's performance and throughput. Consider distributing the TiKV nodes across different physical machines to reduce it.
  4. Improper PD load balancing: you mentioned that trying PD load balancing further reduced the write speed. This may be because the scheduling strategy overloads some TiKV nodes. You can adjust the scheduling parameters and retry, so that the load is spread more evenly across the TiKV nodes (see the sketch below).
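A sketch of inspecting and tuning PD scheduling with pd-ctl; the limits shown are example values, not recommendations:

```shell
tiup ctl:v6.1.6 pd -u http://<pd-address>:2379 scheduler show            # which schedulers are running
tiup ctl:v6.1.6 pd -u http://<pd-address>:2379 config show               # current scheduling config
tiup ctl:v6.1.6 pd -u http://<pd-address>:2379 config set leader-schedule-limit 4
tiup ctl:v6.1.6 pd -u http://<pd-address>:2379 config set region-schedule-limit 8
```
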
| username: h5n1 | Original post link

Try using pd-ctl store weight <store_id> <leader_weight> <region_weight> to gradually lower the leader weight of the TiKV store with high CPU usage, then check how the balance changes.