Five TiKV Instances Frequently Experience Very High CPU Load Randomly

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 5个tikv经常随机出现cpu负载非常高

| username: beacoolkid

[TiDB Usage Environment] Production Environment
[TiDB Version]
[Reproduction Path] Operations performed that led to the issue
[Encountered Issue: Problem Phenomenon and Impact]
CPU load frequently spikes very high on a random TiKV node. Only one TiKV node is affected at a time, but reads and writes slow down across the whole cluster.
Each server has 28 cores and 128 GB of memory.

[Resource Configuration]


[Attachments: Screenshots/Logs/Monitoring]

| username: tidb菜鸟一只 | Original post link

If the CPU load of a particular TiKV is high, first check if there is a hotspot…

| username: xfworld | Original post link

Of the three nodes, one is fully loaded and the other two are relatively idle…

| username: TiDBer_小阿飞 | Original post link

  1. Check the heatmap to see which table or index is causing the issue.
  2. Use pd-ctl to check:
    region topread 20
    region topwrite 20
    
    Or:
    hot read
    hot write
    
    Based on the hot read results, identify the hot regions whose leader is on the heavily loaded store, then use one of the following methods to find the table and index each region belongs to:
    Method 1:
    select * from information_schema.TIKV_region_status where region_id in (123,456,789);
    
    Method 2:
    curl http://{TiDBIP}:10080/regions/{RegionId}
    
  3. If the issue is a read hotspot and the table has been identified, it is relatively easy to resolve:
    1. Slow query:
      Optimize the slow query. If changing the SQL is inconvenient, you can bind an execution plan instead.
    2. Read hotspot on a small table:
      Use Load Base Split.

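
The steps above can be sketched as a small script. This is only a rough illustration, assuming the JSON printed by pd-ctl region topread has a regions array whose entries carry id, leader.store_id and read_bytes; check that against the output of your own PD version. The function names and the sample data are made up for the example.

```python
import json
from collections import defaultdict
from typing import Dict, List, Tuple


def read_load_by_leader_store(topread_output: str) -> List[Tuple[int, int, List[int]]]:
    """Sum read_bytes per leader store; return (store_id, bytes, region_ids), busiest first.

    Assumes the JSON shape described above (an assumption, not verified for every version).
    """
    data = json.loads(topread_output)
    total_bytes: Dict[int, int] = defaultdict(int)
    region_ids: Dict[int, List[int]] = defaultdict(list)
    for region in data.get("regions", []):
        store_id = (region.get("leader") or {}).get("store_id")
        if store_id is None:
            continue  # skip regions that report no leader
        total_bytes[store_id] += region.get("read_bytes", 0)
        region_ids[store_id].append(region["id"])
    return sorted(
        ((sid, total_bytes[sid], region_ids[sid]) for sid in total_bytes),
        key=lambda row: row[1],
        reverse=True,
    )


def region_status_sql(ids: List[int]) -> str:
    """Build the Method 1 lookup for a list of region IDs."""
    if not ids:
        raise ValueError("need at least one region id")
    id_list = ",".join(str(int(i)) for i in ids)  # int() rejects non-numeric input
    return (
        "select * from information_schema.tikv_region_status "
        f"where region_id in ({id_list});"
    )


# Made-up sample standing in for real pd-ctl output.
SAMPLE = json.dumps({"count": 3, "regions": [
    {"id": 123, "leader": {"id": 1, "store_id": 4}, "read_bytes": 900},
    {"id": 456, "leader": {"id": 2, "store_id": 5}, "read_bytes": 100},
    {"id": 789, "leader": {"id": 3, "store_id": 4}, "read_bytes": 300},
]})

ranking = read_load_by_leader_store(SAMPLE)
busiest_store, busiest_bytes, busiest_regions = ranking[0]
print(busiest_store, busiest_bytes)
print(region_status_sql(busiest_regions))
```

If one store holds most of the read bytes in the top-N list, the generated query shows which tables and indexes its hot regions belong to.
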
| username: beacoolkid | Original post link

The hotspots should be very balanced.

| username: beacoolkid | Original post link

Checked, the hotspots are well balanced.

| username: Billmay表妹 | Original post link

Is there a mixed deployment?

| username: beacoolkid | Original post link

Moreover, according to the monitoring, the nodes with hotspots do not have high loads.

| username: beacoolkid | Original post link

There are no independent TiKV nodes; it just happened suddenly.

| username: TiDBer_小阿飞 | Original post link

Are there any errors or anomalies in the TiKV node logs?

| username: Billmay表妹 | Original post link

Use this to collect logs, then grant viewing permission to the people answering your question so that they have enough information to help.

| username: h5n1 | Original post link

Check tikv-detail → thread cpu

| username: beacoolkid | Original post link

Is version 4.0 supported?

| username: tidb菜鸟一只 | Original post link

It looks like some queries are all falling on one TiKV, right…

| username: h5n1 | Original post link

Read hotspots are switching back and forth between TiKV nodes. Check information_schema.tidb_hot_regions, or use the pd-ctl hot read and hot write commands, to see the hot regions.

| username: beacoolkid | Original post link

The store ID with high CPU load is currently 329715201.

| username: h5n1 | Original post link

Determine which table it is based on information_schema.tikv_region_status.
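
One way to do that for a whole store at once, rather than region by region, is to join tikv_region_status with information_schema.tikv_region_peers. The sketch below only builds the SQL text; the helper name is made up, and the column names (store_id, is_leader, db_name, table_name, index_name, read_bytes) are what I expect these tables to expose, so confirm them on your version, especially on 4.0. read_bytes is a per-region figure and a region can span several tables, so treat the totals as a rough ranking only.

```python
def tables_led_by_store_sql(store_id: int, limit: int = 20) -> str:
    """Build a query listing tables whose region leaders sit on one store.

    Hypothetical helper: it only formats SQL, it does not connect to TiDB.
    """
    store_id = int(store_id)  # int() rejects anything that is not a plain number
    limit = int(limit)
    return (
        "select s.db_name, s.table_name, s.index_name, "
        "count(*) as leader_regions, sum(s.read_bytes) as total_read_bytes "
        "from information_schema.tikv_region_peers p "
        "join information_schema.tikv_region_status s on s.region_id = p.region_id "
        f"where p.store_id = {store_id} and p.is_leader = 1 "
        "group by s.db_name, s.table_name, s.index_name "
        "order by total_read_bytes desc "
        f"limit {limit};"
    )


# Store ID taken from the post above; paste the printed query into any MySQL client.
query = tables_led_by_store_sql(329715201)
print(query)
```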

| username: beacoolkid | Original post link

By using topread and topwrite, you can identify many tables.

| username: 路在何chu | Original post link

Are the CPU models the same?