Uneven Distribution of Hot Read Region Leaders Causes Severe Skew in TiKV CPU Load

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: Hot Read Region Leader分布不均匀,导致TiKV CPU Load负载严重倾斜

| username: robert233

[TiDB Usage Environment]

  • Production Environment

[TiDB Version]

  • v4.0.12

[Encountered Issues]

  • Hot Read Region Leader distribution is uneven, causing severe skew in TiKV CPU Load.

  • Monitoring as follows:

  • Adjusted parameters:
    "hot-region-schedule-limit": 4
    "hot-region-cache-hits-threshold": 3

  • How should this be adjusted?
    How can the hot read regions be migrated to other instances?

| username: ddhe9527 | Original post link

Lower the two Load Base Split thresholds, and increase hot-region-schedule-limit.

| username: robert233 | Original post link

Sure, I’ll give it a try.

| username: robert233 | Original post link

  1. The overall QPS is not very high, as shown below

  2. Following the method suggested above, I changed:
    hot-region-schedule-limit: 4 -> 10
    split.qps-threshold: 3000 -> 400
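
For readers who want to reproduce the two changes above, here is a minimal sketch that only builds the command strings; it does not run them. It assumes PD settings are changed with `pd-ctl config set` and that the TiKV threshold is changed online with a `SET CONFIG tikv` SQL statement, which may be experimental or unavailable depending on the version. The PD address and the helper names are hypothetical.

```python
# Sketch: build (not execute) the commands for the two config changes.
# PD_ADDR and both helper functions are illustrative assumptions.

PD_ADDR = "http://127.0.0.1:2379"  # hypothetical PD endpoint


def pd_config_set(pd_addr: str, key: str, value: int) -> str:
    """Return a pd-ctl command line that sets one PD scheduling option."""
    return f"pd-ctl -u {pd_addr} config set {key} {value}"


def tikv_set_config_sql(key: str, value: int) -> str:
    """Return a SQL statement that changes a TiKV config item online.

    Assumes online config modification is available in the cluster version.
    """
    return f"SET CONFIG tikv `{key}` = {value};"


commands = [
    pd_config_set(PD_ADDR, "hot-region-schedule-limit", 10),
    tikv_set_config_sql("split.qps-threshold", 400),
]

for cmd in commands:
    print(cmd)
```

The first string is meant for a shell with pd-ctl installed, the second for a SQL client connected to TiDB.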

Judging from the hot region monitoring, no read leader has been transferred.

| username: h5n1 | Original post link

Check the CPU utilization of the coprocessor and read pool threads under TiKV-Details -> Thread CPU. Also, check for slow SQL statements.

| username: h5n1 | Original post link

Methods to address hotspot issues:

  1. Use pd-ctl operator to manually transfer the leaders of hot regions to less busy nodes. The top read regions can be viewed with pd-ctl hot read or the information_schema.TIDB_HOT_REGIONS table (names quoted from memory).
  2. Use pd-ctl operator to manually split hot regions and then wait for scheduling. Splitting should be effective when the hotspot is concentrated on a few keys.
  3. Check whether the leader distribution is even and whether any leader weight settings are causing uneven distribution.
  4. Scatter scheduling can be added for a table to spread all of its regions evenly across all nodes, which suits scenarios with many hot regions. However, scatter scheduling sometimes does not work well.
  5. Use shuffle leader scheduling to randomly swap leaders, but remove it shortly after enabling it; otherwise, continuous random leader scheduling can severely impact performance.
  6. Adjust the table structure using AUTO_RANDOM, SHARD_ROW_ID_BITS, hash partitioning, and other methods.
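
Methods 1 and 2 can be sketched as a small planner that turns hot region data into pd-ctl operator commands. This is only an illustration under stated assumptions: the region IDs, store IDs, and load numbers are made up, the `HotRegion` type and both functions are hypothetical helpers, and the greedy rule (move a leader only if the target ends up strictly less loaded than the source is now) is one simple choice, not PD's own scheduling logic. A leader can only be transferred to a store that already holds a peer of the region, so the peer list is part of the input.

```python
from typing import Dict, List, NamedTuple, Tuple


class HotRegion(NamedTuple):
    region_id: int
    leader_store: int
    peer_stores: Tuple[int, ...]  # stores holding a replica of the region
    read_load: int                # any comparable read metric, e.g. bytes/s


def plan_leader_transfers(regions: List[HotRegion],
                          store_load: Dict[int, int],
                          max_moves: int = 2) -> List[str]:
    """Greedily move the hottest leaders to their least loaded peer store."""
    load = dict(store_load)  # work on a copy
    cmds: List[str] = []
    for r in sorted(regions, key=lambda x: x.read_load, reverse=True):
        if len(cmds) >= max_moves:
            break
        candidates = [s for s in r.peer_stores if s != r.leader_store]
        if not candidates:
            continue
        target = min(candidates, key=lambda s: load[s])
        # Only move if the target stays below the source's current load.
        if load[target] + r.read_load < load[r.leader_store]:
            load[r.leader_store] -= r.read_load
            load[target] += r.read_load
            cmds.append(f"operator add transfer-leader {r.region_id} {target}")
    return cmds


def split_cmd(region_id: int) -> str:
    """pd-ctl command that asks PD to split one region."""
    return f"operator add split-region {region_id} --policy=approximate"


# Hypothetical data: store 1 holds every hot read leader.
stores = {1: 900, 4: 200, 5: 100}
hot = [
    HotRegion(101, 1, (1, 4, 5), 400),
    HotRegion(102, 1, (1, 4, 5), 300),
    HotRegion(103, 1, (1, 4, 5), 200),
]
plan = plan_leader_transfers(hot, stores)
for line in plan:
    print(line)
```

The printed lines are meant to be pasted into an interactive pd-ctl session. For method 5, the corresponding pd-ctl commands are `scheduler add shuffle-leader-scheduler` and `scheduler remove shuffle-leader-scheduler`.
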
| username: xiaohetao | Original post link

It feels like this is the general-purpose procedure for handling hotspots.

| username: alfred | Original post link

These steps are indeed a general-purpose approach :+1:
