High CPU Load on a Single TiKV Instance

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 单个tikv的cpu负载很高

| username: zhimadi

【TiDB Usage Environment】Production environment
【TiDB Version】v4.0.9
【Encountered Problem】The cluster has 3 TiKV nodes. Through the monitoring panel, it is often seen that the CPU load of a single TiKV is higher than the other two, and the IO of a single TiKV is always at 100%.
【Reproduction Path】Operations performed that led to the problem
【Problem Phenomenon and Impact】
Causes system instability, and a single slow SQL can easily trigger an avalanche event. Recently, one of the TiKV’s CPU usage exceeded 90%, causing user system lag and generating a lot of dirty data. Attempting to restart a single TiKV node was ineffective. The entire cluster needed to be restarted to resume production.
【Attachments】

Please provide the version information of each component, such as cdc/tikv, which can be obtained by executing cdc version/tikv-server --version.

| username: xfworld | Original post link

Is the number of regions held by the TiKV node with high load higher than the other two?

| username: zhimadi | Original post link

May I ask where to check the region metrics held by the TiKV nodes? If it’s in overview->tikv, the regions for all three machines are the same, max: 11.0K; current: 11.0K.

| username: xfworld | Original post link

You can check the TiKV metrics through Prometheus.

Alternatively, you can use the PD-CLI command line to view them.

The main question is: Are the resources balanced? If not, is it caused by a hotspot issue?

| username: xfworld | Original post link

Then you can search on asktug for solutions and paths to hotspot issues, and check accordingly.

| username: tidb狂热爱好者 | Original post link

Check the slow SQL, this is definitely related to slow SQL.

| username: zhimadi | Original post link

Okay. Should we refer to these two official documents?

We migrated from MySQL, and all table primary keys are of integer type, using AUTO_INCREMENT. Does this have a significant impact?

| username: tidb狂热爱好者 | Original post link

| username: forever | Original post link

Check the flame graph for any hotspots.

| username: tidb狂热爱好者 | Original post link

  1. Check the slow SQL.
  2. Reject the execution of all slow SQL.
  3. Upgrade the configuration.
set @@global.MAX_EXECUTION_TIME=10000
| username: zhimadi | Original post link

Yes. When the incident occurred last Sunday, I immediately checked the slow SQL, added indexes to the suspected queries, and made other optimizations. Then I doubled the CPU and memory for TiDB. MAX_EXECUTION_TIME is set at the session level in the application because analyze and adding indexes take more than 10 seconds. However, after observing for the past few days, it seems that the problem still exists and might be triggered again during the next peak period.

| username: xfworld | Original post link

It is related. If you have high-performance requirements, it is recommended to consider using clustered indexes.

Clustered indexes are a solution that came later, and the version you are currently using may not support it yet, but you can consider simulating it.

  1. Single primary key column, and the type should be bigint.
  2. The primary key should be of the auto_random type.

If it is inconvenient to modify, you can scatter the current regions to reduce hotspot issues.

If the data volume is very large, it is still recommended to use clustered indexes… save some brain cells…

| username: wakaka | Original post link

Are there still many slow SQL queries after optimization? I see your CPU load and IO utilization are both very high.

| username: tidb狂热爱好者 | Original post link

Causing system instability, a single slow SQL can easily trigger an avalanche event. Recently, one of the TiKV nodes had a CPU usage of over 90%, causing system lag for users and resulting in a lot of dirty data. Attempting to restart a single TiKV node was ineffective. The entire cluster had to be restarted to resume production.

Your method is incorrect and can easily lead to accidents.

| username: zhimadi | Original post link

Added one KV, now there are 4. The ioutil has decreased and is no longer hitting 100% from time to time like before.

| username: zhimadi | Original post link

How to scatter regions; how to use clustered indexes? Is there a tutorial? Thank you.

| username: xiaohetao | Original post link

Here are the instructions and operation examples: Split Region 使用文档 | PingCAP 文档中心

| username: xiaohetao | Original post link

MAX_EXECUTION_TIME Just looked at the description of this parameter, it applies to all types of SQL, which can be very dangerous if used unreasonably.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.