High CPU Load on TiKV After Compaction

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: compact后tikv的cpu负载过高

| username: ks_ops_ms

[TiDB Usage Environment] Production Environment
[TiDB Version] 6.4.0
[Reproduction Path] Manually running compact after deleting a large amount of data from a table
[Encountered Problem] After the manual compact, some disk space was indeed freed, but the overall system load stayed around 85 for a long time, causing slow responses and affecting the business. After manually stopping it, we found that the compact command had become a zombie process, and the CPU load did not change noticeably.
[Resource Configuration]


tikv.log (6.8 MB)


[Attachments: Screenshots/Logs/Monitoring]

| username: h5n1 | Original post link

The compaction hasn’t stopped yet; it hasn’t finished running.

| username: 像风一样的男子 | Original post link

Have all the TiKV nodes lost connection?

| username: h5n1 | Original post link

Try lowering the rocksdb.rate-bytes-per-sec parameter.
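
For example, it could be lowered online with tikv-ctl (a sketch; the address and value below are placeholders to adjust for your cluster):

./tikv-ctl --host <tikv-ip>:20160 modify-tikv-config -n rocksdb.rate-bytes-per-sec -v "100MB"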

| username: tidb菜鸟一只 | Original post link

Did you run compact directly on all the TiKV nodes?

| username: redgame | Original post link

It will pass soon; it won't stay like this.

| username: ks_ops_ms | Original post link

I ran a compact on each TiKV node. Here are the commands executed:

nohup ./tikv-ctl --host xxx compact -d kv -c write &
nohup ./tikv-ctl --host xxx compact -d kv -c default &
nohup ./tikv-ctl --host xxx compact -d kv -c lock --bottommost force &
nohup ./tikv-ctl --host xxx compact -d kv -c write --bottommost force &

| username: ks_ops_ms | Original post link

As of now, since 2 PM yesterday, two of the TiKV nodes still have CPU usage above eighty percent. However, one of them can already be seen on the dashboard again.

| username: ks_ops_ms | Original post link

One has already recovered, but there are still two that cannot be seen on the dashboard.

| username: tidb菜鸟一只 | Original post link

Compacting TiKV data consumes a lot of IO and CPU. It is recommended to do it one node at a time and during off-peak business hours, and to set the --threads parameter with the impact on the business in mind. Doing all the nodes at once, starting at 2 PM, is likely to cause problems, which are probably still being worked through right now…
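
A more conservative invocation might look like the following (a sketch; the host is a placeholder, the --threads value should be tuned for your hardware, and each node and column family is compacted one at a time):

./tikv-ctl --host <tikv-ip>:20160 compact -d kv -c write --threads 2 --bottommost force
./tikv-ctl --host <tikv-ip>:20160 compact -d kv -c default --threads 2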

| username: ks_ops_ms | Original post link

This compact originally started at midnight and was still not finished by 2 PM the next day. The CPU usage had been too high for too long, so I had to kill it manually. However, it didn't seem to stop, and the CPU load was still very high. I looked through the documentation and other posts in the forum but didn't find a thorough solution to a similar problem.

| username: h5n1 | Original post link

(The original reply contained only an image, which could not be translated.)

| username: ks_ops_ms | Original post link

I just checked this value in TiDB.

| username: ks_ops_ms | Original post link

What is the effect of this parameter?
In the end, because the CPU usage was extremely high and the zombie processes kept triggering alerts, we restarted the one affected TiKV node, and it has now recovered.

| username: tidb菜鸟一只 | Original post link

Once a tikv-ctl compact has started, killing the command won't stop it; the compaction runs inside TiKV itself, so you have to restart TiKV to stop it.
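
If the cluster is managed with TiUP, restarting just the affected node might look like this (a sketch; the cluster name and node address are placeholders):

tiup cluster restart <cluster-name> -N <tikv-ip>:20160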

| username: ks_ops_ms | Original post link

Yes, in the end, a single TiKV node was restarted, and after the restart, the system gradually returned to normal.

| username: h5n1 | Original post link

It's a rate limit: it caps the disk read/write bandwidth that RocksDB compaction can use.

| username: ks_ops_ms | Original post link

Since the system has now recovered, it is hard to reproduce the situation at the time, so the effect of this parameter on compact is still not clear to me. If I need to delete a large amount of data again, I will try it. One more question: does this parameter require a restart to take effect after it is set?

| username: h5n1 | Original post link

It can be set online; no restart is needed.
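
If you also want the value to persist in the cluster topology, one way (a sketch assuming a TiUP deployment; the cluster name is a placeholder) is to write it into the cluster config as well:

tiup cluster edit-config <cluster-name>
# under server_configs -> tikv, add:  rocksdb.rate-bytes-per-sec: "100MB"
tiup cluster reload <cluster-name> -R tikv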

| username: ks_ops_ms | Original post link

Understood :saluting_face:, thank you :pray: