Update operation is too slow, causing high TiKV CPU usage

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: update 操作执行过慢导致tikv cpu使用率超高

| username: TiDBer_E3pRgGAy

[TiDB Usage Environment] Production Environment
[TiDB Version] 6.5.5
[Reproduction Path] Executing a large batch of update statements.
[Encountered Issue: Issue Phenomenon and Impact] When a large number of update operations run, the execution plan shown in the dashboard differs from the one shown in the SQL editor; overall TiKV CPU usage becomes very high and the cluster lags.
[Resource Configuration] Three independent 16-core 64 GB cloud servers for TiKV.
[Attachments: Screenshots/Logs/Monitoring]



| username: 扬仔_tidb | Original post link

The execution plan seems incorrect. Check whether “stats:pseudo” appears anywhere in the full plan output.
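A quick way to check this, sketched with a hypothetical table and predicate (substitute your actual UPDATE; note that EXPLAIN ANALYZE really executes the statement):

```sql
-- Hypothetical table/column names for illustration only.
EXPLAIN ANALYZE
UPDATE task_result
SET status = 2
WHERE task_id = 123 AND execution_id = 456;

-- If any operator's "operator info" column contains "stats:pseudo",
-- the optimizer is falling back to guessed statistics; refresh them:
ANALYZE TABLE task_result;
```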

| username: TiDBer_E3pRgGAy | Original post link

I did not see that keyword.

| username: TiDB_C罗 | Original post link

Check the execution time of slow queries in the TiDB Dashboard and see where the time is actually being spent.
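If the dashboard is inconvenient, the same information can be pulled from TiDB's slow query system table; a sketch (the exact column set may vary slightly by version):

```sql
-- Top recent slow queries with a rough time breakdown.
SELECT query_time, process_time, wait_time, query
FROM information_schema.slow_query
ORDER BY query_time DESC
LIMIT 10;
```

A high process_time relative to query_time points at CPU work inside the TiKV coprocessors rather than queueing or waiting.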

| username: TiDBer_E3pRgGAy | Original post link

| username: Fly-bird | Original post link

Is there an issue with the SQL statement? I suggest checking the system resources.

| username: tidb菜鸟一只 | Original post link

It looks like TiKV is too busy. Check the resource status of TiKV.

| username: 像风一样的男子 | Original post link

Try changing the update to a select and see whether it runs faster.
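A minimal sketch of this comparison, assuming a hypothetical table and a predicate shaped like the one discussed in the thread:

```sql
-- The slow statement (hypothetical names):
-- UPDATE task_result SET status = 2
-- WHERE task_id = 123 AND execution_id = 456;

-- The equivalent read: same predicate, no write path,
-- so the scan/index access cost can be compared directly.
EXPLAIN ANALYZE
SELECT *
FROM task_result
WHERE task_id = 123 AND execution_id = 456;
```

If the select is just as slow, the cost is in the scan or index access, not in the write (prewrite/commit) phase.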

| username: 大飞哥online | Original post link

Check the monitoring, the resource status of TiKV, and the coprocessor CPU monitoring.

| username: h5n1 | Original post link

Using the taskid index alone is not optimal; execution_id has better selectivity, or a composite index on both would be best, right?

| username: 有猫万事足 | Original post link

The coprocessor resources are insufficient. This matches the phenomenon of the CPU running at full capacity, so it’s time to add CPU resources.

If you don’t add resources, you can only check whether the CPU load of all TiKV instances is balanced.
Additionally, there is indeed a bit of an issue with the index. By the task ID it matches 1,370 rows, but after filtering on the other conditions it turns out no rows, or only 1-2 rows, actually need to be modified.

You can see that both prewrite and commit are very fast, totaling just over 30ms. There shouldn’t be much pressure on the write I/O; it’s all on the CPU.

| username: 路在何chu | Original post link

What are the machine configuration and I/O situation? Check the Thread CPU panels under TiKV-Details to see whether the thread pools are sufficient.

| username: 路在何chu | Original post link

Also look at the load average in top; is the machine keeping up?

| username: TiDBer_E3pRgGAy | Original post link

The TiKV machines are three independent 16-core 64 GB cloud servers. Every time the business executes a large batch of this update statement, CPU usage on almost all three TiKV machines soars above 80%. Setting a composite index on task_id + execution_uid has now solved the problem. There is still one very strange point, though: with exactly the same database configuration and exactly the same code, none of these problems occurred on version 6.6.0, even though the concurrency on 6.6.0 was much higher than the current business concurrency. Because many columns in the database store large JSON objects, I am also hesitant to upgrade to version 6.6.0.
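The fix described above would look roughly like this (the table name is hypothetical; only the column names task_id and execution_uid come from the thread):

```sql
-- Composite index so the predicate on both columns can be satisfied
-- by one narrow index lookup instead of a wide index scan + filter.
ALTER TABLE task_result
  ADD INDEX idx_task_exec (task_id, execution_uid);
```

Column order matters: leading with the column that queries also filter on alone keeps the index usable for those queries as well.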

| username: 像风一样的男子 | Original post link

6.6 is a DMR (development milestone release), not an official LTS release, and is not recommended for use in a production environment.

| username: TiDBer_E3pRgGAy | Original post link

The CPU is indeed not quite sufficient.

| username: TiDBer_E3pRgGAy | Original post link

However, now on version 6.5.5 many SQL issues have appeared, all of them slow executions causing CPU spikes. These problems never occurred on version 6.6.0. :sob:

| username: 像风一样的男子 | Original post link

If the SQL has issues, fix the SQL. Version 6.5 is still very stable.

| username: TiDBer_E3pRgGAy | Original post link

The table has around 13 million rows, and the video_mention_task column has an index. These updates also run in large concurrent batches, which keeps this version very busy. Moreover, the number of rows scanned in the execution plan also looks wrong.

| username: 像风一样的男子 | Original post link

Analyze this table regularly. The same SQL is producing different execution plans in the two clusters.
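A sketch of the suggested maintenance (the table name is hypothetical):

```sql
-- Refresh statistics so the optimizer plans from current data.
ANALYZE TABLE task_result;

-- Check how healthy the statistics are afterwards (0-100).
SHOW STATS_HEALTHY WHERE Table_name = 'task_result';
```

If the health score drops quickly between runs, consider tuning TiDB's auto-analyze thresholds instead of relying on manual runs.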