High TiKV CPU Load Caused by SELECT FOR UPDATE

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: select for update 导致的tikv cpu load 高

| username: 是我的海

[TiDB Usage Environment] Production Environment
[TiDB Version] v5.4.3
[Reproduction Path]
In the business code, a transaction is started:
it first reads the target row with select for update and then updates the data (the flow is sketched after the query below).
Hundreds of these transactions execute concurrently.

CREATE TABLE `tb_task` (
  `task_id` varchar(64) NOT NULL COMMENT 'Task ID',
  `type` tinyint(3) unsigned NOT NULL COMMENT 'Task Type',
  `status` tinyint(3) unsigned NOT NULL COMMENT 'Task Status',
  `trigger_type` tinyint(3) unsigned NOT NULL COMMENT 'Task Trigger Type online/offline',
  `size` bigint(20) NOT NULL COMMENT 'Size',
  `zos_file` text NOT NULL COMMENT 'Template Download Address',
  `params` text NOT NULL COMMENT 'Request Content',
  `expect_time` datetime NOT NULL COMMENT 'Expected Conversion Time',
  `start_time` datetime NOT NULL COMMENT 'Conversion Start Time',
  `end_time` datetime NOT NULL COMMENT 'Conversion End Time',
  `create_time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'Insertion Time',
  `update_time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'Update Time',
  PRIMARY KEY (`task_id`) /*T![clustered_index] NONCLUSTERED */,
  KEY `idx_status` (`status`),
  KEY `idx_trigger_status` (`trigger_type`,`expect_time`,`status`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin /*T! SHARD_ROW_ID_BITS=4 */

SELECT * FROM `tb_task` WHERE status IN (0) AND `trigger_type` = 2 ORDER BY expect_time asc LIMIT 1 FOR UPDATE;
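
A minimal sketch of the transaction flow described above; the UPDATE is illustrative, since the actual write statement is not shown in the post:

BEGIN;
-- The pessimistic lock on the chosen row (and the index entries it touches) is taken here.
SELECT * FROM `tb_task` WHERE status IN (0) AND `trigger_type` = 2 ORDER BY expect_time asc LIMIT 1 FOR UPDATE;
-- Illustrative follow-up write; the real statement and status value may differ.
UPDATE `tb_task` SET `status` = 1 WHERE `task_id` = ?;
COMMIT;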

[Encountered Problem]
A certain TiKV node has a very high CPU load.

[Resource Configuration]
384 GB memory, 96 CPU cores
[Attachments: Screenshots/Logs/Monitoring]

I changed the retry limit for write-write conflicts to 1: pessimistic-txn.max-retry-count = 1
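
If you want to confirm the value actually in effect on each TiDB instance, SHOW CONFIG can be used to check it (the dotted name below is an assumption about how the item appears in the output):

SHOW CONFIG WHERE type = 'tidb' AND name = 'pessimistic-txn.max-retry-count';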
tidb.log

[2022/12/27 16:13:58.094 +08:00] [INFO] [conn.go:1121] ["command dispatched failed"] [conn=7231] [connInfo="id:7231, addr:10.96.13.165:34360 status:11, collation:utf8mb4_general_ci, user:officeser_app"] [command=Execute] [status="inTxn:1, autocommit:1"] [sql="SELECT * FROM `tb_task` WHERE status IN (?) AND `trigger_type` = ? ORDER BY expect_time asc LIMIT 1 FOR UPDATE [arguments: (0, 2)]"] [txn_mode=PESSIMISTIC] [err="pessimistic lock retry limit reached

tikv.log

[2022/12/27 16:53:12.092 +08:00] [WARN] [endpoint.rs:606] [error-response] [err="Key is locked (will clean up) primary_lock: 7480000000000000E95F7280000000000B4D4B lock_version: 438339158937698344 key: 7480000000000000E95F6980000000000000030400000000000000020419AEB55E670000000400000000000000000380000000000B4D4B lock_ttl: 3719 txn_size: 2 lock_type: Del lock_for_update_ts: 438339159121461437 min_commit_ts: 438339159121461438"]
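
As an aside relevant to question 3 below, the locked key in the TiKV log (the 5F69 segment marks an index key, while the primary_lock has the 5F72 row prefix) can be decoded with the built-in TIDB_DECODE_KEY function to see which table and index it refers to:

SELECT tidb_decode_key('7480000000000000E95F6980000000000000030400000000000000020419AEB55E670000000400000000000000000380000000000B4D4B');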

Questions:

  1. Why does this operation cause such a high CPU load on the TiKV node?

  2. If it is due to write-write conflict retries, the related metrics do not seem to match the description in the article.

  3. I see an unusually high index operation volume on the TiKV dashboard. How should I interpret this metric?

Can the above index operations correspond to any of the steps in the diagram below?

| username: xfworld | Original post link

There are two layers of issues here:

  1. Is there a data hotspot? And does the index design also create an index hotspot?

  2. “for update” essentially takes a lock to fetch the data. Other concurrent operations on the same rows during that window will block or fail, so check whether you have this kind of contention.

PS: The hotspot issue can be observed through the traffic visualization on the dashboard…
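
Besides the traffic visualization, hot Regions can also be checked from SQL (a sketch assuming the standard INFORMATION_SCHEMA.TIDB_HOT_REGIONS columns):

-- Current hot Regions reported by PD, filtered to the table in question.
SELECT db_name, table_name, index_name, type, max_hot_degree, flow_bytes
FROM information_schema.tidb_hot_regions
WHERE table_name = 'tb_task';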

| username: 裤衩儿飞上天 | Original post link

Any select for update locks the selected rows and blocks subsequent writes to them, so under high concurrency there will inevitably be lock waits or failures. Just read the data with a plain select instead; one possible pattern is sketched below.
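
A sketch of that suggestion, assuming the task is claimed by a conditional UPDATE instead of a pessimistic lock (status values are illustrative):

-- Read a candidate task without taking a lock.
SELECT task_id FROM `tb_task` WHERE status IN (0) AND `trigger_type` = 2 ORDER BY expect_time asc LIMIT 1;
-- Claim it with a conditional update; 0 affected rows means another worker got it first, so just pick the next candidate.
UPDATE `tb_task` SET `status` = 1 WHERE `task_id` = ? AND `status` = 0;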

| username: WalterWj | Original post link

The CPU usage appears to have a hotspot. You can check out PingCAP’s articles on hotspots.

| username: ohammer | Original post link

I agree. You can check the process list to see if there are a large number of sessions waiting for locks.
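
For example, with the v5.x system tables:

-- Sessions on all TiDB instances, longest-running statements first.
SELECT instance, id, user, db, command, time, state, info
FROM information_schema.cluster_processlist
ORDER BY time DESC;

-- Ongoing pessimistic lock waits and which transaction currently holds each lock.
SELECT * FROM information_schema.data_lock_waits;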