Note:
This topic has been translated from a Chinese forum by GPT and might contain errors. Original topic: High TiKV CPU load caused by SELECT FOR UPDATE

[TiDB Usage Environment] Production Environment
[TiDB Version] v5.4.3
[Reproduction Path]
In the business code, a transaction is started: it first runs SELECT ... FOR UPDATE to fetch a specific row, then updates that row.
Hundreds of such transactions execute concurrently.
The table definition and the locking query are shown below, followed by a sketch of the full transaction.
CREATE TABLE `tb_task` (
`task_id` varchar(64) NOT NULL COMMENT 'Task ID',
`type` tinyint(3) unsigned NOT NULL COMMENT 'Task Type',
`status` tinyint(3) unsigned NOT NULL COMMENT 'Task Status',
`trigger_type` tinyint(3) unsigned NOT NULL COMMENT 'Task Trigger Type online/offline',
`size` bigint(20) NOT NULL COMMENT 'Size',
`zos_file` text NOT NULL COMMENT 'Template Download Address',
`params` text NOT NULL COMMENT 'Request Content',
`expect_time` datetime NOT NULL COMMENT 'Expected Conversion Time',
`start_time` datetime NOT NULL COMMENT 'Conversion Start Time',
`end_time` datetime NOT NULL COMMENT 'Conversion End Time',
`create_time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'Insertion Time',
`update_time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'Update Time',
PRIMARY KEY (`task_id`) /*T![clustered_index] NONCLUSTERED */,
KEY `idx_status` (`status`),
KEY `idx_trigger_status` (`trigger_type`,`expect_time`,`status`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin /*T! SHARD_ROW_ID_BITS=4 */;
SELECT * FROM `tb_task` WHERE status IN (0) AND `trigger_type` = 2 ORDER BY expect_time asc LIMIT 1 FOR UPDATE;
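For context, a minimal sketch of what each concurrent worker does. The follow-up write is not shown in the original post, so the column values in the UPDATE below are illustrative assumptions, not the actual business statement:
BEGIN;
-- Lock the oldest matching task row (same query as above).
SELECT * FROM `tb_task` WHERE status IN (0) AND `trigger_type` = 2 ORDER BY expect_time asc LIMIT 1 FOR UPDATE;
-- Illustrative follow-up write: mark the locked task as picked up (assumed status value).
UPDATE `tb_task` SET status = 1, start_time = NOW() WHERE task_id = ?; -- task_id returned by the SELECT above
COMMIT;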
[Encountered Problem]
One particular TiKV node has a very high CPU load.
[Resource Configuration]
96 CPU cores, 384 GB memory
[Attachments: Screenshots/Logs/Monitoring]
I changed the retry limit for write-write conflicts to 1 by setting pessimistic-txn.max-retry-count = 1.
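This parameter lives in the TiDB server configuration file (tidb.toml); a minimal sketch of the relevant section, assuming the default file layout, is:
[pessimistic-txn]
# Maximum number of retries for a single statement after a pessimistic lock (write-write) conflict.
max-retry-count = 1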
tidb.log
[2022/12/27 16:13:58.094 +08:00] [INFO] [conn.go:1121] ["command dispatched failed"] [conn=7231] [connInfo="id:7231, addr:10.96.13.165:34360 status:11, collation:utf8mb4_general_ci, user:officeser_app"] [command=Execute] [status="inTxn:1, autocommit:1"] [sql="SELECT * FROM `tb_task` WHERE status IN (?) AND `trigger_type` = ? ORDER BY expect_time asc LIMIT 1 FOR UPDATE [arguments: (0, 2)]"] [txn_mode=PESSIMISTIC] [err="pessimistic lock retry limit reached
tikv.log
[2022/12/27 16:53:12.092 +08:00] [WARN] [endpoint.rs:606] [error-response] [err="Key is locked (will clean up) primary_lock: 7480000000000000E95F7280000000000B4D4B lock_version: 438339158937698344 key: 7480000000000000E95F6980000000000000030400000000000000020419AEB55E670000000400000000000000000380000000000B4D4B lock_ttl: 3719 txn_size: 2 lock_type: Del lock_for_update_ts: 438339159121461437 min_commit_ts: 438339159121461438"]
Questions:
- Why does this operation cause such a high CPU load on the TiKV node?
- If it is due to write-write conflict retries, the related metrics do not seem to match the description in the article.
- I see an unusually high volume of index operations on the TiKV dashboard. How should I interpret this metric? Can these index operations be mapped to any of the steps in the diagram below?