Deadlock found when trying to get lock

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: Deadlock found when trying to get lock

| username: Hacker_e70xHNkP

TiDB version 5.1. There is a table A with fields such as user id (int), date (date), game version id (int), batch number (varchar), amount (decimal), etc. The table has two ordinary composite indexes and no unique index: user id + date, and date + game version id. Table A is queried, inserted into, and updated (never deleted from). In the Flink streaming job, records are grouped by user id so that the same user id always goes to the same slot (a record is unique by user id + date + game version id + batch number), and transaction_isolation is set to 'READ-COMMITTED' (the default is REPEATABLE READ). Even so, deadlock errors ("Deadlock found when trying to get lock") occur frequently, and I don't understand the cause or how to solve it.
PS: The business side already detects deadlocks and retries, so the business itself is not affected, but I want to understand the root cause and the right way to handle it.
PS: Within a transaction there are 2 statements per record: one insert-if-not-exists and one update. A transaction is committed every 100 records (200 statements).
insert into A select xx from dual where not exists (select 1 from A where user_id=xx and date=xx and game_version_id=xx and batch_number=xx);
update A set amount=amount+xx where user_id=xx and date=xx and game_version_id=xx and batch_number=xx;
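
Put together, one committed batch in the job looks roughly like this (a sketch assembled from the description above; the xx placeholders are the original post's, not real values):

BEGIN;
-- record 1 of 100: insert the row only if it does not exist yet ...
insert into A select xx from dual
  where not exists (select 1 from A where user_id=xx and date=xx and game_version_id=xx and batch_number=xx);
-- ... then accumulate the amount on that row
update A set amount=amount+xx
  where user_id=xx and date=xx and game_version_id=xx and batch_number=xx;
-- ... the same pair of statements repeats for records 2..100 ...
COMMIT;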

| username: xfworld | Original post link

Deadlocks arise easily when write-write conflicts and read-write conflicts become entangled.

It is recommended to use pessimistic transactions.
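
If the cluster is still in optimistic mode, it can be switched globally or per transaction (standard TiDB syntax; choose the scope that fits):

SET GLOBAL tidb_txn_mode = 'pessimistic';  -- takes effect for new sessions
BEGIN PESSIMISTIC;                         -- or force pessimistic mode for a single transaction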

| username: Hacker_e70xHNkP | Original post link

The default is pessimistic transactions.
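
To double-check what a session is actually running with, both can be read as ordinary system variables:

SHOW VARIABLES LIKE 'tidb_txn_mode';
SELECT @@transaction_isolation;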

| username: h5n1 | Original post link

It is speculated to be related to secondary indexes. Some versions have lock conflict issues caused by ordinary secondary indexes. You can try upgrading to the latest minor version to see if it helps.

| username: Hacker_e70xHNkP | Original post link

Current version:

mysql> select version();
+--------------------+
| version()          |
+--------------------+
| 5.7.25-TiDB-v5.4.0 |
+--------------------+
1 row in set (0.00 sec)

| username: jansu-dev | Original post link

Hi, could you please put together a simplified reproduction (minimal repro)? This is a bit difficult for me to understand :rofl:

| username: Hacker_e70xHNkP | Original post link

CREATE TABLE user (
  user_id bigint NOT NULL COMMENT 'User ID',
  reg_date DATE NOT NULL COMMENT 'Registration Date',
  game_version int(8) NOT NULL DEFAULT '0' COMMENT 'Game Version',
  batch_sn varchar(20) NOT NULL COMMENT 'Batch Number',
  amount decimal(11,2) DEFAULT 0 COMMENT 'Recharge Amount',
  …(other fields)
  create_time DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'Creation Time',
  modify_time DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'Last Modified Time',
  KEY idx_key_1 (user_id, reg_date),
  KEY idx_key_2 (reg_date, game_version),
  … (other composite indexes)
) ENGINE=INNODB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin COMMENT='User Information'
PARTITION BY RANGE ( TO_DAYS(reg_date) ) (
  PARTITION p2021 VALUES LESS THAN (738521),
  PARTITION p2022 VALUES LESS THAN (738886),
  PARTITION p2023 VALUES LESS THAN (739251),
  PARTITION p2024 VALUES LESS THAN (739617),
  PARTITION p2025 VALUES LESS THAN (739982),
  PARTITION p2026 VALUES LESS THAN (740347),
  PARTITION p2027 VALUES LESS THAN (740712),
  PARTITION p2028 VALUES LESS THAN (741078),
  PARTITION p2029 VALUES LESS THAN (741443),
  PARTITION pother VALUES LESS THAN (MAXVALUE)
)

In the table, the combination user_id + reg_date + game_version + batch_sn is unique (the table has no unique index; uniqueness is ensured by the application).

In the Flink job, the business process is as follows:
kafka source → group by user_id (group by user, ensuring the same user is in the same slot) → tidb sink

For each piece of business data, the following 2 SQL statements are executed within a transaction:
insert into user(user_id, reg_date, game_version, batch_sn) select xx, xx, xx, xx from dual where not exists(select 1 from user where user_id=xx and reg_date=xx and game_version=xx and batch_sn=xx);
update user set amount=amount+xx where user_id=xx and reg_date=xx and game_version=xx and batch_sn=xx;
Each time, 100 records are processed, corresponding to 200 SQL statements in a single transaction commit.
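
As an aside: if adding a unique index were acceptable (purely an assumption; the post deliberately keeps uniqueness in the application), the insert + update pair could be collapsed into a single atomic upsert, eliminating the window in which two sessions interleave on the same row. A sketch with made-up values:

ALTER TABLE user ADD UNIQUE KEY uk_user (user_id, reg_date, game_version, batch_sn);

INSERT INTO user (user_id, reg_date, game_version, batch_sn, amount)
VALUES (10001, '2022-03-01', 3, 'B20220301', 9.90)
ON DUPLICATE KEY UPDATE amount = amount + VALUES(amount);

Note that the unique key includes reg_date, the partition column; TiDB requires every unique index on a partitioned table to include all columns used in the partitioning expression, so this particular key is allowed.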

| username: jansu-dev | Original post link

  1. I don't know Flink well, but I read the setup as: data pipeline (Kafka) → data compute module → TiDB;
  2. The compute module can guarantee that the same user lands in the same slot, but is the sink from that slot to TiDB concurrent? For example, with 100 records per batch, the same slot might open 10 TiDB connections (a pool) to speed up writes. Records from one slot spread across several connections could then still conflict with each other, even though they came from the same slot;
  3. On the path from the compute module to the TiDB table, is the compute module the only writer? Could the deadlock involve two transactions, one from the Flink sink and one from some other synchronization stream?
  4. Does user_id=xx and reg_date=xx and game_version=xx alone guarantee uniqueness? Could omitting batch_sn cause updates to interfere with each other? It feels more like update interference here :thinking:, because TiDB reads under snapshot isolation (SI): the data's TSO has already been written and committed, and then two transactions deadlock on the same row;
  5. When a deadlock occurs, do the logs contain information about the deadlocked key? That is a clue worth following; see also the queries sketched after this list;
  6. Common deadlock scenarios are illustrated in an image in the original post (not reproduced here).
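
For point 5: besides the TiDB logs, v5.1 and later also keep recent deadlock history in memory, which is usually the fastest way to see which key and which statements were involved (these INFORMATION_SCHEMA tables exist in TiDB ≥ v5.1):

SELECT * FROM INFORMATION_SCHEMA.DEADLOCKS;          -- deadlocks seen by this TiDB instance
SELECT * FROM INFORMATION_SCHEMA.CLUSTER_DEADLOCKS;  -- the same history, cluster-wide
-- the KEY and CURRENT_SQL_DIGEST columns identify the conflicting row and statements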