Issue of Data Update Failure in TiDB Cluster

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 关于TIDB集群更新数据失败问题

| username: TiDBer_mZMtuKD6

[TiDB Usage Environment] Testing
[TiDB Version] V6.1
[Reproduction Path] Our scenario involves two processes. The first issues an INSERT, publishes a message to the downstream MQ, and then exits. The second process consumes the message, continues the business logic, and updates the record (e.g., sets type=4). Single-record runs never fail, but during batch tasks some updates fail intermittently.
[Encountered Problem: Problem Phenomenon and Impact]
[Resource Configuration]

[Attachment: Screenshot/Log/Monitoring]
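For context, the producer/consumer flow described above can be sketched as a minimal model (assumed code, not the OP's; `mq` and `db` stand in for the real message queue and task table). The key property is that the INSERT must be committed before the MQ publish, so the consumer's UPDATE can see the row:

```python
import queue

mq = queue.Queue()          # stands in for the real message queue
db = {}                     # stands in for the task table (pk -> type)

def process_one(task_id):
    """First process: INSERT the row, then publish to MQ and exit."""
    db[task_id] = 0         # INSERT ... -- must be committed at this point
    mq.put(task_id)         # publish only after the commit succeeds

def process_two():
    """Second process: consume the message, then UPDATE the record."""
    task_id = mq.get()
    if task_id in db:       # the row is visible only if the INSERT committed
        db[task_id] = 4     # UPDATE ... SET type = 4
    return task_id

process_one(1)
tid = process_two()
print(db[tid])              # 4
```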

| username: TiDBer_mZMtuKD6 | Original post link

I hope everyone can help analyze this~

| username: 像风一样的男子 | Original post link

Do you see the update in the slow query log?

| username: Kongdom | Original post link

What is the behavior? Does the update statement report an error, or is the update value incorrect?

| username: zhanggame1 | Original post link

There is a probability of update failure. What is the specific failure result?

| username: Fly-bird | Original post link

Is the update being executed repeatedly?

| username: TiDBer_mZMtuKD6 | Original post link

Our code logs an affected-row count of 1, but when we check the database, the row is unchanged. Immediately after the update, re-reading the data from the code returns the updated taskData, but it is visible only within that process. Checking with a DB tool shows the row unchanged, and reading by primary key through another interface also returns the old data. The MQ message is consumed exactly once, so the update shouldn't be running repeatedly.

| username: TiDBer_mZMtuKD6 | Original post link

I checked, and there is no such slow query log.

| username: 有猫万事足 | Original post link

This looks like a typical transaction-control issue.

Pay close attention to how your code controls transactions: is some framework with a configured transaction manager opening a transaction on your behalf? In particular, during the update it looks like a non-auto-commit transaction is being opened but never committed.
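The uncommitted-transaction theory can be illustrated with a toy SQLite session (assumed code, not the OP's setup; TiDB behaves analogously for uncommitted transactions): the driver reports one affected row, and the change is visible in the writing session, yet other sessions still see the old data until the commit lands.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")

# Python's sqlite3 opens an implicit transaction before DML by default.
writer = sqlite3.connect(path)
writer.execute("CREATE TABLE task (id INTEGER PRIMARY KEY, type INTEGER)")
writer.execute("INSERT INTO task (id, type) VALUES (1, 0)")
writer.commit()

cur = writer.execute("UPDATE task SET type = 4 WHERE id = 1")
affected = cur.rowcount                     # 1: the driver reports success

# A separate session (like a DB tool) still sees the old value.
reader = sqlite3.connect(path)
before_commit = reader.execute(
    "SELECT type FROM task WHERE id = 1").fetchone()[0]
reader.close()

writer.commit()                             # only now is the change visible
reader = sqlite3.connect(path)
after_commit = reader.execute(
    "SELECT type FROM task WHERE id = 1").fetchone()[0]

print(affected, before_commit, after_commit)   # 1 0 4
```

This matches the reported symptoms exactly: affected rows = 1, updated data readable inside the current process, old data everywhere else.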

| username: TiDBer_mZMtuKD6 | Original post link

In our first tests, adding a 500 ms sleep before executing the update made it succeed 100% of the time. If this were purely a transaction-control issue, a 500 ms gap between the insert and the update shouldn't make any difference.

| username: TiDBer_mZMtuKD6 | Original post link

In another test scenario with 10 such tasks, some updates succeed and some fail.

| username: 有猫万事足 | Original post link

Since it’s so easy to reproduce, I strongly suggest you post the code directly. I’ll give it a try as well.

| username: andone | Original post link

Could it be blocked, or waiting on a lock?

| username: 像风一样的男子 | Original post link

Did you execute the update via the console or through code?

| username: Kongdom | Original post link

The probability of this occurring is quite high. I suggest posting the code statements so that everyone can try to reproduce it.

| username: zhanggame1 | Original post link

The number of affected rows returned is always 1. Does that mean there was no commit?

| username: 我是咖啡哥 | Original post link

Lock conflicts during batch processing caused some transactions to roll back.
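If conflict-induced rollbacks are indeed the cause, the usual remedy is an application-level retry loop. Below is a hedged sketch with a simulated conflict (`WriteConflict` and `update_task` are made-up names for illustration, not a TiDB API):

```python
class WriteConflict(Exception):
    """Simulates a transaction aborted by a write conflict."""

def update_task(task_id, attempt):
    # Simulate: the first attempt hits a conflicting concurrent write.
    if attempt == 0:
        raise WriteConflict(f"write conflict on task {task_id}")
    return 1                            # affected rows

def update_with_retry(task_id, max_retries=3):
    """Retry the whole transaction when it aborts on a conflict."""
    for attempt in range(max_retries):
        try:
            return update_task(task_id, attempt)
        except WriteConflict:
            continue                    # re-read current state, retry
    raise RuntimeError("gave up after retries")

print(update_with_retry(1))             # 1
```

The important part is retrying the whole transaction (including any reads it depends on), not just re-issuing the final UPDATE statement.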