Error 1205 Reported After Upgrading TiDB Cluster

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIDB集群升级后报错1205

| username: jaybing926

[TiDB Usage Environment] Production Environment
[TiDB Version]
v5.4.3 → v7.1.5
[Encountered Problem: Phenomenon and Impact]
After upgrading the cluster, data writes hit a large number of 1205 errors: OperationalError: (1205, u'Lock wait timeout exceeded; try restarting transaction'). Overall write throughput also dropped noticeably compared to before the upgrade.
Business scenario: Python consumes Kafka data and writes it to TiDB. Currently, there are 24 consumer processes.
It seems that errors do not appear when only a few consumer processes are running, but they do once more are started; I'm not entirely sure whether that is really the pattern.
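
For context, the consumer-side write path looks roughly like the sketch below. This is only a minimal illustration, assuming the pymysql driver and a hypothetical events table (not the actual production code); it shows where the 1205 lock wait timeout surfaces and how the transaction can be retried on the client:

```python
# Minimal sketch of the consumer-side write path (assumptions: pymysql driver,
# hypothetical `events` table and columns; not the actual production code).
import time
import pymysql

conn = pymysql.connect(host="tidb-host", port=4000, user="app", password="***",
                       database="app_db", autocommit=False)

def write_batch(rows, max_retries=3):
    """Insert one Kafka batch; retry when TiDB returns 1205 (lock wait timeout)."""
    for attempt in range(max_retries):
        try:
            with conn.cursor() as cur:
                cur.executemany(
                    "INSERT INTO events (id, payload) VALUES (%s, %s) "
                    "ON DUPLICATE KEY UPDATE payload = VALUES(payload)",
                    rows,
                )
            conn.commit()
            return
        except pymysql.err.OperationalError as exc:
            conn.rollback()
            # 1205: another transaction held the row lock longer than the lock
            # wait timeout; back off and retry the whole transaction.
            if exc.args[0] == 1205 and attempt < max_retries - 1:
                time.sleep(0.5 * (attempt + 1))
                continue
            raise
```

When many consumer processes upsert overlapping keys concurrently, they queue on the same row locks, which is exactly the situation that produces 1205 once the wait exceeds the timeout.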
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]



---- Update on 20240528 ----
The code itself did have issues. After the developers optimized it (using Kafka's partitioning mechanism), the 1205 conflict errors no longer occurred, so the problem is solved at the root.
What is still not understood is why the same code with the same database configuration never hit this issue on the old version, yet after the upgrade it appeared and could not be avoided by adjusting configuration.
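
Presumably "optimizing with Kafka's partitioning mechanism" means keying the messages so that every update for a given row always lands in the same partition, and is therefore written by a single consumer process, which removes the cross-process row conflicts entirely. A minimal sketch of that idea, assuming kafka-python and a hypothetical id field as the key (not the actual fix):

```python
# Sketch of keyed partitioning (assumptions: kafka-python, a hypothetical
# record "id" used as the message key; not the actual change from this thread).
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka:9092"],
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish(record):
    # Same id -> same partition -> same consumer, so two consumer processes
    # never contend for the same row lock, regardless of how many are running.
    producer.send("events", key=str(record["id"]), value=record)
```

With keyed partitioning, each consumer owns a disjoint set of keys, so the number of consumer processes no longer affects row-lock contention.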

| username: WalterWj | Original post link

It might not be related to the TiDB upgrade itself. The code may simply not handle this kind of scenario around a database restart: when the exception happens, the backlog of writes arrives all at once, and that concentration of writes can easily produce visible conflicts. It's not impossible…
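
If this happens again, one way to confirm whether writes really are piling up on the same rows is to look at the lock waits directly while the errors are occurring. A sketch, assuming pymysql and a TiDB version recent enough to expose information_schema.DATA_LOCK_WAITS (exact columns may vary by version):

```python
# Sketch: inspect which transactions are blocking which (assumptions: pymysql,
# and that information_schema.DATA_LOCK_WAITS is available in this TiDB version).
import pymysql

conn = pymysql.connect(host="tidb-host", port=4000, user="monitor", password="***")
with conn.cursor() as cur:
    cur.execute(
        "SELECT TRX_ID, CURRENT_HOLDING_TRX_ID, SQL_DIGEST_TEXT "
        "FROM information_schema.DATA_LOCK_WAITS"
    )
    for waiting_trx, holding_trx, sql_text in cur.fetchall():
        # Each row is one pessimistic lock wait: waiting_trx is blocked by holding_trx.
        print(f"txn {waiting_trx} waits on txn {holding_trx}: {sql_text}")
```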

| username: Kongdom | Original post link

:thinking: Could it be that the mechanism of the new version is different from the old version?

| username: jaybing926 | Original post link

From the code's perspective, restarting the database is the same as restarting the service, and we restart TiDB fairly often anyway. So this still looks like a difference between TiDB versions rather than a code problem.

| username: jaybing926 | Original post link

Yes, what mechanism is different?
The original setup had no problems; with the same code and the same database configuration, the issue only appeared after the upgrade. :innocent:

| username: Kongdom | Original post link

:yum: @WalterWJ Junjun, could you take a look and see whether there's some mechanism difference here?

| username: WalterWj | Original post link

Honestly, this kind of thing needs a careful look at the monitoring. I took a look, but it's not something that can be figured out quickly. It's frustrating. :face_holding_back_tears:

May I ask whether the issue is resolved now? For example, can the consumer processes keep up with consumption…

| username: jaybing926 | Original post link

Well, it’s already resolved. After optimizing the code, there are no more errors.
Actually, as of yesterday just two consumer processes were already able to keep up.
It’s just a bit puzzling why this issue occurred.

| username: WalterWj | Original post link

Let’s leave it at that for now. Later, I’ll write an article with a performance analysis before and after the upgrade.