TiCDC min resolved ts lags for a long time

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: ticdc的min resolved ts有较长的滞后

| username: TiDBer_uTApj3hn

[TiDB Usage Environment] Production environment, 4 TiKV nodes

[TiDB Version] TiKV 6.1

[Reproduction Path]
Subscribe to CDC events and, based on their content, write to TiKV through txnkv.
Run delete range operations concurrently with those writes.

[Encountered Issue: Phenomenon and Impact]
When a txnkv transaction writes to TiKV, the client reports this error:
[ERROR] [commit.go:182] ["2PC failed commit key after primary key committed"] [error="Error(Txn(Error(Mvcc(Error(TxnLockNotFound { start_ts: TimeStamp(439331133747363841), commit_ts: TimeStamp(439331134009508037), key: [0, 0, 0, 0, 0, 0, 0, 11, 0, 0, 88, 195, 0, 0, 1, 132, 128, 217, 196, 16, 151, 5, 146, 83, 118, 61, 0, 0] })))))"] [errorVerbose="Error(Txn(Error(Mvcc(Error(TxnLockNotFound { start_ts: TimeStamp(439331133747363841), commit_ts: TimeStamp(439331134009508037), key: [0, 0, 0, 0, 0, 0, 0, 11, 0, 0, 88, 195, 0, 0, 1, 132, 128, 217, 196, 16, 151, 5, 146, 83, 118, 61, 0, 0] })))))\ngithub.com/tikv/client-go/v2/error.ExtractKeyErr\n\t/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220531092439-efebaeb9fe53/error/error.go:259\ngithub.com/tikv/client-go/v2/txnkv/transaction.actionCommit.handleSingleBatch\n\t/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220531092439-efebaeb9fe53/txnkv/transaction/commit.go:171\ngithub.com/tikv/client-go/v2/txnkv/transaction.(*batchExecutor).startWorker.func1\n\t/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220531092439-efebaeb9fe53/txnkv/transaction/2pc.go:1993\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1571"] [txnStartTS=439331133747363841] [commitTS=439331134009508037] [keys="[000000000000000b00004e1900000186345e80b0fee6f0a456410000,000000000000000b000058c30000018480d9c41097059253763d0000,000000000000000b00005a1f000001863424e69897059253762c0000]"] [stack="github.com/tikv/client-go/v2/txnkv/transaction.actionCommit.handleSingleBatch\n\t/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220531092439-efebaeb9fe53/txnkv/transaction/commit.go:182\ngithub.com/tikv/client-go/v2/txnkv/transaction.(*batchExecutor).startWorker.func1\n\t/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220531092439-efebaeb9fe53/txnkv/transaction/2pc.go:1993"]
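
The start_ts and commit_ts values in this error are TSO timestamps, which pack a physical time in milliseconds into the high bits and an 18-bit logical counter into the low bits. A minimal sketch for reading them follows; the helper name `splitTSO` is made up for illustration and is not part of client-go.

```go
package main

import (
	"fmt"
	"time"
)

// logicalBits is the number of low bits a TSO reserves for its logical counter.
const logicalBits = 18

// splitTSO (hypothetical helper) splits a TSO into physical milliseconds
// since the Unix epoch and the logical counter.
func splitTSO(ts uint64) (physicalMs int64, logical uint64) {
	return int64(ts >> logicalBits), ts & (1<<logicalBits - 1)
}

func main() {
	// start_ts and commit_ts copied from the error above.
	for _, ts := range []uint64{439331133747363841, 439331134009508037} {
		ms, logical := splitTSO(ts)
		fmt.Println(ts, time.UnixMilli(ms).UTC().Format(time.RFC3339Nano), logical)
	}
}
```

For the two values in the log, the physical parts are 1000 ms apart, so the commit was attempted roughly one second after the transaction started.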

From this point on, the service stopped receiving CDC events. The monitoring showed that the min resolved ts of one machine was stuck: it kept lagging and did not advance.

Judging from the Go TiKV client code, the error “2PC failed commit key after primary key committed” looks like it could indicate a very serious bug. Is that the case?

[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]

TiCDC monitoring

The green line in the screenshot is the one that kept lagging; it later recovered on its own.

Related TiKV logs

[INFO] [commit.rs:67] ["txn conflict (lock not found)"] [commit_ts=439331134009508037] [start_ts=439331133747363841] [key=000000000000000BFF000058C300000184FF80D9C41097059253FF763D000000000000FB]

[WARN] [errors.rs:339] ["txn conflicts"] [err="Error(Txn(Error(Mvcc(Error(TxnLockNotFound { start_ts: TimeStamp(439331133747363841), commit_ts: TimeStamp(439331134009508037), key: [0, 0, 0, 0, 0, 0, 0, 11, 0, 0, 88, 195, 0, 0, 1, 132, 128, 217, 196, 16, 151, 5, 146, 83, 118, 61, 0, 0] })))))"]
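
The key in the INFO line looks different from the one in the client error because it is printed in memcomparable-encoded form: the raw key is cut into 8-byte groups, each followed by a marker byte (0xFF for a full group, 0xFF minus the number of zero padding bytes for the last one). The sketch below undoes that encoding to check that both logs refer to the same key; the function names are made up for illustration.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
)

// decodeMemcomparable (hypothetical helper) reverses the 8-byte-group encoding:
// every 9-byte chunk is 8 data bytes plus a marker, and
// marker = 0xFF - (number of padding bytes in that chunk).
func decodeMemcomparable(encoded []byte) ([]byte, error) {
	const group = 8
	var raw []byte
	for len(encoded) >= group+1 {
		chunk, marker := encoded[:group], encoded[group]
		pad := int(0xFF - marker)
		if pad > group {
			return nil, errors.New("invalid marker byte")
		}
		raw = append(raw, chunk[:group-pad]...)
		encoded = encoded[group+1:]
		if pad > 0 {
			return raw, nil // a padded group is always the last one
		}
	}
	return nil, errors.New("truncated key: no terminating group")
}

// decodeLoggedKey takes the hex string from the TiKV log and returns the raw key as hex.
func decodeLoggedKey(logged string) (string, error) {
	encoded, err := hex.DecodeString(logged)
	if err != nil {
		return "", err
	}
	raw, err := decodeMemcomparable(encoded)
	if err != nil {
		return "", err
	}
	return hex.EncodeToString(raw), nil
}

func main() {
	// Key copied from the [INFO] [commit.rs:67] line above.
	raw, err := decodeLoggedKey("000000000000000BFF000058C300000184FF80D9C41097059253FF763D000000000000FB")
	if err != nil {
		panic(err)
	}
	fmt.Println(raw)
}
```

The decoded value is the second entry in the keys list of the client error, and its bytes match the key array printed inside TxnLockNotFound.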

| username: xfworld

The primary key for the commit was not found; basically, the lock was lost…

As for why the commit timed out, you need to check your environment and configuration.

This has little to do with CDC.

| username: TiDBer_uTApj3hn

Hello, may I ask what could cause a lock to be lost?

Also, when you mention a commit timeout, do you mean that the commit phase never finishes? Doesn’t a commit that times out in TiKV fail automatically?

| username: xfworld

In theory, a commit has two possible outcomes: success or rollback.

A rollback is not necessarily related to the release of locks. Refer to the documentation:
https://docs.pingcap.com/zh/tidb/stable/garbage-collection-overview#resolve-locks清理锁

As for the issue you raised, it is hard to judge without more information about the environment.
In my experience, this kind of situation generally does not occur…

| username: TiDBer_uTApj3hn

My guess is that the transaction's primary key had already been committed while the secondary key had not. At that point, the delete range operation deleted the primary key, which left the secondary key unable to commit.
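
The sequence guessed at above can be replayed in a toy model. This is only a sketch under that hypothesis: `toyStore` and its methods are invented here, keep locks and commit records in plain maps, and do not reflect how TiKV or client-go are implemented.

```go
package main

import (
	"errors"
	"fmt"
)

// errTxnLockNotFound borrows the name of the error in the log; the value is illustrative.
var errTxnLockNotFound = errors.New("TxnLockNotFound")

type lock struct {
	startTS uint64
	primary string
}

// toyStore is a hypothetical in-memory stand-in for Percolator-style lock and write records.
type toyStore struct {
	locks  map[string]lock   // key -> pending lock
	writes map[string]uint64 // key -> commit_ts of the committed version
}

func (s *toyStore) prewrite(keys []string, primary string, startTS uint64) {
	for _, k := range keys {
		s.locks[k] = lock{startTS: startTS, primary: primary}
	}
}

func (s *toyStore) commit(key string, startTS, commitTS uint64) error {
	l, ok := s.locks[key]
	if !ok || l.startTS != startTS {
		if s.writes[key] == commitTS {
			return nil // already committed, for example by lock resolution
		}
		return errTxnLockNotFound
	}
	delete(s.locks, key)
	s.writes[key] = commitTS
	return nil
}

// deleteRange drops every lock and commit record with start <= key < end.
func (s *toyStore) deleteRange(start, end string) {
	for k := range s.locks {
		if k >= start && k < end {
			delete(s.locks, k)
		}
	}
	for k := range s.writes {
		if k >= start && k < end {
			delete(s.writes, k)
		}
	}
}

// resolveLock decides the fate of a leftover lock by looking at its primary key.
func (s *toyStore) resolveLock(key string) {
	l, ok := s.locks[key]
	if !ok {
		return
	}
	if _, pending := s.locks[l.primary]; pending {
		return // primary still undecided
	}
	delete(s.locks, key)
	if commitTS, committed := s.writes[l.primary]; committed {
		s.writes[key] = commitTS // primary committed: roll the secondary forward
	}
	// Otherwise the primary has neither lock nor commit record: treated as a rollback.
}

// replay runs the two-key transaction, optionally deleting the primary's range mid-commit.
func replay(withDeleteRange bool) error {
	s := &toyStore{locks: map[string]lock{}, writes: map[string]uint64{}}
	s.prewrite([]string{"p", "s"}, "p", 10)
	if err := s.commit("p", 10, 20); err != nil {
		return err
	}
	if withDeleteRange {
		s.deleteRange("p", "q") // covers only the primary key "p"
	}
	s.resolveLock("s") // another reader runs into the secondary lock
	return s.commit("s", 10, 20)
}

func main() {
	fmt.Println("with delete range:", replay(true))
	fmt.Println("without delete range:", replay(false))
}
```

In this toy run the secondary commit fails with TxnLockNotFound only when the delete range removes the primary's commit record first; without it, lock resolution rolls the secondary forward and the late commit is a no-op.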

I have another question. According to the monitoring, the resolved ts recovered on its own after a day. If the primary key is lost, can TiKV recover by itself?

| username: neilshen

TiCDC does not support replicating data that was not written through TiDB, and the txnkv scenario described in this post has not been tested. It is not recommended for production use.

Regarding the question “In the case of a missing primary key, can TiKV recover on its own?”:

In theory, a transaction that has a secondary key but no primary key cannot be recovered automatically. The recovery here may have happened because the key was no longer in a Region monitored by TiCDC (for example, it was split into another Region), or because it was cleaned up manually.
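
To illustrate why a single leftover lock can pin the min resolved ts and why it can move again once the affected Region drops out of the watched set, here is a small sketch. It assumes that a Region's resolved ts cannot advance past the smallest start_ts among the locks tracked for it, and that the reported minimum is taken over the watched Regions only; the `region` type and both functions are hypothetical.

```go
package main

import "fmt"

// region is a hypothetical view of what a resolver tracks for one Region.
type region struct {
	lockStartTS []uint64 // start_ts of locks still tracked in this Region
}

// resolvedTS caps the current timestamp by the smallest start_ts among tracked locks.
func (r region) resolvedTS(now uint64) uint64 {
	ts := now
	for _, s := range r.lockStartTS {
		if s < ts {
			ts = s
		}
	}
	return ts
}

// minResolvedTS takes the minimum resolved ts over the Regions being watched.
func minResolvedTS(watched []region, now uint64) uint64 {
	lowest := now
	for _, r := range watched {
		if ts := r.resolvedTS(now); ts < lowest {
			lowest = ts
		}
	}
	return lowest
}

func main() {
	stuck := region{lockStartTS: []uint64{100}} // one lock that is never cleaned up
	healthy := region{}

	// While the stuck Region is watched, the minimum stays at the lock's start_ts.
	fmt.Println(minResolvedTS([]region{healthy, stuck}, 500))
	// Once it is no longer watched, the minimum follows the current timestamp again.
	fmt.Println(minResolvedTS([]region{healthy}, 500))
}
```

With these made-up numbers, the first call yields 100 and the second yields 500.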

| username: TiDBer_uTApj3hn

Okay, thank you.