TiCDC latency is very high, and the logs keep showing "The resolvedTs is fallen back in kvclient"

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: ticdc 延迟非常高 后台一直刷 The resolvedTs is fallen back in kvclient

| username: leoones

[Overview] Scenario + Problem Overview
TiDB & PD mixed deployment with 3 nodes, TiCDC with 2 nodes, TiKV with 11 nodes.
In the afternoon, all 3 TiDB nodes in the cluster hit OOM, causing many TiCDC tasks to stall. The TiCDC logs continuously show:
The resolvedTs is fallen back in kvclient

[Background] Actions Taken
Restarted the entire cluster
Individually restarted all TiKV nodes
Individually restarted the CDC service

[Problem] Current Issue
TiCDC is still continuously logging the same message.

[Business Impact]
Affects data push, causing business data delays and impacting users.
[TiDB Version]
v5.4.0

[Attachments] Relevant logs and monitoring (https://metricstool.pingcap.com/)

| username: xiaohetao | Original post link

What was the status of TiCDC before you restarted TiKV?

| username: leoones | Original post link

The CDC task is in a normal state.

| username: leoones | Original post link

Is there any way to solve it?

| username: tidb狂热爱好者 | Original post link

Environment

Cluster version: v5.1.2

TiDB server: 16c 32g, 7 nodes

TiCDC machines: 16c 64g, 2 nodes for high availability

Issue occurred in versions: v4.0.9, v4.0.13, v5.1.2

After the TiDB server experienced an OOM, the TiCDC checkpointTs did not advance. Pausing the changefeed did not resolve the issue, and restarting the CDC component with tiup did not help either.

The TiCDC logs showed numerous warning messages:

[2022/01/30 11:17:55.761 +08:00] [WARN] [region_worker.go:377] ["region not receiving resolved event from tikv or resolved ts is not pushing for too long time, try to resolve lock"] [regionID=40982339] [span="[7480000000000016ffa05f72f000000019ff3a9b7b0000000000fa, 7480000000000016ffa05f72f000000019ff7319ad0000000000fa)"] [duration=17m21.65s] [lastEvent=93.273628ms] [resolvedTs=430836713476849684]

[2022/01/30 11:17:55.771 +08:00] [WARN] [region_worker.go:743] ["The resolvedTs is fallen back in kvclient"] ["Event Type"=RESOLVED] [resolvedTs=430836713699672811] [lastResolvedTs=430836984350245604] [regionID=31134532]

TiDB server OOM monitoring

TiDB CDC monitoring

Issue Investigation

Investigation of TiDB server OOM cause

From the related slow queries, the preliminary conclusion was that the R&D team querying the Dashboard's slow query page caused the TiDB server to OOM.

OOM triggered by the Dashboard's slow query page has occurred multiple times from version 4.0 through 5.1.2; the issue still exists in the current version and will be fixed in a future release.

Current mitigation measures:

  • Minimize the number of slow-log files, for example by regularly archiving and deleting slow logs or setting log.file.max-days, and increase the slow query threshold log.slow-threshold (see the sketch after this list).
  • When querying, select a smaller time range, such as within one hour, and try to avoid concurrent slow query searches by multiple users.
  • Avoid using ORDER BY sorting when querying.
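
As a minimal sketch of raising the threshold at runtime, assuming a TiDB v5.x instance: the system variable tidb_slow_log_threshold (in milliseconds) is the runtime counterpart of the log.slow-threshold config item, though scope details can differ between versions. Note that log.file.max-days itself can only be changed in the TiDB configuration file, not via SQL.

```sql
-- Raise the slow query threshold on this TiDB instance to 1 second
-- (the default is 300 ms); queries faster than this are no longer
-- written to the slow log, which keeps the slow-log files small.
SET tidb_slow_log_threshold = 1000;

-- Verify the new value.
SELECT @@tidb_slow_log_threshold;
```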

Currently, TiDB provides an OOM tracker tool for analyzing OOM issues; it records the top memory-consuming SQL statements and heap statistics in the TiDB server's tmp directory for analysis. Compared with earlier versions, troubleshooting is relatively simpler.

Investigation of TiCDC issue

From the monitoring, it can be seen that after the TiDB server OOM, the resolved-lag on some TiKV nodes increased.

According to the CDC developers, TiDB takes optimistic or pessimistic locks on the related tables during transactions. When the TiDB server OOMs, those locks are not released properly, and a Region merge can block the resolveLock function, so the locks are never cleaned up.

This issue has been fixed in version 5.1.4.

Issue Handling

Versions 5.1.4 and later

The issue of the CDC process's checkpointTs not advancing after a TiDB server OOM resolves itself; no special handling is needed.

Versions before 5.1.4 require manual handling

Workaround:

In a previous similar case, running select count(*) from tb_name to perform a count on the table released the related locks.

However, there may be cases where the count operation does not resolve the issue.

If it is a pessimistic lock, select count alone cannot release it, because the lock may sit on an index. In that case a use index hint is needed, i.e., run one count per index, as shown in the sketch below.
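
A minimal sketch of this workaround, assuming a hypothetical table tb_name with a secondary index idx_col (idx_col is a placeholder name, not from the original post):

```sql
-- List the indexes on the table, so each one can be scanned once.
SHOW INDEX FROM tb_name;

-- A full table scan can release leftover locks on the row data.
SELECT COUNT(*) FROM tb_name;

-- For locks sitting on an index, force a scan of that index with a
-- USE INDEX hint; repeat this once for every index on the table.
SELECT COUNT(*) FROM tb_name USE INDEX (idx_col);
```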

How to determine if the lock is released after the count

If, after the count(*) operation, you start the changefeed and still see a large number of "The resolvedTs is fallen back in kvclient" messages in the CDC log, the lock has not been released. It is recommended to pause the problematic changefeed before releasing the lock, because a problematic changefeed can prevent the checkpointTs of healthy changefeeds from advancing.

Issue fix: the kv client's logic for triggering lock resolution has a bug that prevents it from firing in time. For details, see:

https://github.com/pingcap/tiflow/issues/2867

Related Information

fallback resolvedTs event skips the lock resolving unexpectedly · Issue #3061 · pingcap/tiflow (https://github.com/pingcap/tiflow/issues/3061)

https://github.com/tikv/tikv/pull/11991

| username: leoones | Original post link

Currently, my version is already v5.4.0.

| username: xiaohetao | Original post link

Before restarting TiKV, was TiCDC still running, that is, were the TiCDC instances and replication tasks left unstopped?

| username: weixiaobing | Original post link

Has this been resolved? I have encountered a similar issue and it is still unresolved.

| username: HACK | Original post link

Reduce your setup as much as possible (the number of TiCDC nodes, the number of replicated tables, the replication configuration parameters, etc.) and then observe.

| username: leoones | Original post link

I did not stop the CDC tasks before restarting TiKV.

| username: xiaohetao | Original post link

It is suspected that CDC might have been synchronizing at that time, and TiDB suddenly disconnected, causing the issue.

| username: xiaohetao | Original post link

You should still check the logs of the systems and nodes involved in the CDC replication and see if you can find any clues.

| username: xiaohetao | Original post link

Oh, also check the source end.

| username: leoones | Original post link

The issue has been resolved. The cluster was on v5.4.0, and this bug was fixed in v5.4.1, so the cluster has been upgraded to v5.4.2. Thank you.

| username: system | Original post link

This topic was automatically closed 1 minute after the last reply. No new replies are allowed.