TiCDC synchronization task is stuck, and TiKV keeps restarting after restarting CDC

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiCDC同步任务卡住,重启CDC之后TiKV一直重启

| username: 雪落香杉树

[TiDB Usage Environment] Production Environment
[TiDB Version] v5.4.0
[Encountered Problem: Phenomenon and Impact]
In the early morning, 4 CDC tasks under the cluster had a delay of more than 30 minutes. The task status was normal, but the checkpoint could not move forward. Previously, similar issues were resolved by restarting CDC.
However, after restarting, only two of the four tasks returned to normal, while the other two still could not advance the checkpoint normally.


At the same time, TiKV nodes also started experiencing down and restart issues.

TiKV logs continuously reported: cdc initialize fail, failed to send extra message, cdc initialize fail: Request error message: "peer is not leader for region

Subsequently, attempts were made to restart TiDB, PD, and TiKV components, but the two CDC tasks could not be restored, and TiKV kept restarting.
Later, the problematic tasks were paused, and TiKV stopped restarting. The issue was that these two tasks could not fetch the TiKV change data from the problematic time point, causing initialization failure and continuous TiKV restarts.
After confirming with the business, it was decided to skip a period of time after the problematic time point and start synchronization. A new task was created, specifying a timestamp to skip half an hour and start synchronization. The new task had no issues, and the checkpoint advanced normally.

[Resource Configuration]

[Attachments: Screenshots/Logs/Monitoring]
TiKV monitoring for the last 24 hours and TiCDC monitoring during the problematic period:
tidb-cluster3-TiKV-Details_2023-08-10T05_58_59.975Z.json (35.9 MB)
tidb-cluster3-TiCDC_2023-08-10T06_02_33.040Z.json (2.9 MB)

| username: zhanggame1 | Original post link

Run admin show ddl to check if there are any DDL tasks, and then analyze the table where the error region is located.

| username: 雪落香杉树 | Original post link

No DDL operations have been performed.

| username: 雪落香杉树 | Original post link

The main error is this kind, and it can basically be confirmed that it is those few tables synchronized by CDC, because the error disappears after pausing these two tasks, and TiKV also stops restarting constantly. However, I still don’t know what caused the problem, and I need to check what caused the task to get stuck around 12:03 AM.

| username: 雪落香杉树 | Original post link

Sorry, I cannot translate images. Please provide the text content for translation.

| username: 像风一样的男子 | Original post link

Seeing a lot of Transport(Full), is it because the bandwidth between clusters is insufficient?

| username: WalterWj | Original post link

TiCDC synchronization requires scanning regions, and synchronization will not proceed until the region scan is complete.

You should first resolve the TiKV restart issue.

  1. Investigate the reason for the TiKV restart. Is it due to OOM?
  2. TiCDC consumes a lot of memory when scanning regions. You haven’t deployed TiCDC and TiKV together, have you?
| username: redgame | Original post link

You need to find a balance point and try adjusting the CDC parameters, such as adjusting the number of CDC worker threads and the CDC memory limit, to improve the performance and stability of CDC.

| username: 雪落香杉树 | Original post link

Looking at the network bandwidth on the early morning of August 10th and other times, there are no obvious peaks, and it is not high either.

| username: 雪落香杉树 | Original post link

  1. Confirmed that it’s not an OOM issue. Upon restarting, the logs report an error: “cdc initialize fail,” causing the CDC task to pause and TiKV to stop restarting.
  2. They are deployed separately, and the machine load is not high.
| username: WalterWj | Original post link

Sounds like a bug.

  1. Either try upgrading
  2. Or redo this ticdc task.
| username: 雪落香杉树 | Original post link

It was redone. The original one was paused, and it resumed synchronization normally after skipping the specified time of 30 minutes.

| username: 雪落香杉树 | Original post link

Panic information in TiKV logs:

2023-08-10 00:44:56 (GMT+8)
UNKNOWN
TiKV x.x.x.x:20160
[lib.rs:465] ["assertion failed: `(left >= right)`\n left: `TimeStamp(443442001931665411)`,\n right: `TimeStamp(443442007383998887)`"] [backtrace=" 0: tikv_util::set_panic_hook::{{closure}}\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tikv_util/src/lib.rs:464:18\n 1: std::panicking::rust_panic_with_hook\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:626:17\n 2: std::panicking::begin_panic_handler::{{closure}}\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:519:13\n 3: std::sys_common::backtrace::__rust_end_short_backtrace\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/sys_common/backtrace.rs:141:18\n 4: rust_begin_unwind\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:515:5\n 5: std::panicking::begin_panic_fmt\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:457:5\n 6: tikv::storage::mvcc::reader::scanner::seek_for_valid_write\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/src/storage/mvcc/reader/scanner/mod.rs:448:17\n tikv::storage::mvcc::reader::scanner::seek_for_valid_value\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/src/storage/mvcc/reader/scanner/mod.rs:495:9\n 7: <tikv::storage::mvcc::reader::scanner::forward::DeltaEntryPolicy as tikv::storage::mvcc::reader::scanner::forward::ScanPolicy<S>>::handle_lock\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/src/storage/mvcc/reader/scanner/forward.rs:694:29\n tikv::storage::mvcc::reader::scanner::forward::ForwardScanner<S,P>::read_next\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/src/storage/mvcc/reader/scanner/forward.rs:271:42\n <tikv::storage::mvcc::reader::scanner::forward::ForwardScanner<S,P> as tikv::storage::txn::store::TxnEntryScanner>::next_entry\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/src/storage/mvcc/reader/scanner/forward.rs:837:12\n 8: cdc::endpoint::Initializer<E>::do_scan\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/endpoint.rs:1183:19\n cdc::endpoint::Initializer<E>::scan_batch::{{closure}}\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/endpoint.rs:1221:13\n <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/core/src/future/mod.rs:80:19\n cdc::endpoint::Initializer<E>::async_incremental_scan::{{closure}}\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/endpoint.rs:1133:27\n 9: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/core/src/future/mod.rs:80:19\n cdc::endpoint::Initializer<E>::on_change_cmd_response::{{closure}}\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/endpoint.rs:1066:13\n <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/core/src/future/mod.rs:80:19\n 10: cdc::endpoint::Initializer<E>::initialize::{{closure}}\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/endpoint.rs:1054:25\n <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/core/src/future/mod.rs:80:19\n cdc::endpoint::Endpoint<T,E>::on_register::{{closure}}\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/endpoint.rs:628:19\n <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/core/src/future/mod.rs:80:19\n tokio::runtime::task::core::CoreStage<T>::poll::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/core.rs:161:17\n tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/loom/std/unsafe_cell.rs:14:9\n tokio::runtime::task::core::CoreStage<T>::poll\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/core.rs:151:13\n tokio::runtime::task::harness::poll_future::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/harness.rs:461:19\n <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panic.rs:347:9\n std::panicking::try::do_call\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:401:40\n std::panicking::try\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:365:19\n std::panic::catch_unwind\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panic.rs:434:14\n tokio::runtime::task::harness::poll_future\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/harness.rs:449:18\n tokio::runtime::task::harness::Harness<T,S>::poll_inner\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/harness.rs:98:27\n tokio::runtime::task::harness::Harness<T,S>::poll\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/harness.rs:53:15\n tokio::runtime::task::raw::poll\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/raw.rs:113:5\n 12: tokio::runtime::task::raw::RawTask::poll\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/raw.rs:70:18\n tokio::runtime::task::LocalNotified<S>::run\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/mod.rs:343:9\n tokio::runtime::thread_pool::worker::Context::run_task::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/thread_pool/worker.rs:420:13\n tokio::coop::with_budget::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/coop.rs:106:9\n std::thread::local::LocalKey<T>::try_with\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/thread/local.rs:399:16\n std::thread::local::LocalKey<T>::with\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/thread/local.rs:375:9\n tokio::coop::with_budget\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/coop.rs:99:5\n tokio::coop::budget\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/coop.rs:76:5\n tokio::runtime::thread_pool::worker::Context::run_task\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/thread_pool/worker.rs:419:9\n 13: tokio::runtime::thread_pool::worker::Context::run\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/thread_pool/worker.rs:386:24\n tokio::runtime::thread_pool::worker::run::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/thread_pool/worker.rs:371:17\n tokio::macros::scoped_tls::ScopedKey<T>::set\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/macros/scoped_tls.rs:61:9\n tokio::runtime::thread_pool::worker::run\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/thread_pool/worker.rs:368:5\n tokio::runtime::thread_pool::worker::Launch::launch::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/thread_pool/worker.rs:347:45\n <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/blocking/task.rs:42:21\n tokio::runtime::task::core::CoreStage<T>::poll::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/core.rs:161:17\n tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/loom/std/unsafe_cell.rs:14:9\n tokio::runtime::task::core::CoreStage<T>::poll\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/core.rs:151:13\n tokio::runtime::task::harness::poll_future::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/harness.rs:461:19\n <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panic.rs:347:9\n std::panicking::try::do_call\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:401:40\n std::panicking::try\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:365:19\n std::panic::catch_unwind\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panic.rs:434:14\n tokio::runtime::task::harness::poll_future\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/harness.rs:449:18\n tokio::runtime::task::harness::Harness<T,S>::poll_inner\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/harness.rs:98:27\n tokio::runtime::task::harness::Harness<T,S>::poll\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/harness.rs:53:15\n tokio::runtime::task::raw::poll\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/raw.rs:113:5\n 14: tokio::runtime::task::raw::RawTask::poll\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/raw.rs:70:18\n tokio::runtime::task::UnownedTask<S>::run\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/task/mod.rs:379:9\n tokio::runtime::blocking::pool::Inner::run\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/blocking/pool.rs:265:17\n tokio::runtime::blocking::pool::Spawner::spawn_thread::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/blocking/pool.rs:245:17\n std::sys_common::backtrace::__rust_begin_short_backtrace\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/sys_common/backtrace.rs:125:18\n 15: std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}}\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/thread/mod.rs:476:17\n <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panic.rs:347:9\n std::panicking::try::do_call\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:401:40\n std::panicking::try\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:365:19\n std::panic::catch_unwind\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panic.rs:434:14\n std::thread::Builder::spawn_unchecked::{{closure}}\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/thread/mod.rs:475:30\n core::ops::function::FnOnce::call_once{{vtable.shim}}\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/core/src/ops/function.rs:227:5\n 16: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/alloc/src/boxed.rs:1572:9\n <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/alloc/src/boxed.rs:1572:9\n std::
| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.