TiFlash Data Not Synchronized: Unable to Connect to TiFlash After Executing [Create Replica]

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tiflash不同步数据,执行完【创建副本】执行链接不到tiflash

| username: TiDBer_zhw

[TiDB Usage Environment] Production Environment
[TiDB Version] v6.5.1
[Reproduction Path]
Execute ALTER TABLE sxt_order SET TIFLASH REPLICA 1;
After a few minutes, the cluster cannot connect to TiFlash, and there are error logs.
The log is as follows:

[2023/03/30 17:24:39.182 +08:00] [ERROR] [Exception.cpp:89] ["Code: 49, e.displayText() = DB::Exception: invalid flag 83 in write cf: physical_table_id=10479: (while preHandleSnapshot region_id=6513, index=45900, term=7), e.what() = DB::Exception, Stack trace:\n\n\n       0x17225ce\tDB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) [tiflash+24257998]\n                \tdbms/src/Common/Exception.h:46\n       0x6b08404\tDB::RegionCFDataBase<DB::RegionWriteCFDataTrait>::insert(DB::StringObject<true>&&, DB::StringObject<false>&&) [tiflash+112231428]\n                \tdbms/src/Storages/Transaction/RegionCFDataBase.cpp:46\n       0x6a949db\tDB::DM::SSTFilesToBlockInputStream::read() [tiflash+111757787]\n                \tdbms/src/Storages/DeltaMerge/SSTFilesToBlockInputStream.cpp:137\n       0x696ca15\tDB::DM::readNextBlock(std::__1::shared_ptr<DB::IBlockInputStream> const&) [tiflash+110545429]\n                \tdbms/src/Storages/DeltaMerge/DeltaMergeHelpers.h:253\n       0x6a97412\tDB::DM::PKSquashingBlockInputStream<true>::read() [tiflash+111768594]\n                \tdbms/src/Storages/DeltaMerge/PKSquashingBlockInputStream.h:68\n       0x696ca15\tDB::DM::readNextBlock(std::__1::shared_ptr<DB::IBlockInputStream> const&) [tiflash+110545429]\n                \tdbms/src/Storages/DeltaMerge/DeltaMergeHelpers.h:253\n       0x16d57e5\tDB::DM::DMVersionFilterBlockInputStream<1>::initNextBlock() [tiflash+23943141]\n                \tdbms/src/Storages/DeltaMerge/DMVersionFilterBlockInputStream.h:137\n       0x16d360c\tDB::DM::DMVersionFilterBlockInputStream<1>::read(DB::PODArray<unsigned char, 4096ul, Allocator<false>, 15ul, 16ul>*&, bool) [tiflash+23934476]\n                \tdbms/src/Storages/DeltaMerge/DMVersionFilterBlockInputStream.cpp:51\n       0x6a96868\tDB::DM::BoundedSSTFilesToBlockInputStream::read() [tiflash+111765608]\n                \tdbms/src/Storages/DeltaMerge/SSTFilesToBlockInputStream.cpp:307\n       0x16d9044\tDB::DM::SSTFilesToDTFilesOutputStream<std::__1::shared_ptr<DB::DM::BoundedSSTFilesToBlockInputStream> >::write() [tiflash+23957572]\n                \tdbms/src/Storages/DeltaMerge/SSTFilesToDTFilesOutputStream.cpp:200\n       0x6a8d38f\tDB::KVStore::preHandleSSTsToDTFiles(std::__1::shared_ptr<DB::Region>, DB::SSTViewVec, unsigned long, unsigned long, DB::DM::FileConvertJobType, DB::TMTContext&) [tiflash+111727503]\n                \tdbms/src/Storages/Transaction/ApplySnapshot.cpp:360\n       0x6a8ca64\tDB::KVStore::preHandleSnapshotToFiles(std::__1::shared_ptr<DB::Region>, DB::SSTViewVec, unsigned long, unsigned long, DB::TMTContext&) [tiflash+111725156]\n                \tdbms/src/Storages/Transaction/ApplySnapshot.cpp:275\n       0x6ae7d66\tPreHandleSnapshot [tiflash+112098662]\n                \tdbms/src/Storages/Transaction/ProxyFFI.cpp:388\n  0x7f693aa9a228\tengine_store_ffi::_$LT$impl$u20$engine_store_ffi..interfaces..root..DB..EngineStoreServerHelper$GT$::pre_handle_snapshot::hec57f9b0ef29a0bb [libtiflash_proxy.so+17646120]\n  0x7f693aa91d09\tengine_store_ffi::observer::pre_handle_snapshot_impl::h0b40090f59175b24 [libtiflash_proxy.so+17612041]\n  0x7f693aa84b86\tyatp::task::future::RawTask$LT$F$GT$::poll::hd3296fb5cae316b9 [libtiflash_proxy.so+17558406]\n  0x7f693c910dc3\t_$LT$yatp..task..future..Runner$u20$as$u20$yatp..pool..runner..Runner$GT$::handle::h0056e31c4da70e35 [libtiflash_proxy.so+49589699]\n  0x7f693c9036ac\tstd::sys_common::backtrace::__rust_begin_short_backtrace::h747afb2668c16dcb 
[libtiflash_proxy.so+49534636]\n  0x7f693c9041cc\tcore::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h83ec6721ad8db87f [libtiflash_proxy.so+49537484]\n  0x7f693c071555\tstd::sys::unix::thread::Thread::new::thread_start::hd2791a9cabec1fda [libtiflash_proxy.so+40547669]\n                \t/rustc/96ddd32c4bfb1d78f0cd03eb068b1710a8cebeef/library/std/src/sys/unix/thread.rs:108\n  0x7f69397b1ea5\tstart_thread [libpthread.so.0+32421]\n  0x7f6938bb6b0d\tclone [libc.so.6+1043213]"] [source="DB::RawCppPtr DB::PreHandleSnapshot(DB::EngineStoreServerWrap *, DB::BaseBuffView, uint64_t, DB::SSTViewVec, uint64_t, uint64_t)"] [thread_id=30]

TiFlash error log:

Cluster configuration:


The cluster status was normal before execution, but after running ALTER TABLE sxt_order SET TIFLASH REPLICA 1, TiFlash became disconnected.

We have already tried scaling TiFlash in and back out, but that did not solve the problem.
After the error, the TiFlash deployment directory keeps generating core files of about 1 GB each, which intuitively feels like some kind of overflow.
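
For scale, a hedged way to measure those core files and the remaining disk space; the deploy path below is an assumption (tiup's default TiFlash path is /tidb-deploy/tiflash-9000, adjust to the actual topology):

```shell
# Assumed deploy path; replace with the one from the cluster topology.
ls -lh /tidb-deploy/tiflash-9000/core.* 2>/dev/null   # individual core files
du -sh /tidb-deploy/tiflash-9000                      # total size of the deploy directory
df -h /tidb-deploy                                    # space left on that mount
```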

I verified this bug, and the result is: as long as the downstream TiDB cluster has received data synchronized from the upstream TiCDC, the downstream TiFlash will definitely crash. No matter how the downstream TiFlash is scaled in or out, it cannot start normally again. This issue is consistently reproducible in v6.5.0, v6.5.1, and v7.0.0.

| username: tidb菜鸟一只 | Original post link

Check the cluster status.
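
For example, a minimal status check with tiup; the cluster name `mycluster` is an assumption (the real name comes from `tiup cluster list`):

```shell
# Assumed cluster name; TiFlash instances showing Down/Disconnected would match the symptom.
tiup cluster display mycluster
```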

| username: TiDBer_zhw | Original post link

| username: Kongdom | Original post link

Are there any error messages when starting the TiFlash node?

| username: TiDBer_zhw | Original post link

When starting, it immediately reports a timeout. Right now the log above keeps scrolling, and it shows TiFlash repeatedly starting and crashing, which eventually filled the disk; now it cannot start at all.
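
A hedged sketch (cluster name assumed) of one way to break the crash loop before the disk fills up completely:

```shell
# Stop only the TiFlash nodes so they stop crash-looping and dumping core files.
tiup cluster stop mycluster -R tiflash   # `mycluster` is an assumed cluster name
df -h                                    # then check how much disk space is left
```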

| username: h5n1 | Original post link

select * from information_schema.tiflash_replica;
pd-ctl region 6513
Are there many different region_ids in the error log or are they concentrated on a few?
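
Spelled out as commands, with the host addresses, port, and log path all being assumptions:

```shell
# Replica status (TiDB host/port assumed to be 127.0.0.1:4000):
mysql -h 127.0.0.1 -P 4000 -u root -e "SELECT * FROM information_schema.tiflash_replica\G"
# Region 6513 from the error log (PD address assumed):
tiup ctl:v6.5.1 pd -u http://127.0.0.1:2379 region 6513
# Count how many distinct region_ids appear in the TiFlash error log (log path assumed):
grep -oE 'region_id=[0-9]+' /tidb-deploy/tiflash-9000/log/tiflash_error.log \
  | sort | uniq -c | sort -rn | head
```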

| username: tidb狂热爱好者 | Original post link

Is this still the cluster that I repaired before?

| username: TiDBer_zhw | Original post link

No, this is a newly deployed cluster. I have added the cluster topology; please take a look.

| username: TiDBer_zhw | Original post link

Only 1

| username: vcdog | Original post link

Is there a TiCDC changefeed synchronizing data from an upstream cluster into this TiDB cluster? If so, there is a bug in v6.5.0 where TiFlash restarts infinitely and generates core files.

| username: tidb狂热爱好者 | Original post link


| username: TiDBer_zhw | Original post link

There is an upstream TiCDC changefeed synchronizing data into this TiDB cluster, but pausing it doesn't help either. The command used is tiup ctl:v6.5.0 cdc changefeed pause, which only pauses the changefeed without deleting the CDC task.
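
For concreteness, the pause spelled out with the cdc command-line tool; the TiCDC server address and changefeed ID below are assumptions:

```shell
# List changefeeds to find the ID (cdc server address assumed):
tiup ctl:v6.5.0 cdc changefeed list --server=http://127.0.0.1:8300
# `pause` keeps the changefeed definition; `remove` would delete it entirely.
tiup ctl:v6.5.0 cdc changefeed pause --server=http://127.0.0.1:8300 -c my-changefeed
```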

| username: TiDBer_zhw | Original post link

Is there a solution?

| username: TiDBer_zhw | Original post link

I have verified this bug, and the result is: as long as the downstream TiDB cluster has received data synchronized from the upstream TiCDC, the downstream TiFlash will definitely crash. No matter how you scale the downstream TiFlash in or out, it cannot start normally again. This issue is consistently reproducible in v6.5.0, v6.5.1, and v7.0.0.
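
A minimal reproduction sketch of that claim, assuming a downstream table `test.sxt_order` that has already received rows through the upstream changefeed (host, port, and credentials are assumptions):

```shell
# On the downstream cluster, after TiCDC has already written data into the table:
mysql -h 127.0.0.1 -P 4000 -u root -e "ALTER TABLE test.sxt_order SET TIFLASH REPLICA 1;"
# TiFlash then crashes while pre-handling the Raft snapshot for the table's regions,
# reporting the `invalid flag ... in write cf` error shown at the top of this thread.
```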

| username: Billmay表妹 | Original post link

This has already been recorded in the related issue TiFlash crash for cdc compatibility · Issue #7212 · pingcap/tiflash · GitHub and will be resolved as soon as possible.

The problem was caused by this PR: txn: add a new field `txn_source` in write and lock by xiongjiwei · Pull Request #13777 · tikv/tikv · GitHub

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.