TiFlash Replica Progress Stuck When Setting by Database

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: Tiflash 按库设置副本进度卡住

| username: Curiouser

[TiDB Usage Environment] Production Environment / Test / PoC
Production Environment
[TiDB Version]
v7.2.0
[Reproduction Path] What operations were performed to encounter the problem

[Encountered Problem: Problem Phenomenon and Impact]

Problem 1: Replica progress stuck with errors when setting replicas by database

The plan was to create TiFlash replicas for the whole database:

ALTER DATABASE db SET TIFLASH REPLICA 1;

The following SQL was then used to check the synchronization progress. After about 40 minutes, roughly 57 tables had been synchronized, but subsequent queries showed that synchronization was stuck and the progress no longer moved.

select count(*) from information_schema.tiflash_replica where progress != 1;
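
To see which tables are still unfinished rather than just the count, a query along the following lines can list them by name (a sketch; 'db' stands for the database used in the ALTER DATABASE statement above):

-- List tables whose TiFlash replica is not yet fully synchronized,
-- together with the current progress and availability flag.
SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, PROGRESS, AVAILABLE
FROM information_schema.tiflash_replica
WHERE TABLE_SCHEMA = 'db' AND PROGRESS != 1
ORDER BY PROGRESS;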

Checking the tiflash_error.log log shows the following errors:

[2023/11/14 11:13:38.934 +08:00] [WARN] [StoragePool.cpp:725] ["The DTFile is already exists, continue to acquire another ID. [call=std::tuple<String, PageIdU64> DB::DM::DeltaMergeStore::preAllocateIngestFile()][path=/tiflash-data/data/t_65415/stable] [id=2450]"] [source=db_7073.t_65415] [thread_id=857]
[2023/11/14 11:13:42.113 +08:00] [ERROR] [Exception.cpp:90] ["Code: 49, e.displayText() = DB::Exception: Check compare(range.getEnd(), ext_file.range.getEnd()) >= 0 failed, range.toDebugString() = [96104,98759), ext_file.range.toDebugString() = [96104,98760), e.what() = DB::Exception, Stack trace:\n\n\n       0x1c349ae\tDB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) [tiflash+29575598]\n                \tdbms/src/Common/Exception.h:46\n       0x73c771c\tDB::DM::DeltaMergeStore::ingestFiles(std::__1::shared_ptr<DB::DM::DMContext> const&, DB::DM::RowKeyRange const&, std::__1::vector<DB::DM::ExternalDTFileInfo, std::__1::allocator<DB::DM::ExternalDTFileInfo> > const&, bool) [tiflash+121403164]\n                \tdbms/src/Storages/DeltaMerge/DeltaMergeStore_Ingest.cpp:558\n       0x7ec2c34\tDB::DM::DeltaMergeStore::ingestFiles(DB::Context const&, DB::Settings const&, DB::DM::RowKeyRange const&, std::__1::vector<DB::DM::ExternalDTFileInfo, std::__1::allocator<DB::DM::ExternalDTFileInfo> > const&, bool) [tiflash+132918324]\n                \tdbms/src/Storages/DeltaMerge/DeltaMergeStore.h:312\n       0x7f6bfe0\tvoid DB::KVStore::checkAndApplyPreHandledSnapshot<DB::RegionPtrWithSnapshotFiles>(DB::RegionPtrWithSnapshotFiles const&, DB::TMTContext&) [tiflash+133611488]\n                \tdbms/src/Storages/Transaction/ApplySnapshot.cpp:140\n       0x7f6a5a6\tvoid DB::KVStore::applyPreHandledSnapshot<DB::RegionPtrWithSnapshotFiles>(DB::RegionPtrWithSnapshotFiles const&, DB::TMTContext&) [tiflash+133604774]\n                \tdbms/src/Storages/Transaction/ApplySnapshot.cpp:446\n       0x7fe6721\tApplyPreHandledSnapshot [tiflash+134113057]\n                \tdbms/src/Storages/Transaction/ProxyFFI.cpp:664\n  0x7fc7cc26809f\t_$LT$engine_store_ffi..observer..TiFlashObserver$LT$T$C$ER$GT$$u20$as$u20$raftstore..coprocessor..ApplySnapshotObserver$GT$::post_apply_snapshot::h9cf73b076a6be5ca [libtiflash_proxy.so+24293535]\n  0x7fc7cd284eed\traftstore::store::worker::region::Runner$LT$EK$C$R$C$T$GT$::handle_pending_applies::he8dd9bcf803128c3 [libtiflash_proxy.so+41189101]\n  0x7fc7cc876170\tyatp::task::future::RawTask$LT$F$GT$::poll::hd12828196a3b1f60 [libtiflash_proxy.so+30642544]\n  0x7fc7ce1cb08b\t_$LT$yatp..task..future..Runner$u20$as$u20$yatp..pool..runner..Runner$GT$::handle::h0559c073015890dd [libtiflash_proxy.so+57204875]\n  0x7fc7cc36da34\tstd::sys_common::backtrace::__rust_begin_short_backtrace::h0ad613d9614d6b0a [libtiflash_proxy.so+25365044]\n  0x7fc7cc3bd4a8\tcore::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h9e1cc15e5b5258a1 [libtiflash_proxy.so+25691304]\n  0x7fc7cd8fb7c5\tstd::sys::unix::thread::Thread::new::thread_start::hd2791a9cabec1fda [libtiflash_proxy.so+47966149]\n                \t/rustc/96ddd32c4bfb1d78f0cd03eb068b1710a8cebeef/library/std/src/sys/unix/thread.rs:108\n  0x7fc7ca41eea5\tstart_thread [libpthread.so.0+32421]\n  0x7fc7c9d2db0d\tclone [libc.so.6+1043213]"] [source="void DB::ApplyPreHandledSnapshot(DB::EngineStoreServerWrap *, PreHandledSnapshot *) [PreHandledSnapshot = DB::PreHandledSnapshotWithFiles]"] [thread_id=860]

At the same time, tiup cluster display shows that the TiFlash node frequently becomes disconnected.
Following earlier forum posts ("7.1.1 TiFlash cannot restart after the range check fails while applying a snapshot - TiDB Q&A community" and "TiFlash instance loses connection while creating TiFlash data replicas - #8, by tidb菜鸟一只 - TiDB Q&A community"), I tried using tiup to set the TiFlash parameter profiles.default.dt_enable_ingest_check, but TiFlash then failed to start, reporting an unrecognized configuration item.

Problem 2: After canceling replication for the database and restarting the TiFlash node, the TiFlash node's storage filled up the next day

After the synchronization got stuck and began reporting errors, I urgently canceled it with the following SQL and then restarted the TiFlash node. The next day, monitoring showed that one of the 1 TB TiFlash disks had more than 800 GB consumed. Inspection revealed many core.xxx files in the TiFlash deployment directory, each about 1 GB in size.

ALTER DATABASE saas SET TIFLASH REPLICA 0;
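
To confirm that the replicas were actually removed after this statement, a check along these lines should return an empty result (a sketch, using the saas database from the statement above):

-- Any remaining rows indicate tables that still have a TiFlash replica configured.
SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
FROM information_schema.tiflash_replica
WHERE TABLE_SCHEMA = 'saas';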

[Resource Configuration]
Two PD nodes in the cluster, one TiFlash node, five TiKV nodes, and four TiDB nodes.
TiFlash hardware configuration: 32 cores, 64 GB of memory, and two 1 TB SSDs used as TiFlash data storage.
The DB to be synchronized has about 430 tables, including two tables with 20-30 million rows.
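
For reference, the approximate size of each table can be checked from statistics before deciding which tables to replicate first (a sketch; TABLE_ROWS is only an estimate, and 'db' stands for the target database):

-- Rough per-table row counts in the target database, largest first.
SELECT TABLE_NAME, TABLE_ROWS
FROM information_schema.tables
WHERE TABLE_SCHEMA = 'db'
ORDER BY TABLE_ROWS DESC
LIMIT 10;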

tiflash.toml configuration file

default_profile = "default"
display_name = "TiFlash"
listen_host = "0.0.0.0"
tcp_port = 9000
tmp_path = "/tiflash-data/tmp"
[flash]
service_addr = "192.168.190.9:3930"
tidb_status_addr = "192.168.190.1:10080,192.168.190.2:10080,192.168.190.8:10080,192.168.190.9:10080,192.168.190.206:10080,192.168.190.207:10080"
[flash.flash_cluster]
cluster_manager_path = "/data/tidb/tiflash-9000/bin/tiflash/flash_cluster_manager"
log = "/data/tidb/tiflash-9000/logs/tiflash_cluster_manager.log"
master_ttl = 60
refresh_interval = 20
update_rule_interval = 5
[flash.proxy]
config = "/data/tidb/tiflash-9000/conf/tiflash-learner.toml"
[logger]
count = 20
errorlog = "/data/tidb/tiflash-9000/logs/tiflash_error.log"
level = "info"
log = "/data/tidb/tiflash-9000/logs/tiflash.log"
size = "1000M"
[profiles]
[profiles.default]
max_memory_usage = 0
[raft]
pd_addr = "192.168.190.1:2379,192.168.190.2:2379"
[status]
metrics_port = 8234
[storage]
[storage.main]
capacity = [912680550400, 912680550400]
dir = ["/tiflash-data", "/data/tidb/tiflash-9000/data"]

tiflash-learner.toml configuration file

log-file = "/data/tidb/tiflash-9000/logs/tiflash_tikv.log"
[raftstore]
apply-pool-size = 4
store-pool-size = 4
[rocksdb]
wal-dir = ""
[security]
ca-path = ""
cert-path = ""
key-path = ""
[server]
addr = "0.0.0.0:20170"
advertise-addr = "192.168.190.9:20170"
advertise-status-addr = "192.168.190.9:20292"
engine-addr = "192.168.190.9:3930"
status-addr = "0.0.0.0:20292"
[storage]
data-dir = "/tiflash-data/flash"

[Attachments: Screenshots/Logs/Monitoring]

| username: ShawnYan | Original post link

Which table is it stuck on? How many columns and how much data does that table have?

| username: Curiouser | Original post link

18 tables show no progress at the same time, so it's impossible to tell which specific table it is stuck on.

| username: WalterWj | Original post link

Rebuild TiFlash. It looks like there is an issue involving a bool-type field: ExternalDTFileInfo> > const&, bool

| username: WalterWj | Original post link

You can add the replicas one table at a time :thinking: Not sure whether it's a data issue on your side or a TiFlash bug; it's common for some illegal data to have been written.
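
For example, a per-table version of the earlier statement looks like this (order_detail is only a placeholder table name):

-- Add the TiFlash replica for a single table, then wait for it to become
-- available before moving on to the next table.
ALTER TABLE db.order_detail SET TIFLASH REPLICA 1;

SELECT AVAILABLE, PROGRESS
FROM information_schema.tiflash_replica
WHERE TABLE_SCHEMA = 'db' AND TABLE_NAME = 'order_detail';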

| username: xfworld | Original post link

The version choice seems a bit aggressive for production… better to stay on at least 7.1.x…

| username: Curiouser | Original post link

What are these core.xxx files? What are they for? I stopped the node, but they keep growing.

| username: AnotherCalvinNeo | Original post link

The core files indicate that TiFlash has panicked. You can check tiflash_error.log for the corresponding stack traces.

| username: 路在何chu | Original post link

Are there any errors in the logs?

| username: Curiouser | Original post link

The image you provided is not visible.

| username: Curiouser | Original post link

Yes. The log I posted doesn't have line breaks :joy:

| username: Curiouser | Original post link

Keeping up with the trend

| username: andone | Original post link

Rebuild TiFlash