TiFlash Suddenly Crashed - Detected Invalid Null

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tiflash突然挂了-Detected invalid null

| username: magongyong

[TiDB Usage Environment] Production Environment
[TiDB Version] v6.5.5
[Reproduction Path] What operations were performed when the issue occurred
No operations were performed. TiFlash suddenly crashed, and the automatic restart failed.

[Encountered Issue: Problem Phenomenon and Impact]

[Resource Configuration] Enter TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]

| username: zhanggame1

Could you post the cluster status?

| username: 像风一样的男子

Could you check the cluster status?

| username: TiDBer_小阿飞

It feels like the cluster suddenly went down, doesn’t it?

| username: oceanzhang

Check the network traffic of a specific TiKV node to see if there are any anomalies.

| username: magongyong

Sorry, I posted the wrong log. It has been corrected.

| username: magongyong

I pulled the wrong logs earlier and have posted a new screenshot.

| username: magongyong

At first the instance showed as disconnected; now it is down. I pulled the wrong logs earlier and have posted a new screenshot.

| username: magongyong

All other components are normal. This TiFlash instance initially showed as disconnected and is now down. I pulled the wrong logs earlier and have posted a new screenshot.

| username: magongyong

[2023/11/24 10:10:37.234 +08:00] [ERROR] [Exception.cpp:89] ["Code: 49, e.displayText() = DB::Exception: Detected invalid null when decoding data of column denomination with column type Decimal64: physical_table_id=3668: (while preHandleSnapshot region_id=2680177673, index=847, term=21), e.what() = DB::Exception, Stack trace:

0x1718afe DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > const&, int) [tiflash+24218366]
dbms/src/Common/Exception.h:46
0x6b1536a bool DB::appendRowV2ToBlockImpl(std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > const&, std::__1::__map_const_iterator<std::__1::__tree_const_iterator<std::__1::__value_type<long, unsigned long>, std::__1::__tree_node<std::__1::__value_type<long, unsigned long>, void*>, long> >, std::__1::__map_const_iterator<std::__1::__tree_const_iterator<std::__1::__value_type<long, unsigned long>, std::__1::__tree_node<std::__1::__value_type<long, unsigned long>, void>, long> >, DB::Block&, unsigned long, std::__1::vector<TiDB::ColumnInfo, std::__1::allocatorTiDB::ColumnInfo > const&, long, bool, bool) [tiflash+112284522]
dbms/src/Storages/Transaction/RowCodec.cpp:487
0x6b13824 DB::appendRowToBlock(std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > const&, std::__1::__map_const_iterator<std::__1::__tree_const_iterator<std::__1::__value_type<long, unsigned long>, std::__1::__tree_node<std::__1::__value_type<long, unsigned long>, void>, long> >, std::__1::__map_const_iterator<std::__1::__tree_const_iterator<std::__1::__value_type<long, unsigned long>, std::__1::__tree_node<std::__1::__value_type<long, unsigned long>, void>, long> >, DB::Block&, unsigned long, std::__1::shared_ptr<DB::DecodingStorageSchemaSnapshot const> const&, bool) [tiflash+112277540]
dbms/src/Storages/Transaction/RowCodec.cpp:349
0x6ae0e53 bool DB::RegionBlockReader::readImpl<(DB::TMTPKType)0>(DB::Block&, std::__1::vector<std::__1::tuple<DB::RawTiDBPK, unsigned char, unsigned long, std::__1::shared_ptr<DB::StringObject const> >, std::__1::allocator<std::__1::tuple<DB::RawTiDBPK, unsigned char, unsigned long, std::__1::shared_ptr<DB::StringObject const> > > > const&, bool) [tiflash+112070227]
dbms/src/Storages/Transaction/RegionBlockReader.cpp:146
0x6abac58 DB::GenRegionBlockDataWithSchema(std::__1::shared_ptrDB::Region const&, std::__1::shared_ptr<DB::DecodingStorageSchemaSnapshot const> const&, unsigned long, bool, DB::TMTContext&) [tiflash+111914072]
dbms/src/Storages/Transaction/PartitionStreams.cpp:598
0x6a7089a DB::DM::SSTFilesToBlockInputStream::readCommitedBlock() [tiflash+111610010]
dbms/src/Storages/DeltaMerge/SSTFilesToBlockInputStream.cpp:255
0x6a6f30e DB::DM::SSTFilesToBlockInputStream::read() [tiflash+111604494]
dbms/src/Storages/DeltaMerge/SSTFilesToBlockInputStream.cpp:154
0x6946ea5 DB::DM::readNextBlock(std::__1::shared_ptrDB::IBlockInputStream const&) [tiflash+110390949]
dbms/src/Storages/DeltaMerge/DeltaMergeHelpers.h:253
0x6a71dec DB::DM::PKSquashingBlockInputStream::read() [tiflash+111615468]
dbms/src/Storages/DeltaMerge/PKSquashingBlockInputStream.h:78
0x6946ea5 DB::DM::readNextBlock(std::__1::shared_ptrDB::IBlockInputStream const&) [tiflash+110390949]
dbms/src/Storages/DeltaMerge/DeltaMergeHelpers.h:253
0x16cbd35 DB::DM::DMVersionFilterBlockInputStream<1>::initNextBlock() [tiflash+23903541]
dbms/src/Storages/DeltaMerge/DMVersionFilterBlockInputStream.h:137
0x16cb56b DB::DM::DMVersionFilterBlockInputStream<1>::read(DB::PODArray<unsigned char, 4096ul, Allocator, 15ul, 16ul>&, bool) [tiflash+23901547]
dbms/src/Storages/DeltaMerge/DMVersionFilterBlockInputStream.cpp:323
0x6a71018 DB::DM::BoundedSSTFilesToBlockInputStream::read() [tiflash+111611928]
dbms/src/Storages/DeltaMerge/SSTFilesToBlockInputStream.cpp:307
0x16cf574 DB::DM::SSTFilesToDTFilesOutputStream<std::__1::shared_ptrDB::DM::BoundedSSTFilesToBlockInputStream >::write() [tiflash+23917940]
dbms/src/Storages/DeltaMerge/SSTFilesToDTFilesOutputStream.cpp:200
0x6a67b3f DB::KVStore::preHandleSSTsToDTFiles(std::__1::shared_ptrDB::Region, DB::SSTViewVec, unsigned long, unsigned long, DB::DM::FileConvertJobType, DB::TMTContext&) [tiflash+111573823]
dbms/src/Storages/Transaction/ApplySnapshot.cpp:360
0x6a67214 DB::KVStore::preHandleSnapshotToFiles(std::__1::shared_ptrDB::Region, DB::SSTViewVec, unsigned long, unsigned long, DB::TMTContext&) [tiflash+111571476]
dbms/src/Storages/Transaction/ApplySnapshot.cpp:275
0x6ac2516 PreHandleSnapshot [tiflash+111944982]
dbms/src/Storages/Transaction/ProxyFFI.cpp:388
0x7fd8813cc228 engine_store_ffi::_$LT$impl$u20$engine_store_ffi…interfaces…root…DB…EngineStoreServerHelper$GT$::pre_handle_snapshot::hec57f9b0ef29a0bb [libtiflash_proxy.so+17646120]
0x7fd8813c3d09 engine_store_ffi::observer::pre_handle_snapshot_impl::h0b40090f59175b24 [libtiflash_proxy.so+17612041]
0x7fd8813b6b86 yatp::task::future::RawTask$LT$F$GT$::poll::hd3296fb5cae316b9 [libtiflash_proxy.so+17558406]
0x7fd883242f13 _$LT$yatp…task…future…Runner$u20$as$u20$yatp…pool…runner…Runner$GT$::handle::h0056e31c4da70e35 [libtiflash_proxy.so+49590035]
0x7fd8832357fc std::sys_common::backtrace::__rust_begin_short_backtrace::h747afb2668c16dcb [libtiflash_proxy.so+49534972]
0x7fd88323631c core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h83ec6721ad8db87f [libtiflash_proxy.so+49537820]
0x7fd8829a36a5 std::sys::unix::thread::thread::new::thread_start::hd2791a9cabec1fda [libtiflash_proxy.so+40548005]
/rustc/96ddd32c4bfb1d78f0cd03eb068b1710a8cebeef/library/std/src/sys/unix/thread.rs:108
0x7fd8800e3e25 start_thread [libpthread.so.0+32293]
0x7fd87f4e9bad clone [libc.so.6+1043373]"] [source="DB::RawCppPtr DB::PreHandleSnapshot(DB::EngineStoreServerWrap *, DB::BaseBuffView, uint64_t, DB::SSTViewVec, uint64_t, uint64_t)"] [thread_id=197]

| username: magongyong

TiFlash just went down completely.

| username: 芮芮是产品

Deleting the TiFlash node and rebuilding it will solve the problem.

| username: 芮芮是产品

Fixing TiFlash is quite easy.

| username: magongyong

The main issue is that there is a lot of data and synchronization is very slow, and this is urgent because it affects production. :scream:

| username: JaySon-Huang

From the stack trace, it looks like TiFlash encountered data that it couldn’t decode correctly into columns.

select `table_schema`,`table_name`, "" as partition_name from information_schema.tables where tidb_table_id='3668'
union
select `table_schema`,`table_name`,`partition_name` from information_schema.partitions where tidb_partition_id = '3668';

Use the SQL above to find out which table physical_table_id 3668 belongs to. Then check the schema of that table. What DDL operations have been performed on it recently?
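Once the lookup returns a schema and table name, the follow-up checks could look roughly like the sketch below. `your_db` and `your_table` are placeholders for whatever the lookup returns, and the window of 50 jobs is an arbitrary choice.

```sql
-- Placeholders: replace your_db / your_table with the names returned by the lookup query.
-- Show the current schema, including the definition of the denomination column.
SHOW CREATE TABLE your_db.your_table;

-- List recent DDL jobs that touched this table (50 is an arbitrary window).
ADMIN SHOW DDL JOBS 50 WHERE db_name = 'your_db' AND table_name = 'your_table';
```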

| username: heiwandou

Add a new TiFlash node.

| username: magongyong

The only recent operations were dropping a unique index and adding a unique index.

| username: JaySon-Huang

To restore the business, one method is to set the TiFlash replica count of the table with table_id=3668 to 0, then scale out new TiFlash nodes and rebuild the replicas on them. This should restore the business.
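A minimal sketch of the first step, assuming the table behind table_id=3668 turns out to be `your_db.your_table` (a placeholder name, not taken from this thread):

```sql
-- Placeholder name: substitute the table that table_id 3668 resolves to.
-- Remove the TiFlash replica for this table only.
ALTER TABLE your_db.your_table SET TIFLASH REPLICA 0;

-- Check which TiFlash replica settings, if any, remain for that table ID.
SELECT table_schema, table_name, table_id, replica_count, available, progress
FROM information_schema.tiflash_replica
WHERE table_id = 3668;
```

Scaling out the new TiFlash nodes is done with the cluster deployment tooling and is not shown here.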

Another method is to copy the data from the table to another table, table_new, which does not have a TiFlash replica. Then, perform ALTER TABLE table_old DROP COLUMN denomination on the original table. This way, TiFlash will not decode the data of that column, and you can try to bypass the issue on the original TiFlash node. However, since the root cause of the bug is not clear, this method does not guarantee that the business will be restored.
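The second method could be sketched as follows. `your_db` is a placeholder schema name, `table_old` and `table_new` are the illustrative names used above, and the replica check in the middle is an added precaution rather than part of the original suggestion.

```sql
-- Placeholders: your_db, table_old, table_new are illustrative names.
-- 1. Create an empty copy of the table structure.
CREATE TABLE your_db.table_new LIKE your_db.table_old;

-- 2. Verify that the copy has no TiFlash replica before loading data into it.
SELECT table_schema, table_name, replica_count
FROM information_schema.tiflash_replica
WHERE table_schema = 'your_db' AND table_name = 'table_new';

-- 3. Copy the rows. A very large table may need to be copied in batches
--    to stay within transaction size limits.
INSERT INTO your_db.table_new SELECT * FROM your_db.table_old;

-- 4. Drop the column that TiFlash fails to decode from the original table.
ALTER TABLE your_db.table_old DROP COLUMN denomination;
```

If the column is still covered by an index, the last statement may be rejected until that index is dropped.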

| username: magongyong

We no longer plan to synchronize this table, or the database it belongs to, to TiFlash, because that database is not currently using TiFlash.

| username: JaySon-Huang

Did the unique indexes that were dropped and added involve the denomination column?
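One way to check this for the indexes that exist now is sketched below, with `your_db` and `your_table` as placeholder names:

```sql
-- Placeholders: replace your_db / your_table with the affected table.
-- List every current index that includes the denomination column.
-- non_unique = 0 marks a unique index.
SELECT index_name, non_unique, seq_in_index, column_name
FROM information_schema.statistics
WHERE table_schema = 'your_db'
  AND table_name = 'your_table'
  AND column_name = 'denomination';
```

This only covers indexes that still exist. An index that has already been dropped would have to be identified from the DDL job history instead.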