TiFlash Abnormal Restart: Checksum Not Match

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tiflash异常重启 checksum not match

| username: Hacker_ojLJ8Ndr

[TiDB Usage Environment] Online
[TiDB Version] 6.1.0
[Encountered Problem]
Previously, querying this table through TiFlash returned an error (error only, no restart), after which we set the table's TiFlash replica count to 0.

Now, when we re-enable the replica for this table, synchronization reaches about 50%, then TiFlash starts reporting errors and the service restarts continuously.
The error in tiflash.log is the same as the earlier one. Part of the log follows:
[2022/07/27 16:56:48.764 +08:00] [ERROR] [Exception.cpp:85] ["void DB::BackgroundProcessingPool::threadFunction():Code: 40, e.displayText() = DB::Exception: Page[167976] field[1] checksum not match, broken file: /data01/deploy/data/data/t_17630/log/page_56_0/page, expected: b0d1876a36fa4582, but: a3e70a0a2ff59556, e.what() = DB::Exception, Stack trace:\

0x1d272d3\tStackTrace::StackTrace() [tiflash+30569171]
0x1d248d6\tDB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > const&, int) [tiflash+30558422]
0x79f8633\tDB::PS::V2::PageFile::Reader::read(std::__1::vector<DB::PS::V2::PageFile::Reader::FieldReadInfo, std::__1::allocatorDB::PS::V2::PageFile::Reader::FieldReadInfo >&, std::__1::shared_ptrDB::ReadLimiter const&) [tiflash+127895091]
0x7a0c089\tDB::PS::V2::PageStorage::readImpl(unsigned long, std::__1::vector<std::__1::pair<unsigned long, std::__1::vector<unsigned long, std::__1::allocator > >, std::__1::allocator<std::__1::pair<unsigned long, std::__1::vector<unsigned long, std::__1::allocator > > > > const&, std::__1::shared_ptrDB::ReadLimiter const&, std::__1::shared_ptrDB::PageStorageSnapshot, bool) [tiflash+127975561]
0x7a8fbf0\tDB::PageReaderImplNormal::read(std::__1::vector<std::__1::pair<unsigned long, std::__1::vector<unsigned long, std::__1::allocator > >, std::__1::allocator<std::__1::pair<unsigned long, std::__1::vector<unsigned long, std::__1::allocator > > > > const&) const [tiflash+128515056]
0x7a8d3f2\tDB::PageReader::read(std::__1::vector<std::__1::pair<unsigned long, std::__1::vector<unsigned long, std::__1::allocator > >, std::__1::allocator<std::__1::pair<unsigned long, std::__1::vector<unsigned long, std::__1::allocator > > > > const&) const [tiflash+128504818]
0x7898948\tDB::DM::ColumnFileTiny::readFromDisk(DB::PageReader const&, std::__1::vector<DB::DM::ColumnDefine, std::__1::allocatorDB::DM::ColumnDefine > const&, unsigned long, unsigned long) const [tiflash+126454088]
0x7899124\tDB::DM::ColumnFileTiny::fillColumns(DB::PageReader const&, std::__1::vector<DB::DM::ColumnDefine, std::__1::allocatorDB::DM::ColumnDefine > const&, unsigned long, std::__1::vector<COWPtrDB::IColumn::immutable_ptrDB::IColumn, std::__1::allocator<COWPtrDB::IColumn::immutable_ptrDB::IColumn > >&) const [tiflash+126456100]
0x789a3f6\tDB::DM::ColumnFileTinyReader::readRows(std::__1::vector<COWPtrDB::IColumn::mutable_ptrDB::IColumn, std::__1::allocator<COWPtrDB::IColumn::mutable_ptrDB::IColumn > >&, unsigned long, unsigned long, DB::DM::RowKeyRange const*) [tiflash+126460918]
0x788fa13\tDB::DM::ColumnFileSetReader::readRows(std::__1::vector<COWPtrDB::IColumn::mutable_ptrDB::IColumn, std::__1::allocator<COWPtrDB::IColumn::mutable_ptrDB::IColumn > >&, unsigned long, unsigned long, DB::DM::RowKeyRange const*) [tiflash+126417427]
0x788f485\tDB::DM::ColumnFileSetReader::readPKVersion(unsigned long, unsigned long) [tiflash+126416005]
0x788fb51\tDB::DM::ColumnFileSetReader::getPlaceItems(std::__1::vector<DB::DM::BlockOrDelete, std::__1::allocatorDB::DM::BlockOrDelete >&, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long) [tiflash+126417745]
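For anyone triaging a similar tiflash.log, here is a minimal Python sketch that pulls the page ID, field index, file path, and the two checksum values out of lines like the one quoted above. The regular expression is modeled only on that single log line, so the exact wording in other TiFlash versions is an assumption, and `find_checksum_errors` is a hypothetical helper name, not part of any TiFlash tooling.

```python
import re

# Pattern modeled on the one log line quoted in this thread; other TiFlash
# versions may word the message differently (assumption).
CHECKSUM_RE = re.compile(
    r"Page\[(?P<page_id>\d+)\] field\[(?P<field>\d+)\] checksum not match, "
    r"broken file: (?P<path>\S+), "
    r"expected: (?P<expected>[0-9a-f]+), but: (?P<actual>[0-9a-f]+)"
)


def find_checksum_errors(lines):
    """Return one dict per log line that reports a page checksum mismatch."""
    hits = []
    for line in lines:
        m = CHECKSUM_RE.search(line)
        if m:
            hit = m.groupdict()
            hit["page_id"] = int(hit["page_id"])
            hit["field"] = int(hit["field"])
            hits.append(hit)
    return hits


if __name__ == "__main__":
    sample = (
        "DB::Exception: Page[167976] field[1] checksum not match, "
        "broken file: /data01/deploy/data/data/t_17630/log/page_56_0/page, "
        "expected: b0d1876a36fa4582, but: a3e70a0a2ff59556, e.what() = DB::Exception"
    )
    for hit in find_checksum_errors([sample]):
        print(hit["path"], hit["page_id"], hit["expected"], hit["actual"])
```

Grouping the results by `path` would show whether the errors are confined to one page file or spread across several.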

/var/log/messages shows the following errors:
systemd: tiflash-9000.service: main process exited, code=killed, status=6/ABRT
systemd: Unit tiflash-9000.service entered failed state.
systemd: tiflash-9000.service failed.
systemd: tiflash-9000.service holdoff time over, scheduling restart.
systemd: Stopped tiflash service.
systemd: Started tiflash service.
bash: sync …
bash: real#0110m0.103s
bash: user#0110m0.000s
bash: sys#0110m0.074s
bash: ok
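The systemd lines above show the process being killed with status=6/ABRT (signal 6 is SIGABRT) and then restarted. To gauge how often this happens, a small Python sketch can count such exits per unit. The pattern is based only on the message format shown above, and `count_abnormal_exits` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches systemd lines of the form quoted in this thread, e.g.
# "systemd: tiflash-9000.service: main process exited, code=killed, status=6/ABRT"
EXIT_RE = re.compile(
    r"systemd: (?P<unit>\S+\.service): main process exited, "
    r"code=(?P<code>\w+), status=(?P<status>\S+)"
)


def count_abnormal_exits(lines):
    """Count 'main process exited' events keyed by (unit, code, status)."""
    counts = Counter()
    for line in lines:
        m = EXIT_RE.search(line)
        if m:
            counts[(m["unit"], m["code"], m["status"])] += 1
    return counts


if __name__ == "__main__":
    log = [
        "systemd: tiflash-9000.service: main process exited, code=killed, status=6/ABRT",
        "systemd: Unit tiflash-9000.service entered failed state.",
        "systemd: tiflash-9000.service holdoff time over, scheduling restart.",
    ]
    for key, n in count_abnormal_exits(log).items():
        print(key, n)
```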

Subsequent Actions:
We took the TiFlash node offline. The service kept restarting during that process, so we eventually forced it offline. The node has since been brought back online.

| username: flow-PingCAP | Original post link

It seems that the data file is corrupted.
Is the folder /data01/deploy/data/data/t_17630/log/page_56_0/ still there? Could you please package and send it to us so we can analyze the file content?

| username: Hacker_ojLJ8Ndr | Original post link

There is no backup of the files, only logs.

| username: flow-PingCAP | Original post link

May I ask if there were any operations on the TiFlash cluster before this issue occurred? For example, a restart or something similar?

| username: Hacker_ojLJ8Ndr | Original post link

The previous restart issue was confirmed to be caused by the continuous profiling feature (see the topic "TiFlash restarts due to a rising file handle count" in the TiDB Q&A community). After disabling this feature, there have been no more restarts.

| username: Hacker_ojLJ8Ndr | Original post link

Doesn't setting the replica count to 0 clean up the previous data files? If it doesn't, how should we clean them up regularly? And how can we detect whether the data files are corrupted?

| username: flow-PingCAP | Original post link

It seems that continuous profiling (flame graph collection) caused the repeated restarts, which in turn triggered this issue and led to the file corruption. We will try to reproduce the problem internally.

After the TiFlash replica count is set to 0, the data is reclaimed gradually rather than deleted immediately. In extreme cases it may not be cleaned up completely (version 6.2 includes a further fix for this).
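Because reclamation is gradual, it can help to check what is still on disk. The sketch below is a rough illustration only: it assumes per-table data lives in directories named `t_<table id>` directly under the TiFlash data directory, which is inferred from the path `/data01/deploy/data/data/t_17630/...` in the log above and is not confirmed elsewhere in this thread. `leftover_table_dirs` is a hypothetical helper, it only reads sizes, and it never deletes anything.

```python
import os
import re

# Directory naming inferred from the path in the quoted log (assumption).
TABLE_DIR_RE = re.compile(r"^t_(\d+)$")


def leftover_table_dirs(data_dir):
    """Map table ID -> total bytes still on disk under data_dir/t_<id>."""
    sizes = {}
    for entry in os.scandir(data_dir):
        m = TABLE_DIR_RE.match(entry.name)
        if not (m and entry.is_dir()):
            continue
        total = 0
        # Sum the sizes of all regular files below this table directory.
        for root, _dirs, files in os.walk(entry.path):
            for name in files:
                total += os.path.getsize(os.path.join(root, name))
        sizes[int(m.group(1))] = total
    return sizes


if __name__ == "__main__":
    # Example path taken from the log in this thread; adjust to your deployment.
    data_dir = "/data01/deploy/data/data"
    if os.path.isdir(data_dir):
        for table_id, size in sorted(leftover_table_dirs(data_dir).items()):
            print(f"t_{table_id}: {size} bytes")
```

Running this a few times after setting the replica count to 0 would show whether the directory for the affected table is shrinking. Manual deletion of TiFlash data files is not something this thread recommends.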

Sorry, currently TiFlash cannot proactively detect file corruption, but we plan to add this feature in the future.

| username: flow-PingCAP | Original post link

Other users have also reported file corruption caused by IO errors in this issue: query raise the error of Unknown compression method: 200 when profiling in rhel 8 · Issue #5292 · pingcap/tiflash · GitHub

| username: Hacker_ojLJ8Ndr | Original post link

That report is indeed from our environment, but I wasn't the one who filed it~ :sweat:
