TiFlash Store Offline

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tiflash store offline

| username: TiDBer_NEw0xuKK

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] 7.1
[TiDB Operator Version] 1.4
[K8s Version] 1.20

A few days ago, I noticed that the TiDB monitoring showed a store in the offline state.

Querying the database showed that the TiFlash store was offline.

However, TiFlash itself is running normally, and restarting the TiFlash pod did not help.

At the same time, creating a new TiFlash replica has not succeeded.

Please help analyze the reason, thank you.
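A way to cross-check the store state outside the monitoring dashboard is PD's HTTP API. The sketch below is only an illustration: it assumes that PD's `GET /pd/api/v1/stores` endpoint returns a JSON object with a `stores` list whose entries carry `store.id`, `store.address`, `store.labels`, and `store.state_name`, and that TiFlash stores are labeled `engine=tiflash`. The sample payload and its addresses are made up; only the store ID 49317 appears in this thread.

```python
import json


def non_up_tiflash_stores(stores_payload):
    """Return (id, address, state_name) for TiFlash stores that are not Up.

    `stores_payload` is assumed to be the decoded JSON body of
    GET /pd/api/v1/stores (field names are an assumption, see above).
    """
    result = []
    for entry in stores_payload.get("stores", []):
        store = entry.get("store", {})
        # Labels arrive as a list of {"key": ..., "value": ...} objects.
        labels = {item.get("key"): item.get("value")
                  for item in store.get("labels") or []}
        if labels.get("engine") != "tiflash":
            continue  # skip TiKV stores
        if store.get("state_name") != "Up":
            result.append(
                (store.get("id"), store.get("address"), store.get("state_name"))
            )
    return result


# Hypothetical payload for illustration only.
SAMPLE = json.loads("""
{
  "count": 2,
  "stores": [
    {"store": {"id": 1, "address": "tikv-0.example:20160",
               "state_name": "Up"}},
    {"store": {"id": 49317, "address": "tiflash-0.example:3930",
               "labels": [{"key": "engine", "value": "tiflash"}],
               "state_name": "Offline"}}
  ]
}
""")

print(non_up_tiflash_stores(SAMPLE))
```

Filtering on the `engine` label keeps TiKV stores out of the result, so an empty list means every TiFlash store that PD returned is Up.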

| username: tidb菜鸟一只 | Original post link

Check the error log of TiFlash by running kubectl logs advanced-tidb-tiflash-0 errorlog -n tidb-admin.

| username: redgame | Original post link

You can check the logs of the TiKV node where the offline Store is located to understand what issues occurred with that Store. Additionally, you can check if there were any hotspot issues caused by uneven data distribution during the downtime of that Store.

| username: TiDBer_NEw0xuKK | Original post link

The log has only these two entries. Could you please take a look? Is it because the memory limit is set too low?

[2023/06/27 00:44:20.992 +08:00] [ERROR] [MPPTask.cpp:469] ["task running meets error: Code: 0, e.displayText() = DB::TiFlashException: Memory limit (total) exceeded caused by 'out of memory quota for data computing' : would use 5.60 GiB for data computing (attempt to allocate chunk of 8388608 bytes), limit of memory for data computing: 5.60 GiB, e.what() = DB::TiFlashException, Stack trace:

0x1c116d1 DB::TiFlashException::TiFlashException(std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > const&, DB::TiFlashError const&) [tiflash+29431505]
dbms/src/Common/TiFlashException.h:250
0x1c10c5e MemoryTracker::alloc(long, bool) [tiflash+29428830]
dbms/src/Common/MemoryTracker.cpp:154
0x1c108d5 MemoryTracker::alloc(long, bool) [tiflash+29427925]
dbms/src/Common/MemoryTracker.cpp:165
0x1c108d5 MemoryTracker::alloc(long, bool) [tiflash+29427925]
dbms/src/Common/MemoryTracker.cpp:165
0x1c2098b Allocator::alloc(unsigned long, unsigned long) [tiflash+29493643]
dbms/src/Common/Allocator.cpp:68
0x7db2f7c DB::ColumnString::reserve(unsigned long) [tiflash+131805052]
dbms/src/Columns/ColumnString.cpp:285
0x7983941 DB::Aggregator::prepareBlocksAndFillSingleLevel(DB::AggregatedDataVariants&, bool) const [tiflash+127416641]
dbms/src/Interpreters/Aggregator.cpp:1581
0x79aad08 DB::MergingBuckets::getDataForSingleLevel() [tiflash+127577352]
dbms/src/Interpreters/Aggregator.cpp:2262
0x791c77c DB::MergingAndConvertingBlockInputStream::readImpl() [tiflash+126994300]
dbms/src/DataStreams/MergingAndConvertingBlockInputStream.h:39
0x764cff5 DB::IProfilingBlockInputStream::read(DB::PODArray<unsigned char, 4096ul, Allocator, 15ul, 16ul>&, bool) [tiflash+124047349]
dbms/src/DataStreams/IProfilingBlockInputStream.cpp:75
0x764cce5 DB::IProfilingBlockInputStream::read() [tiflash+124046565]
dbms/src/DataStreams/IProfilingBlockInputStream.cpp:43
0x7a49c05 DB::AggregatingBlockInputStream::readImpl() [tiflash+128228357]
dbms/src/DataStreams/AggregatingBlockInputStream.cpp:79
0x764cff5 DB::IProfilingBlockInputStream::read(DB::PODArray<unsigned char, 4096ul, Allocator, 15ul, 16ul>&, bool) [tiflash+124047349]
dbms/src/DataStreams/IProfilingBlockInputStream.cpp:75
0x764cce5 DB::IProfilingBlockInputStream::read() [tiflash+124046565]
dbms/src/DataStreams/IProfilingBlockInputStream.cpp:43
0x7644bfe DB::ExpressionBlockInputStream::readImpl() [tiflash+124013566]
dbms/src/DataStreams/ExpressionBlockInputStream.cpp:39
0x764cff5 DB::IProfilingBlockInputStream::read(DB::PODArray<unsigned char, 4096ul, Allocator, 15ul, 16ul>&, bool) [tiflash+124047349]
dbms/src/DataStreams/IProfilingBlockInputStream.cpp:75
0x764cce5 DB::IProfilingBlockInputStream::read() [tiflash+124046565]
dbms/src/DataStreams/IProfilingBlockInputStream.cpp:43
0x7644bfe DB::ExpressionBlockInputStream::readImpl() [tiflash+124013566]
dbms/src/DataStreams/ExpressionBlockInputStream.cpp:39
0x764cff5 DB::IProfilingBlockInputStream::read(DB::PODArray<unsigned char, 4096ul, Allocator, 15ul, 16ul>&, bool) [tiflash+124047349]
dbms/src/DataStreams/IProfilingBlockInputStream.cpp:75
0x764cce5 DB::IProfilingBlockInputStream::read() [tiflash+124046565]
dbms/src/DataStreams/IProfilingBlockInputStream.cpp:43
0x7644bfe DB::ExpressionBlockInputStream::readImpl() [tiflash+124013566]
dbms/src/DataStreams/ExpressionBlockInputStream.cpp:39
0x764cff5 DB::IProfilingBlockInputStream::read(DB::PODArray<unsigned char, 4096ul, Allocator, 15ul, 16ul>&, bool) [tiflash+124047349]
dbms/src/DataStreams/IProfilingBlockInputStream.cpp:75
0x764cce5 DB::IProfilingBlockInputStream::read() [tiflash+124046565]
dbms/src/DataStreams/IProfilingBlockInputStream.cpp:43
0x820a3cb DB::ExchangeSenderBlockInputStream::readImpl() [tiflash+136356811]
dbms/src/DataStreams/ExchangeSenderBlockInputStream.cpp:40
0x764cff5 DB::IProfilingBlockInputStream::read(DB::PODArray<unsigned char, 4096ul, Allocator, 15ul, 16ul>&, bool) [tiflash+124047349]
dbms/src/DataStreams/IProfilingBlockInputStream.cpp:75
0x764cce5 DB::IProfilingBlockInputStream::read() [tiflash+124046565]
dbms/src/DataStreams/IProfilingBlockInputStream.cpp:43
0x82b5284 DB::DataStreamExecutor::execute(DB::ResultHandler&&) [tiflash+137056900]
dbms/src/Flash/Executor/DataStreamExecutor.cpp:44
0x8269089 DB::MPPTask::runImpl() [tiflash+136745097]
dbms/src/Flash/Mpp/MPPTask.cpp:408
0x1d06d28 auto DB::wrapInvocable<std::__1::function<void ()> >(bool, std::__1::function<void ()>&&)::'lambda'()::operator()() [tiflash+30436648]
dbms/src/Common/wrapInvocable.h:36"] [source="MPP<query:<query_ts:1687797841423682071, local_query_id:3, server_id:3328292, start_ts:442446077336223751>,task_id:3>"] [thread_id=288]

[2023/06/27 00:44:20.992 +08:00] [WARN] [MPPTaskManager.cpp:155] ["Begin to abort query: <query_ts:1687797841423682071, local_query_id:3, server_id:3328292, start_ts:442446077336223751>, abort type: ONERROR, reason: From MPP<query:<query_ts:1687797841423682071, local_query_id:3, server_id:3328292, start_ts:442446077336223751>,task_id:3>: Code: 0, e.displayText() = DB::TiFlashException: Memory limit (total) exceeded caused by 'out of memory quota for data computing' : would use 5.60 GiB for data computing (attempt to allocate chunk of 8388608 bytes), limit of memory for data computing: 5.60 GiB, e.what() = DB::TiFlashException,"] [thread_id=288]

[2023/06/27 00:44:20.992 +08:00] [WARN] [MPPTaskManager.cpp:198] ["Remaining task in query <query_ts:1687797841423682071, local_query_id:3, server_id:3328292, start_ts:442446077336223751> are: MPP<query:<query_ts:1687797841423682071, local_query_id:3, server_id:3328292, start_ts:442446077336223751>,task_id:3> "] [thread_id=288]

[2023/06/27 00:44:20.992 +08:00] [WARN] [MPPTask.cpp:511] ["Begin abort task: MPP<query:<query_ts:1687797841423682071, local_query_id:3, server_id:3328292, start_ts:442446077336223751>,task_id:3>, abort type: ONERROR"] [source="MPP<query:<query_ts:1687797841423682071, local_query_id:3, server_id:3328292, start_ts:442446077336223751>,task_id:3>"] [thread_id=288]

[2023/06/27 00:44:20.992 +08:00] [WARN] [MPPTask.cpp:540] ["Finish abort task from running"] [source="MPP<query:<query_ts:1687797841423682071, local_query_id:3, server_id:3328292, start_ts:442446077336223751>,task_id:3>"] [thread_id=288]

[2023/06/27 00:44:20.992 +08:00] [WARN] [MPPTaskManager.cpp:210] ["Finish abort query: <query_ts:1687797841423682071, local_query_id:3, server_id:3328292, start_ts:442446077336223751>"] [thread_id=288]

[2023/06/27 00:44:20.993 +08:00] [WARN] [MPPTaskManager.cpp:155] ["Begin to abort query: <query_ts:1687797841423682071, local_query_id:3, server_id:3328292, start_ts:442446077336223751>, abort type: ONCANCELLATION, reason: Receive cancel request from TiDB"] [thread_id=294]

[2023/06/27 00:44:20.993 +08:00] [WARN] [MPPTaskManager.cpp:165] ["<query_ts:1687797841423682071, local_query_id:3, server_id:3328292, start_ts:442446077336223751> does not found in task manager, skip abort"] [thread_id=294]
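When the error log is long, the memory-quota failures can be pulled out mechanically. The following is a minimal sketch that matches only the message wording seen in the entries above ("would use ... GiB for data computing" and "limit of memory for data computing: ... GiB"); the function name and the regular expression are illustrative, and other TiFlash versions may word the message differently.

```python
import re

# Matches the quota message as it appears in the log entries above.
QUOTA_RE = re.compile(
    r"would use (?P<used>[\d.]+) GiB for data computing.*?"
    r"limit of memory for data computing: (?P<limit>[\d.]+) GiB"
)


def find_quota_errors(lines):
    """Return a list of (would_use_gib, limit_gib) for matching log lines."""
    hits = []
    for line in lines:
        match = QUOTA_RE.search(line)
        if match:
            hits.append((float(match.group("used")), float(match.group("limit"))))
    return hits


# Message text copied from the log above (stack trace omitted), plus a
# line that should not match.
sample_lines = [
    "Memory limit (total) exceeded caused by 'out of memory quota for data "
    "computing' : would use 5.60 GiB for data computing (attempt to allocate "
    "chunk of 8388608 bytes), limit of memory for data computing: 5.60 GiB",
    "Finish abort task from running",
]

print(find_quota_errors(sample_lines))
```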

| username: TiDBer_NEw0xuKK | Original post link

I see that the store where TiFlash is located is offline. Should I check the TiFlash logs?

| username: tidb菜鸟一只 | Original post link

It looks like a memory issue. Have you set the maximum memory limit for TiFlash? How much did you set? Try increasing it a bit.

| username: TiDBer_NEw0xuKK | Original post link

It was previously set to 7g. I changed it to 16g and tried again, and the error log is now empty.

But the store is still in the offline state.
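The 5.60 GiB limit in the earlier error is what a 7 GiB memory limit would give if TiFlash caps query memory at 80% of the memory it sees. The 0.8 ratio is an assumption here (it is believed to match the default of `max_memory_usage_for_all_queries` in recent TiFlash versions, but the actual value depends on the cluster configuration), and "7g"/"16g" are read as GiB. A small sketch of the arithmetic:

```python
def query_memory_quota_gib(memory_limit_gib, ratio=0.8):
    """Estimate the memory quota for data computing.

    `ratio` is an assumed fraction of the memory limit; check the actual
    TiFlash configuration before relying on it.
    """
    return memory_limit_gib * ratio


print(f"{query_memory_quota_gib(7):.2f} GiB")   # 5.60 GiB
print(f"{query_memory_quota_gib(16):.2f} GiB")  # 12.80 GiB
```

Under the same assumption, raising the limit to 16g lifts the quota to about 12.8 GiB, which is consistent with the quota error no longer appearing. It does not by itself explain the store state, which PD tracks separately.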

| username: tidb菜鸟一只 | Original post link

Take another look at the TiFlash logs: kubectl logs tidb-tiflash-0 tiflash -n tidb-admin.

| username: TiDBer_NEw0xuKK | Original post link

Could you please help take a look?

| username: tidb菜鸟一只 | Original post link

The logs look pretty normal. Is it still offline now?

| username: tidb菜鸟一只 | Original post link

Try manually bringing the store back online by setting its state to Up through the PD API: curl -X POST "http://127.0.0.1:2379/pd/api/v1/store/49317/state?state=Up"
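For anyone scripting this step, the same request can be built with Python's standard library. This is a sketch, not an official client: the endpoint path, the store ID 49317, and the PD address come from the curl command above, while the function names and the timeout are illustrative. Building the request is kept separate from sending it so the URL can be inspected first.

```python
import urllib.parse
import urllib.request


def build_store_state_request(pd_addr, store_id, state="Up"):
    """Build a POST request for /pd/api/v1/store/{id}/state?state=<state>."""
    query = urllib.parse.urlencode({"state": state})
    url = f"{pd_addr.rstrip('/')}/pd/api/v1/store/{int(store_id)}/state?{query}"
    return urllib.request.Request(url, method="POST")


def set_store_state(pd_addr, store_id, state="Up", timeout=10):
    """Send the request to PD and return (HTTP status, response body)."""
    req = build_store_state_request(pd_addr, store_id, state)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


# Only builds and prints the request; call set_store_state(...) to send it.
req = build_store_state_request("http://127.0.0.1:2379", 49317)
print(req.get_method(), req.full_url)
```

In a Kubernetes deployment, 127.0.0.1:2379 only works from inside a PD pod or through a port-forward to the PD service.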

| username: TiDBer_NEw0xuKK | Original post link

Following your instructions, the store has been restored. Thank you very much. It might have been taken offline at some point earlier.
