After upgrading to 6.1, TiFlash service is abnormal

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 升级6.1后,TiFlash服务异常

| username: qihuajun

【TiDB Usage Environment】Production Environment
【TiDB Version】5.7.25-TiDB-v6.1.0
【Problem Encountered】After upgrading to 6.1.0 yesterday, the TiFlash service failed early this morning: TiFlash ran out of memory, which caused the service to restart. After we restarted the cluster this morning, queries on TiFlash became very slow after running for a while. Suspecting that dynamic partition pruning on partitioned tables was the cause, we changed tidb_partition_prune_mode from dynamic to static, and it is currently running normally. We are still observing.
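
For reference, the change was roughly the following (a minimal sketch; the global form only takes effect for new sessions):

-- fall back from dynamic to static partition pruning for the current session
set @@session.tidb_partition_prune_mode = 'static';
-- or cluster-wide, so that new sessions pick it up:
set global tidb_partition_prune_mode = 'static';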

Additional information: the TiFlash and TiKV nodes are currently co-deployed on the same machines.

We are unable to determine the root cause of the problem. Please help analyze.

【Attachments】Related logs and monitoring (https://metricstool.pingcap.com/)
Partial logs during the service outage early in the morning:

[2022/06/23 03:05:35.467 +08:00] [ERROR] [DAGDriver.cpp:197] ["DAGDriver:DB Exception: Allocator: Cannot malloc 1.00 MiB., errno: 12, strerror: Cannot allocate memory

   0x1d272d3	StackTrace::StackTrace() [tiflash+30569171]
            	dbms/src/Common/StackTrace.cpp:23
   0x1d248d6	DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) [tiflash+30558422]
            	dbms/src/Common/Exception.h:41
   0x1defa2e	DB::SharedQueryBlockInputStream::waitThread() [tiflash+31390254]
            	dbms/src/DataStreams/SharedQueryBlockInputStream.h:169
   0x1deed0f	DB::SharedQueryBlockInputStream::readImpl() [tiflash+31386895]
            	dbms/src/DataStreams/SharedQueryBlockInputStream.h:124
   0x6688b8a	DB::IProfilingBlockInputStream::read(DB::PODArray<unsigned char, 4096ul, Allocator<false>, 15ul, 16ul>*&, bool) [tiflash+107514762]
            	dbms/src/DataStreams/IProfilingBlockInputStream.cpp:75
   0x66888c4	DB::IProfilingBlockInputStream::read() [tiflash+107514052]
            	dbms/src/DataStreams/IProfilingBlockInputStream.cpp:43
   0x738e696	DB::ParallelInputsProcessor<DB::UnionBlockInputStream<(DB::StreamUnionMode)0, false>::Handler, (DB::StreamUnionMode)0>::loop(unsigned long) [tiflash+121169558]
            	dbms/src/DataStreams/ParallelInputsProcessor.h:298
   0x738e1f7	DB::ParallelInputsProcessor<DB::UnionBlockInputStream<(DB::StreamUnionMode)0, false>::Handler, (DB::StreamUnionMode)0>::thread(unsigned long) [tiflash+121168375]
            	dbms/src/DataStreams/ParallelInputsProcessor.h:236
   0x1db172c	auto DB::wrapInvocable<std::__1::function<void ()> >(bool, std::__1::function<void ()>&&)::'lambda'()::operator()() [tiflash+31135532]
            	dbms/src/Common/wrapInvocable.h:36
   0x1db1895	std::__1::packaged_task<void ()>::operator()() [tiflash+31135893]
            	/usr/local/bin/../include/c++/v1/future:2089
   0x1da0e68	DB::DynamicThreadPool::executeTask(std::__1::unique_ptr<DB::IExecutableTask, std::__1::default_delete<DB::IExecutableTask> >&) [tiflash+31067752]
            	dbms/src/Common/DynamicThreadPool.cpp:101
   0x1da0d46	DB::DynamicThreadPool::dynamicWork(std::__1::unique_ptr<DB::IExecutableTask, std::__1::default_delete<DB::IExecutableTask> >) [tiflash+31067462]
            	dbms/src/Common/DynamicThreadPool.cpp:142
   0x1da29bd	auto std::__1::thread DB::ThreadFactory::newThread<void (DB::DynamicThreadPool::*)(std::__1::unique_ptr<DB::IExecutableTask, std::__1::default_delete<DB::IExecutableTask> >), DB::DynamicThreadPool*, std::__1::unique_ptr<DB::IExecutableTask, std::__1::default_delete<DB::IExecutableTask> > >(bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, void (DB::DynamicThreadPool::*&&)(std::__1::unique_ptr<DB::IExecutableTask, std::__1::default_delete<DB::IExecutableTask> >), DB::DynamicThreadPool*&&, std::__1::unique_ptr<DB::IExecutableTask, std::__1::default_delete<DB::IExecutableTask> >&&)::'lambda'(auto&&...)::operator()<DB::DynamicThreadPool*, std::__1::unique_ptr<DB::IExecutableTask, std::__1::default_delete<DB::IExecutableTask> > >(auto&&...) const [tiflash+31074749]
            	dbms/src/Common/ThreadFactory.h:47
   0x1da2799	void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::thread DB::ThreadFactory::newThread<void (DB::DynamicThreadPool::*)(std::__1::unique_ptr<DB::IExecutableTask, std::__1::default_delete<DB::IExecutableTask> >), DB::DynamicThreadPool*, std::__1::unique_ptr<DB::IExecutableTask, std::__1::default_delete<DB::IExecutableTask> > >(bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, void (DB::DynamicThreadPool::*&&)(std::__1::unique_ptr<DB::IExecutableTask, std::__1::default_delete<DB::IExecutableTask> >), DB::DynamicThreadPool*&&, std::__1::unique_ptr<DB::IExecutableTask, std::__1::default_delete<DB::IExecutableTask> >&&)::'lambda'(auto&&...), DB::DynamicThreadPool*, std::__1::unique_ptr<DB::IExecutableTask, std::__1::default_delete<DB::IExecutableTask> > > >(void*) [tiflash+31074201]
            	/usr/local/bin/../include/c++/v1/thread:291
  0x7f62c8ca4ea5	start_thread [libpthread.so.0+32421]
  0x7f62c87b79fd	__clone [libc.so.6+1042941]"] [thread_id=882]
[2022/06/23 03:05:35.474 +08:00] [ERROR] [Server.cpp:310] ["grpc:/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tics/contrib/grpc/src/cpp/thread_manager/thread_manager.cc, line number: 38, log msg : Could not create grpc_sync_server worker-thread"] [thread_id=59020]

Partial exception monitoring screenshots during slow queries in the morning:




| username: muliping | Original post link

  1. Can you send the information from tiup cluster display?
  2. Can you use Clinic to upload the monitoring and log information?
    PingCAP Clinic Quick Start Guide | PingCAP Documentation Center
| username: qihuajun | Original post link

Cluster information is as follows:

tiup is checking updates for component cluster ...
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.10.2/tiup-cluster display SQ-cluster
Cluster type:       tidb
Cluster name:       SQ-cluster
Cluster version:    v6.1.0
Deploy user:        tidb
SSH type:           builtin
Dashboard URL:      http://192.168.14.4:2379/dashboard
Grafana URL:        http://192.168.14.2:3000
ID                   Role          Host           Ports                            OS/Arch       Status  Data Dir                                   Deploy Dir
--                   ----          ----           -----                            -------       ------  --------                                   ----------
192.168.14.2:9093    alertmanager  192.168.14.2   9093/9094                        linux/x86_64  Up      /data/deploy/alertmanager/data             /data/deploy/alertmanager
192.168.14.2:3000    grafana       192.168.14.2   3000                             linux/x86_64  Up      -                                          /data/deploy
192.168.14.21:2379   pd            192.168.14.21  2379/2380                        linux/x86_64  Up      /data/deploy/pd/data                       /data/deploy/pd
192.168.14.22:2379   pd            192.168.14.22  2379/2380                        linux/x86_64  Up      /data/deploy/pd/data                       /data/deploy/pd
192.168.14.3:2379    pd            192.168.14.3   2379/2380                        linux/x86_64  Up|L    /data/deploy/data.pd                       /data/deploy
192.168.14.4:2379    pd            192.168.14.4   2379/2380                        linux/x86_64  Up|UI   /data/deploy/data.pd                       /data/deploy
192.168.14.5:2379    pd            192.168.14.5   2379/2380                        linux/x86_64  Up      /data/deploy/data.pd                       /data/deploy
192.168.14.2:9090    prometheus    192.168.14.2   9090/12020                       linux/x86_64  Up      /data/deploy/prometheus2.0.0.data.metrics  /data/deploy
192.168.14.21:4000   tidb          192.168.14.21  4000/10080                       linux/x86_64  Up      -                                          /data/deploy/tidb
192.168.14.22:4000   tidb          192.168.14.22  4000/10080                       linux/x86_64  Up      -                                          /data/deploy/tidb
192.168.14.3:4000    tidb          192.168.14.3   4000/10080                       linux/x86_64  Up      -                                          /data/deploy
192.168.14.4:4000    tidb          192.168.14.4   4000/10080                       linux/x86_64  Up      -                                          /data/deploy
192.168.14.5:4000    tidb          192.168.14.5   4000/10080                       linux/x86_64  Up      -                                          /data/deploy
192.168.14.23:9000   tiflash       192.168.14.23  9000/8123/3930/20170/20292/8234  linux/x86_64  Up      /data01/deploy/data                        /data01/deploy
192.168.14.24:9000   tiflash       192.168.14.24  9000/8123/3930/20170/20292/8234  linux/x86_64  Up      /data01/deploy/data                        /data01/deploy
192.168.14.25:9000   tiflash       192.168.14.25  9000/8123/3930/20170/20292/8234  linux/x86_64  Up      /data01/deploy/data                        /data01/deploy
192.168.14.26:9000   tiflash       192.168.14.26  9000/8123/3930/20170/20292/8234  linux/x86_64  Up      /data01/deploy/data                        /data01/deploy
192.168.14.27:9000   tiflash       192.168.14.27  9000/8123/3930/20170/20292/8234  linux/x86_64  Up      /data01/deploy/data                        /data01/deploy
192.168.14.28:9000   tiflash       192.168.14.28  9000/8123/3930/20170/20292/8234  linux/x86_64  Up      /data01/deploy/data                        /data01/deploy
192.168.14.10:20171  tikv          192.168.14.10  20171/20180                      linux/x86_64  Up      /data/deploy/data                          /data/deploy
192.168.14.11:20171  tikv          192.168.14.11  20171/20180                      linux/x86_64  Up      /data/deploy/data                          /data/deploy
192.168.14.23:20171  tikv          192.168.14.23  20171/20180                      linux/x86_64  Up      /data/deploy/data                          /data/deploy
192.168.14.24:20171  tikv          192.168.14.24  20171/20180                      linux/x86_64  Up      /data/deploy/data                          /data/deploy
192.168.14.25:20171  tikv          192.168.14.25  20171/20180                      linux/x86_64  Up      /data/deploy/data                          /data/deploy
192.168.14.26:20171  tikv          192.168.14.26  20171/20180                      linux/x86_64  Up      /data/deploy/data                          /data/deploy
192.168.14.27:20171  tikv          192.168.14.27  20171/20180                      linux/x86_64  Up      /data/deploy/data                          /data/deploy
192.168.14.28:20171  tikv          192.168.14.28  20171/20180                      linux/x86_64  Up      /data/deploy/data                          /data/deploy
192.168.14.6:20171   tikv          192.168.14.6   20171/20180                      linux/x86_64  Up      /data/deploy/data                          /data/deploy
192.168.14.7:20171   tikv          192.168.14.7   20171/20180                      linux/x86_64  Up      /data/deploy/data                          /data/deploy
192.168.14.8:20171   tikv          192.168.14.8   20171/20180                      linux/x86_64  Up      /data/deploy/data                          /data/deploy
192.168.14.9:20171   tikv          192.168.14.9   20171/20180                      linux/x86_64  Up      /data/deploy/data                          /data/deploy
Total nodes: 31

Complete logs are inconvenient to upload. Let me know if you need anything specific checked.

| username: qihuajun | Original post link

During the service anomaly, other error logs include:

[2022/06/23 09:40:53.708 +08:00] [ERROR] [DAGDriver.cpp:197] ["DAGDriver:DB Exception: Failed to write resp

       0x1d272d3\tStackTrace::StackTrace() [tiflash+30569171]
                \tdbms/src/Common/StackTrace.cpp:23
       0x1d248d6\tDB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) [tiflash+30558422]
                \tdbms/src/Common/Exception.h:41
       0x7b9a3f4\tDB::StreamWriter::write(tipb::SelectResponse&, unsigned short) [tiflash+129606644]
                \tdbms/src/Flash/Coprocessor/StreamWriter.h:60
       0x7b97d67\tvoid DB::StreamingDAGResponseWriter<std::__1::shared_ptr<DB::StreamWriter> >::batchWrite<true>() [tiflash+129596775]
                \tdbms/src/Flash/Coprocessor/StreamingDAGResponseWriter.cpp:314
       0x7370176\tDB::copyData(DB::IBlockInputStream&, DB::IBlockOutputStream&, std::__1::atomic<bool>*) [tiflash+121045366]
                \tdbms/src/DataStreams/copyData.cpp:78
       0x7af74d1\tDB::DAGDriver<true>::execute() [tiflash+128939217]
                \tdbms/src/Flash/Coprocessor/DAGDriver.cpp:142
       0x7ad6522\tDB::BatchCoprocessorHandler::execute() [tiflash+128804130]
                \tdbms/src/Flash/BatchCoprocessorHandler.cpp:78
       0x7ad2d3f\tstd::__1::__function::__func<DB::FlashService::BatchCoprocessor(grpc_impl::ServerContext*, coprocessor::BatchRequest const*, grpc_impl::ServerWriter<coprocessor::BatchResponse>*)::$_23, std::__1::allocator<DB::FlashService::BatchCoprocessor(grpc_impl::ServerContext*, coprocessor::BatchRequest const*, grpc_impl::ServerWriter<coprocessor::BatchResponse>*)::$_23>, grpc::Status ()>::operator()() [tiflash+128789823]
                \t/usr/local/bin/../include/c++/v1/__functional/function.h:345
       0x7ad5968\tstd::__1::__packaged_task_func<std::__1::function<grpc::Status ()>, std::__1::allocator<std::__1::function<grpc::Status ()> >, grpc::Status ()>::operator()() [tiflash+128801128]
                \t/usr/local/bin/../include/c++/v1/future:1687
       0x7ad5a65\tstd::__1::packaged_task<grpc::Status ()>::operator()() [tiflash+128801381]
                \t/usr/local/bin/../include/c++/v1/future:1960
       0x7c75e59\tThreadPool::worker() [tiflash+130506329]
                \tlibs/libcommon/src/ThreadPool.cpp:129
       0x7c76183\tvoid* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, ThreadPool::ThreadPool(unsigned long, std::__1::function<void ()>)::$_1> >(void*) [tiflash+130507139]
                \t/usr/local/bin/../include/c++/v1/thread:291
  0x7f5295810ea5\tstart_thread [libpthread.so.0+32421]
  0x7f52953239fd\t__clone [libc.so.6+1042941]"] [thread_id=18032]

[2022/06/23 09:40:51.204 +08:00] [ERROR] [<unknown>] ["pingcap.tikv: Failed4: Deadline Exceeded"] [thread_id=27897]
[2022/06/23 10:50:38.551 +08:00] [ERROR] [Server.cpp:310] ["grpc:/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tics/contrib/grpc/src/cpp/thread_manager/thread_manager.cc, line number: 38, log msg : Could not create grpc_sync_server worker-thread"] [thread_id=72014]

[2022/06/23 10:50:40.012 +08:00] [ERROR] [BaseDaemon.cpp:570] ["BaseDaemon:
       0x1ed2661\tfaultSignalHandler(int, siginfo_t*, void*) [tiflash+32319073]
                \tlibs/libdaemon/src/BaseDaemon.cpp:221
  0x7f5295818630\t<unknown symbol> [libpthread.so.0+63024]
  0x7f529525b387\tgsignal [libc.so.6+222087]
  0x7f529525ca78\t__GI_abort [libc.so.6+227960]
       0x88a843e\tgpr_malloc [tiflash+143295550]
                \tcontrib/grpc/src/core/lib/gpr/alloc.cc:34
       0x85b2f56\tgrpc_resource_user_alloc_slices(grpc_resource_user_slice_allocator*, unsigned long, unsigned long, grpc_slice_buffer*) [tiflash+140193622]
                \tcontrib/grpc/src/core/lib/iomgr/resource_quota.cc:1013
       0x85b5fba\ttcp_handle_read(void*, grpc_error*) [tiflash+140206010]
                \tcontrib/grpc/src/core/lib/iomgr/tcp_posix.cc:619
       0x86dc0f2\tread_action_locked(void*, grpc_error*) [tiflash+141410546]
                \tcontrib/grpc/src/core/ext/transport/chttp2/transport/chttp2_transport.cc:2590
       0x85a39dc\tgrpc_combiner_continue_exec_ctx() [tiflash+140130780]
                \tcontrib/grpc/src/core/lib/iomgr/combiner.cc:236
       0x85a623d\tgrpc_core::ExecCtx::Flush() [tiflash+140141117]
                \tcontrib/grpc/src/core/lib/iomgr/exec_ctx.cc:156
       0x85ac06b\tpollset_work(grpc_pollset*, grpc_pollset_worker**, long) [tiflash+140165227]
                \tcontrib/grpc/src/core/lib/iomgr/ev_epollex_linux.cc:1136
       0x85cca7e\tcq_pluck(grpc_completion_queue*, void*, gpr_timespec, void*) [tiflash+140298878]
                \tcontrib/grpc/src/core/lib/surface/completion_queue.cc:1281
       0x856ab11\tgrpc_impl::internal::ErrorMethodHandler<(grpc::StatusCode)8>::RunHandler(grpc::internal::MethodHandler::HandlerParameter const&) [tiflash+139897617]
                \tcontrib/grpc/include/grpcpp/impl/codegen/method_handler_impl.h:367
       0x8565260\tgrpc_impl::Server::SyncRequest::CallData::ContinueRunAfterInterception() [tiflash+139874912]
                \tcontrib/grpc/src/cpp/server/server_cc.cc:492
       0x8565080\tgrpc_impl::Server::SyncRequest::CallData::Run(std::__1::shared_ptr<grpc_impl::Server::GlobalCallbacks> const&, bool) [tiflash+139874432]
                \tcontrib/grpc/src/cpp/server/server_cc.cc:479
       0x8575bb7\tgrpc::ThreadManager::MainWorkLoop() [tiflash+139942839]
                \tcontrib/grpc/src/cpp/thread_manager/thread_manager.cc:211
       0x8576651\tgrpc::ThreadManager::WorkerThread::WorkerThread(grpc::ThreadManager*)::$_0::__invoke(void*) [tiflash+139945553]
                \tcontrib/grpc/src/cpp/thread_manager/thread_manager.cc:35
       0x88aca6c\tgrpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*, grpc_core::Thread::Options const&)::'lambda'(void*)::__invoke(void*) [tiflash+143313516]
                \tcontrib/grpc/src/core/lib/gprpp/thd_posix.cc:110
  0x7f5295810ea5\tstart_thread [libpthread.so.0+32421]"] [thread_id=72017]
[2022/06/23 10:50:40.012 +08:00] [ERROR] [Server.cpp:310] ["grpc:/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tics/contrib/grpc/src/cpp/thread_manager/thread_manager.cc, line number: 38, log msg : Could not create grpc_sync_server worker-thread"] [thread_id=52616] 

Among these, the errors "Could not create grpc_sync_server worker-thread" and "Deadline Exceeded" appear frequently.

| username: muliping | Original post link

Could you download and send these 6 monitoring dashboards?
https://metricstool.pingcap.com/#backup-with-dev-tools

  - overview
  - tidb
  - tikv-detail
  - TiFlash-Summary
  - TiFlash-Proxy-Summary
  - TiFlash-Proxy-Details

| username: qihuajun | Original post link

The above are the monitoring files.

To supplement the time information: Our TiFlash node started having issues around 3:30 AM, causing the server to restart. After the restart, due to firewall issues, the nodes were unable to provide services. We turned off the firewall around 8 AM and restarted the entire cluster. Around 10 AM, a large number of slow queries began to appear, and all queries routed through TiFlash could not be processed. We turned off dynamic partition pruning around 12 PM, and after killing the backlog of queries, the TiFlash queries returned to normal.

| username: qihuajun | Original post link

Another TiFlash node crashed early this morning and has been failing to restart. Here are the new error logs:

[2022/06/24 02:32:28.445 +08:00] [ERROR] [DAGDriver.cpp:197] ["DAGDriver:DB Exception: Cannot read from file /data01/deploy/data/data/t_32925/log/page_1_0/page., errno: 29, strerror: Illegal seek

       0x1d272d3 StackTrace::StackTrace() [tiflash+30569171]
                 dbms/src/Common/StackTrace.cpp:23
       0x1d248d6 DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) [tiflash+30558422]
                 dbms/src/Common/Exception.h:41
       0x1d302ba DB::throwFromErrno(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, int) [tiflash+30606010]
                 dbms/src/Common/Exception.cpp:59
       0x786632b void DB::PageUtil::readFile<std::__1::shared_ptr<DB::RandomAccessFile> >(std::__1::shared_ptr<DB::RandomAccessFile>&, long, char const*, unsigned long, std::__1::shared_ptr<DB::ReadLimiter> const&, bool) [tiflash+126247723]
                 dbms/src/Storages/Page/PageUtil.h:270
       0x79f7e8c DB::PS::V2::PageFile::Reader::read(std::__1::vector<DB::PS::V2::PageFile::Reader::FieldReadInfo, std::__1::allocator<DB::PS::V2::PageFile::Reader::FieldReadInfo> >&, std::__1::shared_ptr<DB::ReadLimiter> const&) [tiflash+127893132]
                 dbms/src/Storages/Page/V2/PageFile.cpp:1039
       0x7a0c089 DB::PS::V2::PageStorage::readImpl(unsigned long, std::__1::vector<std::__1::pair<unsigned long, std::__1::vector<unsigned long, std::__1::allocator<unsigned long> > >, std::__1::allocator<std::__1::pair<unsigned long, std::__1::vector<unsigned long, std::__1::allocator<unsigned long> > > > > const&, std::__1::shared_ptr<DB::ReadLimiter> const&, std::__1::shared_ptr<DB::PageStorageSnapshot>, bool) [tiflash+127975561]
                 dbms/src/Storages/Page/V2/PageStorage.cpp:783
       0x7a8fbf0 DB::PageReaderImplNormal::read(std::__1::vector<std::__1::pair<unsigned long, std::__1::vector<unsigned long, std::__1::allocator<unsigned long> > >, std::__1::allocator<std::__1::pair<unsigned long, std::__1::vector<unsigned long, std::__1::allocator<unsigned long> > > > > const&) const [tiflash+128515056]
                 dbms/src/Storages/Page/PageStorage.cpp:113
       0x7a8d3f2 DB::PageReader::read(std::__1::vector<std::__1::pair<unsigned long, std::__1::vector<unsigned long, std::__1::allocator<unsigned long> > >, std::__1::allocator<std::__1::pair<unsigned long, std::__1::vector<unsigned long, std::__1::allocator<unsigned long> > > > > const&) const [tiflash+128504818]
                 dbms/src/Storages/Page/PageStorage.cpp:415
       0x7898948 DB::DM::ColumnFileTiny::readFromDisk(DB::PageReader const&, std::__1::vector<DB::DM::ColumnDefine, std::__1::allocator<DB::DM::ColumnDefine> > const&, unsigned long, unsigned long) const [tiflash+126454088]
                 dbms/src/Storages/DeltaMerge/ColumnFile/ColumnFileTiny.cpp:79
       0x7899124 DB::DM::ColumnFileTiny::fillColumns(DB::PageReader const&, std::__1::vector<DB::DM::ColumnDefine, std::__1::allocator<DB::DM::ColumnDefine> > const&, unsigned long, std::__1::vector<COWPtr<DB::IColumn>::immutable_ptr<DB::IColumn>, std::__1::allocator<COWPtr<DB::IColumn>::immutable_ptr<DB::IColumn> > >&) const [tiflash+126456100]
                 dbms/src/Storages/DeltaMerge/ColumnFile/ColumnFileTiny.cpp:115
       0x789a3f6 DB::DM::ColumnFileTinyReader::readRows(std::__1::vector<COWPtr<DB::IColumn>::mutable_ptr<DB::IColumn>, std::__1::allocator<COWPtr<DB::IColumn>::mutable_ptr<DB::IColumn> > >&, unsigned long, unsigned long, DB::DM::RowKeyRange const*) [tiflash+126460918]
                 dbms/src/Storages/DeltaMerge/ColumnFile/ColumnFileTiny.cpp:237
       0x788fa13 DB::DM::ColumnFileSetReader::readRows(std::__1::vector<COWPtr<DB::IColumn>::mutable_ptr<DB::IColumn>, std::__1::allocator<COWPtr<DB::IColumn>::mutable_ptr<DB::IColumn> > >&, unsigned long, unsigned long, DB::DM::RowKeyRange const*) [tiflash+126417427]
                 dbms/src/Storages/DeltaMerge/ColumnFile/ColumnFileSetReader.cpp:160
       0x78c54ab DB::DM::DeltaValueReader::readRows(std::__1::vector<COWPtr<DB::IColumn>::mutable_ptr<DB::IColumn>, std::__1::allocator<COWPtr<DB::IColumn>::mutable_ptr<DB::IColumn> > >&, unsigned long, unsigned long, DB::DM::RowKeyRange const*) [tiflash+126637227]
                 dbms/src/Storages/DeltaMerge/Delta/Snapshot.cpp:117
       0x77ebc05 void DB::DM::DeltaMergeBlockInputStream<DB::DM::DeltaValueReader, DB::DM::DTCompactedEntries<55ul, 20ul, 3ul>::Iterator, false>::next<false, false>(std::__1::vector<COWPtr<DB::IColumn>::mutable_ptr<DB::IColumn>, std::__1::allocator<COWPtr<DB::IColumn>::mutable_ptr<DB::IColumn> > >&, unsigned long&) [tiflash+125746181]
                 dbms/src/Storages/DeltaMerge/DeltaMerge.h:0
       0x77ea997 DB::DM::DeltaMergeBlockInputStream<DB::DM::DeltaValueReader, DB::DM::DTCompactedEntries<55ul, 20ul, 3ul>::Iterator, false>::doRead() [tiflash+125741463]
                 dbms/src/Storages/DeltaMerge/DeltaMerge.h:242
       0x77ea55c DB::DM::DeltaMergeBlockInputStream<DB::DM::DeltaValueReader, DB::DM::DTCompactedEntries<55ul, 20ul, 3ul>::Iterator, false>::read() [tiflash+125740380]
                 dbms/src/Storages/DeltaMerge/DeltaMerge.h:174
       0x77ec9b9 DB::DM::DMRowKeyFilterBlockInputStream<true>::read() [tiflash+125749689]
                 dbms/src/Storages/DeltaMerge/RowKeyFilter.h:178
       0x77e17d5 DB::DM::readNextBlock(std::__1::shared_ptr<DB::IBlockInputStream> const&) [tiflash+125704149]
                 dbms/src/Storages/DeltaMerge/DeltaMergeHelpers.h:253
       0x7804405 DB::DM::DMVersionFilterBlockInputStream<0>::initNextBlock() [tiflash+125846533]
                 dbms/src/Storages/DeltaMerge/DMVersionFilterBlockInputStream.h:133
       0x7803505 DB::DM::DMVersionFilterBlockInputStream<0>::read(DB::PODArray<unsigned char, 4096ul, Allocator<false>, 15ul, 16ul>*&, bool) [tiflash+125842693]
                 dbms/src/Storages/DeltaMerge/DMVersionFilterBlockInputStream.cpp:53
       0x77b06d3 DB::DM::DMSegmentThreadInputStream::readImpl(DB::PODArray<unsigned char, 4096ul, Allocator<false>, 15ul, 16ul>*&, bool) [tiflash+125503187]
                 dbms/src/Storages/DeltaMerge/DMSegmentThreadInputStream.h:124
       0x77b05c8 DB::DM::DMSegmentThreadInputStream::readImpl() [tiflash+125502920]
                 dbms/src/Storages/DeltaMerge/DMSegmentThreadInputStream.h:85
       0x6688b8a DB::IProfilingBlockInputStream::read(DB::PODArray<unsigned char, 4096ul, Allocator<false>, 15ul, 16ul>*&, bool) [tiflash+107514762]
                 dbms/src/DataStreams/IProfilingBlockInputStream.cpp:75
       0x66888c4 DB::IProfilingBlockInputStream::read() [tiflash+107514052]
                 dbms/src/DataStreams/IProfilingBlockInputStream.cpp:43
       0x739e49b DB::ExpressionBlockInputStream::readImpl() [tiflash+121234587]
                 dbms/src/DataStreams/ExpressionBlockInputStream.cpp:50
       0x1dc8c32 DB::IProfilingBlockInputStream::readImpl(DB::PODArray<unsigned char, 4096ul, Allocator<false>, 15ul, 16ul>*&, bool) [tiflash+31231026]
                 dbms/src/DataStreams/IProfilingBlockInputStream.h:232
       0x6688a0e DB::IProfilingBlockInputStream::read(DB::PODArray<unsigned char, 4096ul, Allocator<false>, 15ul, 16ul>*&, bool) [tiflash+107514382]
                 dbms/src/DataStreams/IProfilingBlockInputStream.cpp:73
       0x73a0ef0 DB::FilterBlockInputStream::readImpl() [tiflash+121245424]
                 dbms/src/DataStreams/FilterBlockInputStream.cpp:91
       0x6688b8a DB::IProfilingBlockInputStream::read(DB::PODArray<unsigned char, 4096ul, Allocator<false>, 15ul, 16ul>*&, bool) [tiflash+107514762]
                 dbms/src/DataStreams/IProfilingBlockInputStream.cpp:75
       0x66888c4 DB::IProfilingBlockInputStream::read() [tiflash+107514052]
                 dbms/src/DataStreams/IProfilingBlockInputStream.cpp:43
       0x739e49b DB::ExpressionBlockInputStream::readImpl() [tiflash+121234587]
                 dbms/src/DataStreams/ExpressionBlockInputStream.cpp:50
       0x6688b8a DB::IProfilingBlockInputStream::read(DB::PODArray<unsigned char, 4096ul, Allocator<false>, 15ul, 16ul>*&, bool) [tiflash+107514762]
                 dbms/src/DataStreams/IProfilingBlockInputStream.cpp:75"] [thread_id=47105]

[2022/06/24 05:46:25.680 +08:00] [ERROR] [Exception.cpp:85] ["DB::DM::DeltaMergeStore::DeltaMergeStore(DB::Context &, bool, const DB::String &, const DB::String &, DB::TableID, const DB::DM::ColumnDefines &, const DB::DM::ColumnDefine &, bool, size_t, const DB::DM::DeltaMergeStore::Settings &):Code: 49, e.displayText() = DB::Exception: PageFile binary version not match {MergingReader of PageFile_7_0, type: Formal, sequence no: 0, meta offset: 25098, data offset: 9991564} [unknown_version=0] [file=/data01/deploy/data/data/t_15303/meta/page_7_0/meta], e.what() = DB::Exception, Stack trace:


       0x1d272d3 StackTrace::StackTrace() [tiflash+30569171]
                 dbms/src/Common/StackTrace.cpp:23
       0x1d248d6 DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) [tiflash+30558422]
                 dbms/src/Common/Exception.h:41
       0x79f53a3 DB::PS::V2::PageFile::MetaMergingReader::moveNext(unsigned int*) [tiflash+127882147]
                 dbms/src/Storages/Page/V2/PageFile.cpp:585
       0x7a06501 DB::PS::V2::PageStorage::restore() [tiflash+127952129]
                 dbms/src/Storages/Page/V2/PageStorage.cpp:289
       0x781ec0f DB::DM::StoragePool::restore() [tiflash+125955087]
                 dbms/src/Storages/DeltaMerge/StoragePool.cpp:379
       0x7788ca1 DB::DM::DeltaMergeStore::DeltaMergeStore(DB::Context&, bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, long, std::__1::vector<DB::DM::ColumnDefine, std::__1::allocator<DB::DM::ColumnDefine> > const&, DB::DM::ColumnDefine const&, bool, unsigned long, DB::DM::DeltaMergeStore::Settings const&) [tiflash+125340833]
                 dbms/src/Storages/DeltaMerge/DeltaMergeStore.cpp:243
       0x771283d DB::StorageDeltaMerge::getAndMaybeInitStore() [tiflash+124856381]
                 dbms/src/Storages/StorageDeltaMerge.cpp:1545
       0x77187fe DB::StorageDeltaMerge::getSchemaSnapshotAndBlockForDecoding(DB::TableDoubleLockHolder<false> const&, bool) [tiflash+124880894]
                 dbms/src/Storages/StorageDeltaMerge.cpp:910
       0x7916b6d DB::writeRegionDataToStorage(DB::Context&, DB::RegionPtrWithBlock const&, std::__1::vector<std::__1::tuple<DB::RawTiDBPK, unsigned char, unsigned long, std::__1::shared_ptr<DB::StringObject<false> const> >, std::__1::allocator<std::__1::tuple<DB::RawTiDBPK, unsigned char, unsigned long, std::__1::shared_ptr<DB::StringObject<false> const> > > >&, Poco::Logger*)::$_0::operator()(bool) const [tiflash+126970733]
                 dbms/src/Storages/Transaction/PartitionStreams.cpp:123
       0x7912f32 DB::writeRegionDataToStorage(DB::Context&, DB::RegionPtrWithBlock const&, std::__1::vector<std::__1::tuple<DB::RawTiDBPK, unsigned char, unsigned long, std::__1::shared_ptr<DB::StringObject<false> const> >, std::__1::allocator<std::__1::tuple<DB::RawTiDBPK, unsigned char, unsigned long, std::__1::shared_ptr<DB::StringObject<false> const> > > >&, Poco::Logger*) [tiflash+126955314]
                 dbms/src/Storages/Transaction/PartitionStreams.cpp:169
       0x7912c9c DB::RegionTable::writeBlockByRegion(DB::Context&, DB::RegionPtrWithBlock const&, std::__1::vector<std::__1::tuple<DB::RawTiDBPK, unsigned char, unsigned long, std::__1::shared_ptr<DB::StringObject<false> const> >, std::__1::allocator<std::__1::tuple<DB::RawTiDBPK, unsigned char, unsigned long, std::__1::shared_ptr<DB::StringObject<false> const> > > >&, Poco::Logger*, bool) [tiflash+126954652]
                 dbms/src/Storages/Transaction/PartitionStreams.cpp:348
       0x79351ff DB::Region::handleWriteRaftCmd(DB::WriteCmdsView const&, unsigned long, unsigned long, DB::TMTContext&) [tiflash+127095295]
                 dbms/src/Storages/Transaction/Region.cpp:712
       0x7901910 DB::KVStore::handleWriteRaftCmd(DB::WriteCmdsView const&, unsigned long, unsigned long, unsigned long, DB::TMTContext&) [tiflash+126884112]
                 dbms/src/Storages/Transaction/KVStore.cpp:285
       0x791b822 HandleWriteRaftCmd [tiflash+126990370]
                 dbms/src/Storages/Transaction/ProxyFFI.cpp:92
  0x7f82f6f0a6e0 raftstore::store::fsm::apply::ApplyDelegate$LT$EK$GT$::process_raft_cmd::h4c8bba241d5ee4d4 [libtiflash_proxy.so+28567264]
  0x7f82f6f07d6c raftstore::store::fsm::apply::ApplyDelegate$LT$EK$GT$::handle_raft_committed_entries::h478c4bad834ddec7 [libtiflash_proxy.so+28556652]
  0x7f82f6f27f91 raftstore::store::fsm::apply::ApplyFsm$LT$EK$GT$::handle_apply::he4c0b1137ae7cad9 [libtif
| username: muliping | Original post link

The monitoring overview needs to be sent again, the previous one had no data.
Is the current issue about resolving the abnormal crash and restart of TiFlash?

| username: qihuajun | Original post link

Yes, we want to figure out why TiFlash keeps having issues, as we are worried it might crash at any time.

By the way, I have some questions about TiFlash configuration:

  1. I noticed that the values of these TiFlash configurations are quite conservative. We haven't set these parameters ourselves. Are they calculated automatically by the system? Is there anything we can optimize? (One way to inspect the current values is shown after these questions.)

  2. Why do many of TiFlash's configuration names include "proxy"? What does this proxy refer to? Are these proxy-related configurations core configurations of TiFlash?

    tiflash 192.168.14.25:3930 raftstore-proxy.readpool.unified.max-thread-count 4
    tiflash 192.168.14.28:3930 raftstore-proxy.memory-usage-limit 29471976106
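
For reference, values like these can be inspected over SQL from any TiDB node (the name filters below are only examples):

show config where type = 'tiflash' and name like '%raftstore-proxy%';
show config where type = 'tiflash' and name like '%memory%';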

| username: tidb狂热爱好者 | Original post link

Is MPP enabled?

| username: qihuajun | Original post link

tidb_allow_mpp = on
tidb_enforce_mpp = off
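
A rough way to check whether a given query actually runs through MPP (the table name t below is only an example) is to look at its plan; MPP execution shows ExchangeSender/ExchangeReceiver operators with the mpp[tiflash] task type:

explain select /*+ read_from_storage(tiflash[t]) */ count(*) from t;
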
| username: JaySon-Huang | Original post link

@qihuajun I would like to know the following information:

  1. From which specific version was the upgrade to 6.1.0 made, and when exactly did the upgrade to 6.1.0 occur?
  2. For the 6 machines (192.168.14.23~28) that are mixed with TiKV and TiFlash, what are their configurations?
  3. “After turning off dynamic partition pruning and killing the backlog queries, TiFlash queries returned to normal.” Based on this description, it is possible that dynamic partition pruning of 6.1.0 was used, but the statistics were not manually updated, leading to changes in the execution plan on TiFlash, which in turn caused OOM issues.

Did you enable dynamic partition pruning only after upgrading to 6.1.0? After enabling it, did you manually update the statistics?
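
As a minimal sketch of what manually updating the statistics looks like for a partitioned table (the table name t is only an example): in dynamic prune mode, ANALYZE collects the table-level global statistics that the optimizer relies on.

set @@session.tidb_partition_prune_mode = 'dynamic';
analyze table t;   -- collect global statistics for the partitioned table
show warnings;     -- check for warnings about missing or pseudo statistics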


Additionally, from the monitoring I see that TiKV consistently uses around 110 GB of memory, and the TiFlash process restarts when its memory reaches approximately 30 GB. If the machine has around 150 GB of memory and TiKV and TiFlash are co-deployed on it, allocating 110 GB to TiKV leaves too little memory for TiFlash, so this resource allocation is not very reasonable. Complex AP queries will also compete with the co-located TiKV for CPU, memory, and I/O, affecting the TP workload. It is generally recommended to deploy TiKV and TiFlash on separate machines.

| username: JaySon-Huang | Original post link

Has TiFlash ever restarted due to OOM before this June 24th error occurred? Is this node still in a state of continuous restart errors?

| username: qihuajun | Original post link

  1. Upgraded from 5.4.1 to 6.1.0
  2. Machine configuration: CPU 64 cores, memory 320G, NVME SSD for storage, with NUMA resources half allocated to TiKV and half to TiFlash.
  3. Dynamic pruning was only enabled after upgrading to 6.1.0. We manually updated the statistics for two tables, but after the entire TiFlash cluster’s queries slowed down, those two tables couldn’t be queried either.

The total memory is 320G, with half allocated to TiKV and half to TiFlash, but TiFlash only uses up to a little over 30G.

| username: qihuajun | Original post link

This has never happened before. We forced this node offline, cleared the data, and brought it back online. So far, everything is normal.

| username: JaySon-Huang | Original post link

From the log you provided on June 24, it appears that the data on the disk of this instance has been corrupted. Upon detection after startup, it reported an error and exited.

Without the live environment and preceding or following logs, it is difficult to determine the exact cause of the disk data corruption. If it happens again, you can retain the data and logs to see if further analysis is possible.

| username: JaySon-Huang | Original post link

The total memory is 320G, but TiFlash uses at most a little over 30G, which seems quite strange. Can you confirm how you configured the NUMA resource allocation?

| username: qihuajun | Original post link

It was specified via numa_node, following the configuration example.

| username: qihuajun | Original post link

Please see my previous reply.

| username: JaySon-Huang | Original post link

Could you please confirm the changes in the query plan for SQL involving partitioned tables when dynamic pruning is enabled and disabled respectively? For example, whether there is an unreasonable join order, or what specific warnings occur after executing explain.

set @@session.tidb_partition_prune_mode = 'static';
explain <your sql>;
set @@session.tidb_partition_prune_mode = 'dynamic';
explain <your sql>;
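
To capture the warnings mentioned above, running show warnings; immediately after each explain lists them, for example:

set @@session.tidb_partition_prune_mode = 'dynamic';
explain <your sql>;
show warnings;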