A large number of slow queries appear in TiDB, and TiFlash latency suddenly increases

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB出现大量慢查询Tiflash 延迟突然增大

| username: wzf0072

[TiDB Usage Environment] Production Environment
[TiDB Version] v6.5.2
[Reproduction Path] Operations performed that led to the issue
[Encountered Issue: Issue Phenomenon and Impact]


TiFlash Request duration: 8.33 min
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page

[Attachments: Screenshots/Logs/Monitoring]
The system has intermittent performance issues; the database is around 300 GB, and TiFlash latency suddenly increases.

| username: tidb狂热爱好者 | Original post link

Your cluster is being dragged down by slow SQL.

| username: 孤君888 | Original post link

Have you checked the top 10 SQL statements in the slow log? And looked at disk I/O utilization?
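
For reference, a minimal sketch of pulling the top 10 from the slow log via SQL, assuming the built-in information_schema.cluster_slow_query view (the one-hour window is only an example):

```sql
-- Top 10 statements by execution time from the cluster-wide slow log
SELECT time, instance, query_time, mem_max, query
FROM information_schema.cluster_slow_query
WHERE time > NOW() - INTERVAL 1 HOUR
ORDER BY query_time DESC
LIMIT 10;
```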

| username: wzf0072 | Original post link

I/O performance is normal, but the top-10 SQL list isn't representative: SQL that normally executes in under 1 second takes several minutes when the performance issue occurs.

| username: wzf0072 | Original post link

Stuck again.

| username: wzf0072 | Original post link

How can I intervene to make the TiFlash status return to normal quickly?
Even if I kill all the SQL running on the TiDB servers, TiFlash's Request Duration and Request Handle Duration do not drop immediately, and the latency still lasts for more than ten minutes.
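
For reference, a minimal sketch of locating and killing long-running sessions from SQL, assuming the standard information_schema.cluster_processlist view; the 60-second threshold and the connection ID are placeholders:

```sql
-- List sessions that have been executing for more than 60 seconds
SELECT instance, id, user, db, time, LEFT(info, 100) AS sql_prefix
FROM information_schema.cluster_processlist
WHERE command != 'Sleep' AND time > 60;

-- Kill one of them by connection ID (4265091213 is a placeholder)
KILL TIDB 4265091213;
```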

| username: tidb菜鸟一只 | Original post link

Are all the SQL queries related to TiFlash? Check the TiFlash logs to see if there’s anything there.

| username: zhanggame1 | Original post link

Check if there are any issues with the disk, such as a significant drop in performance, and also review the SQL execution plan to see if it’s correct.
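
For reference, a minimal sketch of checking whether a statement is actually served by TiFlash and where the time goes; the table name t is a placeholder:

```sql
-- Replace t with the actual table involved; the task column shows where each
-- operator runs: cop[tikv], batchCop[tiflash], or mpp[tiflash]
EXPLAIN SELECT COUNT(*) FROM t;

-- EXPLAIN ANALYZE actually executes the statement and reports per-operator time
EXPLAIN ANALYZE SELECT COUNT(*) FROM t;
```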

| username: wzf0072 | Original post link

The tiflash_error.log contains a large number of errors during the failure period: read tso: 441829476756619266 is smaller than tidb gc safe point
[2023/05/30 19:45:56.013 +08:00] [ERROR] [DAGDriver.cpp:203] ["DB Exception: query id: 4f0aff33-5217-49de-bd03-c2deaa74bf37, read tso: 441829476756619266 is smaller than tidb gc safe point: 441829639391019008: (while creating InputStreams from storage db_7644.t_8412, table_id: 8412)
  0x17221ce  DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) [tiflash+24256974]
      dbms/src/Common/Exception.h:46
  0x6887d3c  DB::checkReadTso(unsigned long, DB::TMTContext const&, DB::Context const&, DB::Context const&) [tiflash+109608252]
      dbms/src/Storages/StorageDeltaMerge.cpp:628
  0x6887fd9  DB::StorageDeltaMerge::parseMvccQueryInfo(DB::MvccQueryInfo const&, unsigned int, DB::Context const&, std::__1::shared_ptr<DB::Logger> const&) [tiflash+109608921]
      dbms/src/Storages/StorageDeltaMerge.cpp:648
  0x688a7de  DB::StorageDeltaMerge::read(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, DB::SelectQueryInfo const&, DB::Context const&, DB::QueryProcessingStage::Enum&, unsigned long, unsigned int, std::__1::shared_ptr<DB::DM::ScanContext> const&) [tiflash+109619166]
      dbms/src/Storages/StorageDeltaMerge.cpp:837
  0x6d71bcc  DB::DAGStorageInterpreter::executeImpl(DB::DAGPipeline&) [tiflash+114760652]
      dbms/src/Flash/Coprocessor/DAGStorageInterpreter.cpp:326
  0x6e778d4  DB::PhysicalTableScan::transformImpl(DB::DAGPipeline&, DB::Context&, unsigned long) [tiflash+115833044]
      dbms/src/Flash/Planner/plans/PhysicalTableScan.cpp:59
  0x6e74cca  DB::PhysicalProjection::transformImpl(DB::DAGPipeline&, DB::Context&, unsigned long) [tiflash+115821770]
      dbms/src/Flash/Planner/plans/PhysicalProjection.cpp:139
  0x6e3ffc1  DB::Planner::execute() [tiflash+115605441]
      dbms/src/Flash/Planner/Planner.cpp:36
  0x6d29a79  DB::(anonymous namespace)::executeDAG(DB::IQuerySource&, DB::Context&, bool) [tiflash+114465401]
      dbms/src/Flash/executeQuery.cpp:88
  0x6d29570  DB::executeQuery(DB::Context&, bool) [tiflash+114464112]
      dbms/src/Flash/executeQuery.cpp:108
  0x6d24ffa  DB::DAGDriver::execute() [tiflash+114446330]
      dbms/src/Flash/Coprocessor/DAGDriver.cpp:94
  0x6cde502  DB::CoprocessorHandler::execute() [tiflash+114156802]
      dbms/src/Flash/CoprocessorHandler.cpp:113
  0x6cfb2f0  std::__1::__function::__func<DB::FlashService::Coprocessor(grpc::ServerContext*, coprocessor::Request const*, coprocessor::Response*)::$_18, std::__1::allocator<DB::FlashService::Coprocessor(grpc::ServerContext*, coprocessor::Request const*, coprocessor::Response*)::$_18>, grpc::Status ()>::operator()() [tiflash+114275056]
      /usr/local/bin/…/include/c++/v1/__functional/function.h:345
  0x6cfac48  std::__1::__packaged_task_func<std::__1::function<grpc::Status ()>, std::__1::allocator<std::__1::function<grpc::Status ()> >, grpc::Status ()>::operator()() [tiflash+114273352]
      /usr/local/bin/…/include/c++/v1/future:1687
  0x6cfae05  std::__1::packaged_task<grpc::Status ()>::operator()() [tiflash+114273797]
      /usr/local/bin/…/include/c++/v1/future:1960
  0x6f0e314  void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, ThreadPool::ThreadPool(unsigned long, std::__1::function<void ()>)::$_0> >(void*) [tiflash+116450068]
      /usr/local/bin/…/include/c++/v1/thread:291
  0x7f4da726bea5  start_thread [libpthread.so.0+32421]
  0x7f4da6670b0d  clone [libc.so.6+1043213]"] [source=DAGDriver] [thread_id=1081]
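
This error indicates that the query's snapshot timestamp (the read tso) is older than the GC safe point, so the MVCC versions the query needs may already have been garbage-collected. A minimal sketch of inspecting the GC settings and decoding the TSOs from the error, assuming the built-in mysql.tidb table and the TIDB_PARSE_TSO() function:

```sql
-- Current GC-related settings and the current safe point
SELECT variable_name, variable_value
FROM mysql.tidb
WHERE variable_name LIKE 'tikv_gc%';

-- Convert the TSOs from the error message into wall-clock time
SELECT TIDB_PARSE_TSO(441829476756619266) AS read_tso,
       TIDB_PARSE_TSO(441829639391019008) AS gc_safe_point;
```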

| username: wzf0072 | Original post link

Disk I/O-related metrics are all normal. During the failure period, TiFlash was stuck, and SQL that uses TiFlash ran slowly.

| username: wzf0072 | Original post link

Regarding the "read tso: 441829476756619266 is smaller than tidb gc safe point" error, I have changed tidb_gc_life_time from 10 minutes to 1 hour and will observe whether the error still occurs.
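
For reference, a minimal sketch of how that change can be made from a SQL client (tidb_gc_life_time takes Go-style duration strings):

```sql
-- Extend how long old MVCC versions are retained before GC collects them
SET GLOBAL tidb_gc_life_time = '1h';

-- Verify the change
SHOW VARIABLES LIKE 'tidb_gc_life_time';
```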

| username: wzf0072 | Original post link

The issue occurred again last night, and there are some differences in the TiFlash error log: this time it reports "region does not exist".

| username: 小王同学Plus | Original post link

Do you have the complete TiFlash logs?
When the issue occurs intermittently, is it always the same business that is affected?

| username: wzf0072 | Original post link

The business is still in trial operation, so we cannot determine whether this is caused by the business workload. After changing tidb_gc_life_time from 10 minutes to 1 hour, the issue happened again on June 2nd. tidb_gc_life_time has now been set to 4 hours.

| username: wzf0072 | Original post link

Here is some raft-related information from the TiFlash-Summary monitoring panels.

| username: redgame | Original post link

Try optimizing a few of the slow SQL statements first and see how it goes.

| username: cy6301567 | Original post link

Check the slow SQL and the SQL with the highest memory usage at the time of the issue, and also look at TiKV CPU and memory usage.
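
For reference, a minimal sketch of finding the statements with the highest memory usage around the incident, again assuming the information_schema.cluster_slow_query view; the time window is a placeholder:

```sql
-- Statements that used the most memory during the incident window
SELECT time, instance, query_time, mem_max, query
FROM information_schema.cluster_slow_query
WHERE time BETWEEN '2023-06-02 19:00:00' AND '2023-06-02 21:00:00'
ORDER BY mem_max DESC
LIMIT 10;
```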

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.