Read: Connection reset by peer

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: read: connection reset by peer

| username: wzf0072

[TiDB Usage Environment] Production Environment
[TiDB Version]: v6.5.2
[Reproduction Path] What operations were performed when the issue occurred
[Encountered Issue: Problem Phenomenon and Impact]
Application error:

Cause: java.sql.SQLException: rpc error: code = Unavailable desc = error reading from server: read tcp 172.16.89.80:53590->172.16.89.85:3930: read: connection reset by peer

; uncategorized SQLException; SQL state [HY000]; error code [1105]; rpc error: code = Unavailable desc = error reading from server: read tcp 172.16.89.80:53590->172.16.89.85:3930: read: connection reset by peer; nested exception is java.sql.SQLException: rpc error: code = Unavailable desc = error reading from server: read tcp 172.16.89.80:53590->172.16.89.85:3930: read: connection reset by peer
Node description: 172.16.89.81 (TiDB Server), 172.16.89.85 (TiFlash)
[Attachments: Screenshots/Logs/Monitoring]

TiDB Server (172.16.89.80) error logs (many entries like these):
[2023/10/12 08:40:00.388 +08:00] [ERROR] [ddl_tiflash_api.go:396] ["get tiflash sync progress failed"] [error="Get \"http://172.16.89.85:20292/tiflash/sync-status/21743\": dial tcp 172.16.89.85:20292: connect: connection refused"] [tableID=21743] [IsPartition=false]
[2023/10/12 08:40:00.389 +08:00] [ERROR] [tiflash_manager.go:93] ["Fail to get peer status from TiFlash."] [tableID=21743]
[2023/10/12 08:40:00.390 +08:00] [ERROR] [tiflash_manager.go:119] ["Fail to get peer count from TiFlash."] [tableID=21743]
[2023/10/12 08:40:00.390 +08:00] [ERROR] [ddl_tiflash_api.go:396] ["get tiflash sync progress failed"] [error="Get \"http://172.16.89.85:20292/tiflash/sync-status/21743\": dial tcp 172.16.89.85:20292: connect: connection refused"] [tableID=21743] [IsPartition=false]
[2023/10/12 08:40:00.391 +08:00] [ERROR] [tiflash_manager.go:93] ["Fail to get peer status from TiFlash."] [tableID=21743]
[2023/10/12 08:40:00.391 +08:00] [ERROR] [tiflash_manager.go:119] ["Fail to get peer count from TiFlash."] [tableID=21743]

TiFlash node (172.16.89.85) logs (tiflash_error.log):
[2023/10/12 08:39:56.457 +08:00] [WARN] [CoprocessorHandler.cpp:143] ["RegionException: region 531389, message: NOT_FOUND"] [source=CoprocessorHandler] [thread_id=98]
[2023/10/12 08:39:56.457 +08:00] [WARN] [CoprocessorHandler.cpp:143] ["RegionException: region 531997, message: NOT_FOUND"] [source=CoprocessorHandler] [thread_id=84]
[2023/10/12 08:39:56.457 +08:00] [WARN] [CoprocessorHandler.cpp:143] ["RegionException: region 533871, message: NOT_FOUND"] [source=CoprocessorHandler] [thread_id=81]
[2023/10/12 08:39:56.458 +08:00] [WARN] [CoprocessorHandler.cpp:143] ["RegionException: region 533055, message: NOT_FOUND"] [source=CoprocessorHandler] [thread_id=83]
[2023/10/12 08:41:17.801 +08:00] [WARN] [ExchangeReceiver.cpp:210] ["MakeReader fail. retry time: 0"] [source="MPPquery:444877052576792581:22,task ExchangeReceiver_339 tunnel20+22"] [thread_id=341]
[2023/10/12 08:41:18.816 +08:00] [WARN] [ExchangeReceiver.cpp:210] ["MakeReader fail. retry time: 1"] [source="MPPquery:444877052576792581:22,task ExchangeReceiver_339 tunnel20+22"] [thread_id=341]
[2023/10/12 08:41:20.236 +08:00] [WARN] [ExchangeReceiver.cpp:210] ["MakeReader fail. retry time: 2"] [source="MPPquery:444877052576792581:22,task ExchangeReceiver_339 tunnel20+22"] [thread_id=341]
[2023/10/12 08:41:23.080 +08:00] [WARN] [ExchangeReceiver.cpp:210] ["MakeReader fail. retry time: 3"] [source="MPPquery:444877052576792581:22,task ExchangeReceiver_339 tunnel20+22"] [thread_id=341]
[2023/10/12 08:41:25.593 +08:00] [WARN] [ExchangeReceiver.cpp:210] ["MakeReader fail. retry time: 4"] [source="MPPquery:444877052576792581:22,task ExchangeReceiver_339 tunnel20+22"] [thread_id=341]
[2023/10/12 08:41:26.621 +08:00] [WARN] [ExchangeReceiver.cpp:210] ["MakeReader fail. retry time: 5"] [source="MPPquery:444877052576792581:22,task ExchangeReceiver_339 tunnel20+22"] [thread_id=341]
[2023/10/12 08:41:27.634 +08:00] [WARN] [ExchangeReceiver.cpp:210] ["MakeReader fail. retry time: 6"] [source="MPPquery:444877052576792581:22,task ExchangeReceiver_339 tunnel20+22"] [thread_id=341]
[2023/10/12 08:41:27.885 +08:00] [WARN] [MPPTaskManager.cpp:152] ["Begin to abort query: 444877052576792581, abort type: ONCANCELLATION, reason: Receive cancel request from TiDB"] [thread_id=97]
[2023/10/12 08:41:27.885 +08:00] [WARN] [MPPTaskManager.cpp:195] ["Remaining task in query 444877052576792581 are: MPPquery:444877052576792581:3,task MPPquery:444877052576792581:6,task MPPquery:444877052576792581:16,task MPPquery:444877052576792581:22,task MPPquery:444877052576792581:19,task MPPquery:444877052576792581:9,task MPPquery:444877052576792581:5,task MPPquery:444877052576792581:18,task MPPquery:444877052576792581:21,task MPPquery:444877052576792581:1,task MPPquery:444877052576792581:13,task "] [thread_id=97]
[2023/10/12 08:41:27.885 +08:00] [WARN] [MPPTask.cpp:471] ["Begin abort task: MPPquery:444877052576792581:3,task, abort type: ONCANCELLATION"] [source=MPPquery:444877052576792581:3,task] [thread_id=97]
[2023/10/12 08:41:27.885 +08:00] [WARN] [MPPTask.cpp:500] ["Finish abort task from running"] [source=MPPquery:444877052576792581:3,task] [thread_id=97]
[2023/10/12 08:41:27.885 +08:00] [WARN] [MPPTask.cpp:471] ["Begin abort task: MPPquery:444877052576792581:6,task, abort type: ONCANCELLATION"] [source=MPPquery:444877052576792581:6,task] [thread_id=97]
[2023/10/12 08:41:27.886 +08:00] [WARN] [TiRemoteBlockInputStream.h:136] ["remote reader meets error: Receiver state: ERROR, error message: Read error message from mpp packet: Receive cancel request from TiDB"] [source="TiRemote(ExchangeReceiver) ExchangeReceiver MPPquery:444877052576792581:22,task ExchangeReceiver_195"] [thread_id=385]
[2023/10/12 08:41:27.886 +08:00] [WARN] [MPPTask.cpp:500] ["Finish abort task from running"] [source=MPPquery:444877052576792581:6,task] [thread_id=97]
[2023/10/12 08:41:27.897 +08:00] [WARN] [MPPTask.cpp:471] ["Begin abort task: MPPquery:444877052576792581:16,task, abort type: ONCANCELLATION"] [source=MPPquery:444877052576792581:16,task] [thread_id=97]
[2023/10/12 08:41:27.897 +08:00] [WARN] [MPPTask.cpp:500] ["Finish abort task from running"] [source=MPPquery:444877052576792581:16,task] [thread_id=97]
[2023/10/12 08:41:27.897 +08:00] [WARN] [MPPTask.cpp:471] ["Begin abort task: MPPquery:444877052576792581:22,task, abort type: ONCANCELLATION"] [source=MPPquery:444877052576792581:22,task] [thread_id=97]
[2023/10/12 08:41:27.897 +08:00] [WARN] [TiRemoteBlockInputStream.h:136] ["remote reader meets error: Receiver state: ERROR, error message: Read error message from mpp packet: Receive cancel request from TiDB"] [source="TiRemote(ExchangeReceiver) ExchangeReceiver MPPquery:444877052576792581:22,task ExchangeReceiver_195"] [thread_id=393]
[2023/10/12 08:41:27.897 +08:00] [WARN] [TiRemoteBlockInputStream.h:136] ["remote reader meets error: Receiver state: CANCELED, error message: Read error message from mpp packet: Receive cancel request from TiDB"] [source="TiRemote(ExchangeReceiver) ExchangeReceiver MPPquery:444877052576792581:22,task ExchangeReceiver_285"] [thread_id=524]
[2023/10/12 08:41:27.898 +08:00] [WARN] [TiRemoteBlockInputStream.h:136] ["remote reader meets error: Receiver state: CANCELED, error message: Read error message from mpp packet: Receive cancel request from TiDB"] [source="TiRemote(ExchangeReceiver) ExchangeReceiver MPPquery:444877052576792581:22,task ExchangeReceiver_285"] [thread_id=533]
[2023/10/12 08:41:27.898 +08:00] [WARN] [TiRemoteBlockInputStream.h:136] ["remote reader meets error: Receiver state: ERROR, error message: Read error message from mpp packet: Receive cancel request from TiDB"] [source="TiRemote(ExchangeReceiver) ExchangeReceiver MPPquery:444877052576792581:22,task ExchangeReceiver_195"] [thread_id=382]
[2023/10/12 08:41:27.898 +08:00] [WARN] [TiRemoteBlockInputStream.h:136] ["remote reader meets error: Receiver state: ERROR, error message: Read error message from mpp packet: Receive cancel request from TiDB"] [source="TiRemote(ExchangeReceiver) ExchangeReceiver MPPquery:444877052576792581:22,task ExchangeReceiver_195"] [thread_id=1827]
[2023/10/12 08:41:27.898 +08:00] [WARN] [TiRemoteBlockInputStream.h:136] ["remote reader meets error: Receiver state: CANCELED, error message: Read error message from mpp packet: Receive cancel request from TiDB"] [source="TiRemote(ExchangeReceiver) ExchangeReceiver MPPquery:444877052576792581:22,task ExchangeReceiver_285"] [thread_id=515]
[2023/10/12 08:41:27.898 +08:00] [WARN] [TiRemoteBlockInputStream.h:136] ["remote reader meets error: Receiver state: ERROR, error message: Read error message from mpp packet: Receive cancel request from TiDB"] [source="TiRemote(ExchangeReceiver) ExchangeReceiver MPPquery:444877052576792581:22,task ExchangeReceiver_216"] [thread_id=497]
[2023/10/12 08:41:27.898 +08:00] [WARN] [TiRemoteBlockInputStream.h:136] ["remote reader meets error: Receiver state: CANCELED, error message: Read error message from mpp packet: Receive cancel request from TiDB"] [source="TiRemote(ExchangeReceiver) ExchangeReceiver MPPquery:444877052576792581:22,task ExchangeReceiver_285"] [thread_id=518]
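
With this many entries, it can help to group them by level and source location to see which messages dominate. The following is only a rough sketch, assuming every entry starts with `[timestamp] [LEVEL] [file:line]` as in the excerpts above; the `summarize` helper and its regex are illustrative and not part of any TiDB tooling.

```python
import re
from collections import Counter

# Header of one log entry: [timestamp] [LEVEL] [source_file:line]
HEADER = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[(?P<level>[A-Z]+)\] \[(?P<src>[^\]]+)\]"
)


def summarize(lines):
    """Count log entries per (level, source location).

    Lines that do not start with the expected header are skipped.
    """
    counts = Counter()
    for line in lines:
        match = HEADER.match(line)
        if match:
            counts[(match.group("level"), match.group("src"))] += 1
    return counts
```

For example, passing an open `tiflash_error.log` file to `summarize` and calling `.most_common(5)` on the result lists the five most frequent level and source pairs.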

Request duration was high during the error period.

[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page

| username: ti-tiger

Check whether the TiFlash node's network connection is healthy, whether a firewall or anything else is blocking communication between TiDB Server and TiFlash, and whether the TiFlash data files are corrupted.
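
A quick way to run the connectivity part of this check from the TiDB Server host is a plain TCP probe. This is a minimal sketch, not an official tool: the `probe` helper is hypothetical, and the ports to try (3930 and 20292) are simply the ones that appear in the logs of the first post.

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a TCP connection and report what happened.

    Returns "open", "refused", "timeout", or "error: <details>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on this port.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"
```

For example, `probe("172.16.89.85", 3930)` and `probe("172.16.89.85", 20292)` can be run from 172.16.89.80. A result of "refused" usually means the host is reachable but no process is listening on the port, which is what the "connection refused" entries in the TiDB log suggest, whereas "timeout" is more typical of a firewall dropping packets or a routing problem.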

| username: wzf0072

The TiFlash service is down.

| username: Fly-bird

The connection dropped, so it is probably a network issue.

| username: wzf0072

The uptime of both TiFlash nodes has reset to zero, indicating that the service has automatically restarted. We are currently investigating the cause of the restart.

| username: wzf0072

Fault Handling Process
Fault Phenomenon:
TiFlash nodes frequently restart.
TiDB Server nodes frequently restart.
During the issue, a large number of high-memory-consuming SQL statements were being executed.
Solution:
Set the maximum memory usage of a single SQL statement to 10 GB:
SET global tidb_mem_quota_query = 10 << 30;
Cancel any SQL statement that uses more than 10 GB of memory:
set global tidb_mem_oom_action='CANCEL';

Original configuration:
SET tidb_mem_quota_query = 24 << 30;
set global tidb_mem_oom_action='LOG';
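
The `N << 30` expressions in the statements above are left shifts: shifting by 30 bits multiplies by 2^30, so the value is N GiB expressed in bytes. A small sketch of the arithmetic, where `quota_bytes` is just an illustrative helper name:

```python
def quota_bytes(gib: int) -> int:
    """Return the byte value that `gib << 30` evaluates to (gib * 2**30)."""
    return gib << 30


new_quota = quota_bytes(10)  # 10737418240 bytes, the new per-statement limit
old_quota = quota_bytes(24)  # 25769803776 bytes, the original limit
```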

System Operation After Configuration Change:
After limiting memory usage, TiDB Server and TiFlash no longer restart.
SQL execution is automatically interrupted once it exceeds 10 GB of memory usage.

The issue was resolved after SQL optimization went live.

| username: 像风一样的男子

You need to optimize this SQL properly.

| username: wzf0072

We use TiDB for reporting and analysis. It definitely needs some optimization.
