TiFlash Error: Error Code: 1105. rpc error: code = Unavailable desc = error reading from server: EOF 22.359 sec

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tiflash报错:Error Code: 1105. rpc error: code = Unavailable desc = error reading from server: EOF 22.359 sec

| username: Jaimyjie

A query joins 5 tables with a large amount of data; the tables have TiFlash replicas. During execution it fails with: Error Code: 1105. rpc error: code = Unavailable desc = error reading from server: EOF 22.359 sec

CPU and memory usage are relatively low during the query (server: 32C + 128G).

Dashboard TiFlash logs:

```
2023-03-20 09:36:54 (UTC+08:00) TiFlash 192.168.0.105:3930 [kv.rs:671] ["KvService::batch_raft send response fail"] [err=RemoteStopped]
2023-03-20 09:36:54 (UTC+08:00) TiFlash 192.168.0.105:3930 [raft_client.rs:562] ["connection aborted"] [addr=192.168.0.106:20170] [receiver_err="Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "Socket closed", details: }))"] [sink_error="Some(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: "Socket closed", details: })))"] [store_id=152]
2023-03-20 09:36:54 (UTC+08:00) TiFlash 192.168.0.105:3930 [raft_client.rs:858] ["connection abort"] [addr=192.168.0.106:20170] [store_id=152]
2023-03-20 09:36:54 (UTC+08:00) TiFlash 192.168.0.104:3930 [kv.rs:671] ["KvService::batch_raft send response fail"] [err=RemoteStopped]
2023-03-20 09:36:54 (UTC+08:00) TiFlash 192.168.0.104:3930 [raft_client.rs:562] ["connection aborted"] [addr=192.168.0.106:20170] [receiver_err="Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "Socket closed", details: }))"] [sink_error="Some(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: "Socket closed", details: })))"] [store_id=152]
2023-03-20 09:36:54 (UTC+08:00) TiFlash 192.168.0.104:3930 [raft_client.rs:858] ["connection abort"] [addr=192.168.0.106:20170] [store_id=152]
2023-03-20 09:36:59 (UTC+08:00) TiFlash 192.168.0.105:3930 [raft_client.rs:821] ["wait connect timeout"] [addr=192.168.0.106:20170] [store_id=152]
2023-03-20 09:36:59 (UTC+08:00) TiFlash 192.168.0.104:3930 [raft_client.rs:821] ["wait connect timeout"] [addr=192.168.0.106:20170] [store_id=152]
2023-03-20 09:37:04 (UTC+08:00) TiFlash 192.168.0.105:3930 [raft_client.rs:821] ["wait connect timeout"] [addr=192.168.0.106:20170] [store_id=152]
2023-03-20 09:37:04 (UTC+08:00) TiFlash 192.168.0.104:3930 [raft_client.rs:821] ["wait connect timeout"] [addr=192.168.0.106:20170] [store_id=152]
```

| username: xfworld | Original post link

Please provide a tiup cluster configuration list and status.

You’ve posted so much, but I still don’t understand the relationships between them, nor do I know what these IPs are for (why not fill in the content according to the posting requirements?).
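
For reference, one way to provide the topology and status being asked for is `tiup cluster display <cluster-name>` (the cluster name is a placeholder), or the equivalent information can be pulled from the SQL layer. A minimal sketch, not taken from the original thread:

```sql
-- List every component instance with its address, version, and uptime
-- (information_schema cluster tables, available since TiDB 4.0).
SELECT TYPE, INSTANCE, STATUS_ADDRESS, VERSION, UPTIME
FROM INFORMATION_SCHEMA.CLUSTER_INFO;
```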

| username: Jaimyjie | Original post link

[The original reply contained only an image, which was not carried over in this translation.]

| username: Jaimyjie | Original post link

After tuning TiFlash according to the performance tuning guide (TiFlash 性能调优 | PingCAP 文档中心), the server's CPU and memory are still not fully utilized. The execution was interrupted, and the TiFlash node crashed outright.
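
For readers following along, the replica availability and MPP settings that the tuning guide touches on can be checked from the SQL layer; a minimal sketch, assuming the joined tables live in a database named `test` (hypothetical):

```sql
-- Verify that the TiFlash replicas of the joined tables are fully synced
-- (AVAILABLE = 1 and PROGRESS = 1 mean the replica is ready to serve reads).
SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
FROM information_schema.tiflash_replica
WHERE TABLE_SCHEMA = 'test';

-- Check whether MPP execution is enabled for the current session.
SHOW VARIABLES LIKE '%mpp%';
```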

| username: xfworld | Original post link

Is the network bandwidth sufficient?

The error in the logs indicates that the network communication for several TiFlash nodes was interrupted.

What service is running on 192.168.0.106:20170?
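
Since store_id=152 shows up in every error line, one way to answer this is to map the store ID back to an instance (20170 is, by default, the TiFlash proxy port, so it is presumably a TiFlash store). A minimal sketch, not from the original thread:

```sql
-- Map store 152 from the logs to an address and engine label;
-- a TiFlash store carries the label engine=tiflash.
SELECT STORE_ID, ADDRESS, STORE_STATE_NAME, LABEL, VERSION
FROM INFORMATION_SCHEMA.TIKV_STORE_STATUS
WHERE STORE_ID = 152;
```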

| username: Jaimyjie | Original post link

TiFlash, I will test version 6.1 tomorrow.

| username: Jaimyjie | Original post link

My query joins 5 tables, each with around 2 million rows, and many of the reads are full table scans. Tests on both 6.1 and 6.5 caused the TiFlash nodes to crash. It might be that this kind of query and data shape isn't well supported or optimized. I'm not going to tinker with it anymore and will switch to data governance and cleaning solutions. Thank you, everyone.

| username: jansu-dev | Original post link

Have you already given up? Judging by the context, the root cause is still unclear. Did you give up without knowing the reason?

| username: Jaimyjie | Original post link

Based on my data and query conditions, the TiFlash nodes crashed in tests on both 6.1 and 6.5. There also seems to be a keepalive issue, which points to a problem: the error message suggests the TiFlash nodes cannot be connected to, even though my hardware usage is not very high. As I understand it, there are mainly three possibilities: 1. Many of the query conditions have no indexes, leading to full table scans. 2. The optimization hasn't gone deep enough. 3. The current system simply cannot support this workload. The project is quite urgent, and since we are using the open-source software without service support, we can't have high expectations; relying on day-to-day communication in the group would affect the project's progress, haha.

| username: xfworld | Original post link

What configuration?

| username: Jaimyjie | Original post link

All nodes: 32C+128G+SSD

| username: xfworld | Original post link

Let’s take a look at the execution plan. Don’t waste such good hardware resources. :heart_eyes:
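
For reference, the plan being asked for can be captured with `EXPLAIN ANALYZE`; a minimal sketch with hypothetical table and column names, not the original query:

```sql
-- EXPLAIN ANALYZE executes the statement and reports, per operator,
-- actual time, row counts, and which task type (TiKV or TiFlash) ran it.
EXPLAIN ANALYZE
SELECT o.customer_id, COUNT(*) AS cnt, SUM(o.amount) AS total
FROM orders o
JOIN customers c ON c.id = o.customer_id
GROUP BY o.customer_id;
-- Task columns such as batchCop[tiflash] or mpp[tiflash], or an
-- ExchangeSender operator, indicate the work was pushed down to TiFlash.
```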

| username: Jaimyjie | Original post link

However, this is another environment with lower configuration. The high-configuration environment has been reinstalled and has not been deployed yet.

| username: xfworld | Original post link

Most of the operations are aggregations, and few of them are pushed down, which is quite demanding for TiDB.

| username: Jaimyjie | Original post link

Yes, do you have any good suggestions for optimization?

| username: xfworld | Original post link

Adjust the structure, optimize the indexes, and try to use push-down methods to achieve parallel computing and speed up the process.
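
A minimal sketch of what that advice can look like in practice, with hypothetical table and column names (the MPP variables assume TiDB v5.1 or later, so they apply to the 6.1/6.5 versions tested here):

```sql
-- Index the join/filter column so row-store reads stop being full table scans.
ALTER TABLE orders ADD INDEX idx_orders_customer (customer_id);

-- Encourage the optimizer to run the analytical query as TiFlash MPP,
-- so joins and aggregations are pushed down and computed in parallel.
SET SESSION tidb_allow_mpp = ON;
SET SESSION tidb_enforce_mpp = ON;
EXPLAIN SELECT customer_id, COUNT(*) AS cnt FROM orders GROUP BY customer_id;
```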

| username: Jaimyjie | Original post link

The data from the old system was transferred through DM. Due to significant changes in the old system, it was suggested to abandon it and switch to another solution.