Error in Compute-Storage Separation in TiFlash v7.1.1

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tiflash v7.1.1存算分离报错 (TiFlash v7.1.1 compute-storage separation error)

| username: TiDBer_Lee

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] v7.1.1
In a compute-storage separation (disaggregated) deployment, the following error keeps appearing in the compute node's logs:

[2023/09/26 01:48:55.096 +00:00] [ERROR] [DiagnosticsService.cpp:57] 
["TiFlashRaftProxyHelper is null, `DiagnosticsService::server_info` is useless"] 
[source=DiagnosticsService] [thread_id=351]

I wonder if anyone else has encountered this issue.

| username: TiDBer_oHSwKxOH | Original post link

Post the architecture diagram.

| username: 有猫万事足 | Original post link

TiFlashRaftProxyHelper inherits from RaftStoreProxyFFIHelper.

The role of RaftStoreProxyFFIHelper is:

“TiFlash and Proxy will each encapsulate FFI functions into Helper objects and then hold each other’s Helper pointers. RaftStoreProxyFFIHelper is the handle that Proxy provides for TiFlash to call. It encapsulates the RaftStoreProxy object. Through this handle, TiFlash can perform tasks such as ReadIndex, parsing SST, obtaining Region-related information, and Encryption.”

In other words, the handle that the Proxy (TiFlash's embedded, TiKV-based proxy) provides for TiFlash to call is null. Consequently, TiFlash cannot perform ReadIndex, SST parsing, Region-info retrieval, or Encryption through this handle.
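For readers unfamiliar with this FFI layout, here is a minimal C++ sketch of the mutual-helper pattern described above. The members and function names are simplified assumptions for illustration, not the actual TiFlash or proxy headers.

```cpp
#include <cstdint>

// Hypothetical, simplified stand-ins for the real FFI structs;
// member and function names here are illustrative assumptions.

// Handle the Proxy hands to TiFlash: it wraps the RaftStoreProxy
// object behind C-ABI function pointers.
struct RaftStoreProxyFFIHelper
{
    void * proxy_ptr; // opaque pointer to the Rust-side RaftStoreProxy
    std::uint64_t (*read_index)(void * proxy, std::uint64_t region_id);
    // ... further entries for SST parsing, Region info, Encryption
};

// TiFlash-side wrapper; as quoted above, it derives from the FFI
// handle and adds convenience methods over the raw function pointers.
struct TiFlashRaftProxyHelper : RaftStoreProxyFFIHelper
{
    std::uint64_t readIndex(std::uint64_t region_id) const
    {
        return read_index(proxy_ptr, region_id);
    }
};

int main()
{
    // On a node where no Proxy was started, TiFlash ends up holding a
    // null helper pointer -- exactly the situation the log line reports.
    const TiFlashRaftProxyHelper * proxy_helper = nullptr;
    return proxy_helper == nullptr ? 0 : 1;
}
```

The point of the pattern is that the handle is only useful once the Proxy has actually been started and has handed its helper over; a node that never runs a Proxy leaves TiFlash holding a null pointer.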

It seems that TiFlash synchronization might be abnormal. It is recommended to check if there are other logs.

| username: TiDBer_Lee | Original post link

Currently, only the TiFlash compute node is logging errors.
All tables have been resynchronized, and their replica statuses are available.
There is also an alert in the PD node's log:
[grpclog.go:60] ["transport: http2Server.HandleStreams failed to read frame: read tcp 10.60.71.229:2379->10.60.76.129:47208: read: connection reset by peer"]

| username: TiDBer_小阿飞 | Original post link

This question is too difficult; I don't know how to solve it! Waiting for an expert to explain and provide the best answer.

| username: 有猫万事足 | Original post link

Right before this error is emitted, the code does check whether the node is a TiFlash compute node:

“tiflash compute node should be managed by AutoScaler instead of PD, this grpc should not be called by AutoScaler for now”

The general idea is that a compute node should be managed by AutoScaler instead of PD. The fact that the error is reported at line 57 means the check at line 43 did not take effect as intended. Could this be a bug?
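To make that suspicion concrete, here is a minimal C++ sketch of what such a guard followed by the error log could look like. The class and function names are assumptions for illustration and are not copied from the real DiagnosticsService.cpp.

```cpp
#include <cstdio>

// Hypothetical stand-ins; names are illustrative, not the real code.
struct TiFlashRaftProxyHelper; // opaque here

struct Context
{
    bool is_compute_node = true;
    bool isDisaggregatedComputeMode() const { return is_compute_node; }
};

struct DiagnosticsService
{
    const Context & ctx;
    const TiFlashRaftProxyHelper * proxy_helper = nullptr;

    void serverInfo() const
    {
        // The guard discussed above ("line 43"): a compute node is
        // managed by AutoScaler rather than PD, so this gRPC should
        // not be served here at all.
        if (ctx.isDisaggregatedComputeMode())
            return; // expected early exit on a compute node

        // The error at "line 57": reached only if the guard above did
        // not fire even though no proxy helper was ever installed.
        if (proxy_helper == nullptr)
        {
            std::fprintf(stderr,
                         "TiFlashRaftProxyHelper is null, "
                         "`DiagnosticsService::server_info` is useless\n");
            return;
        }
        // ... otherwise collect server info through the proxy helper ...
    }
};

int main()
{
    Context ctx;
    // Simulate the suspected bug: the node actually is a compute node,
    // but the check fails to recognize it, so the guard does not fire.
    ctx.is_compute_node = false;
    const DiagnosticsService svc{ctx};
    svc.serverInfo(); // prints the "TiFlashRaftProxyHelper is null" line
    return 0;
}
```

If the guard worked, a compute node would return early and the error branch would never be reached; the fact that the log still appears suggests the compute-node check evaluated to false on this node.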

However, even if it is a bug, it is likely just an alarming log line with no real impact: the node failed to recognize itself as a compute node and printed a log that should never have been emitted. Let's wait for other experts to take a look; I'm out of ideas.

| username: ajin0514 | Original post link

You could try upgrading to a newer version.