On TiFlash Compute-Storage Separation

| username: TiDBer_Lee

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] v7.5.0

  • I encountered the following issues while testing the disaggregated storage and compute architecture of TiFlash:

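    • Relevant documentation: TiFlash Disaggregated Storage and Compute Architecture and S3 Support | PingCAP Docs (original title: TiFlash 存算分离架构与 S3 支持 | PingCAP 文档中心)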
    • Two write nodes
    • The table has two replicas
    • When one of my write nodes fails, the table becomes inaccessible:
      • ERROR 1105 (HY000): other error for mpp stream: Code: 0, e.displayText() = DB::Exception: EstablishDisaggTask Failed14: failed to connect to all addresses, e.what() = DB::Exception,
    • When checking the distribution of the table’s regions, I can see that all regions have replicas on the surviving write node
  • How can I fix this quickly? The original TiFlash architecture was highly available, so does the disaggregated architecture have a single point of failure?

  • Note: I am using the new disaggregated storage and compute architecture, with TiFlash data stored on S3. TiFlash is divided into two roles: write nodes and read nodes. My issue is that when one write node fails or is manually shut down, the table becomes inaccessible. If anyone has tested this, please refer to the official documentation.
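
For anyone who wants to repeat the region-distribution check mentioned above, here is a minimal sketch. It only prints a query against the `information_schema` tables `TIKV_REGION_STATUS`, `TIKV_REGION_PEERS`, and `TIKV_STORE_STATUS`. It assumes TiFlash peers show up as Raft learners (`IS_LEARNER = 1`). The function name `tiflash_peer_sql` and the database/table names are placeholders, not something from the original post.

```shell
#!/usr/bin/env bash
# Sketch: print a query that lists the learner (TiFlash) peers of one table's
# regions, together with the store each peer lives on.
# The arguments are placeholders; pass only trusted names, since they are
# interpolated into the SQL text as-is.
tiflash_peer_sql() {
  local db="$1" tbl="$2"
  cat <<SQL
SELECT r.REGION_ID, p.STORE_ID, s.ADDRESS, s.STORE_STATE_NAME, p.STATUS
FROM information_schema.TIKV_REGION_STATUS r
JOIN information_schema.TIKV_REGION_PEERS p ON p.REGION_ID = r.REGION_ID
JOIN information_schema.TIKV_STORE_STATUS s ON s.STORE_ID = p.STORE_ID
WHERE r.DB_NAME = '${db}' AND r.TABLE_NAME = '${tbl}' AND p.IS_LEARNER = 1
ORDER BY r.REGION_ID, p.STORE_ID;
SQL
}
```

Usage, assuming a MySQL client and a TiDB server on the default port (both are assumptions): `tiflash_peer_sql test t | mysql -h 127.0.0.1 -P 4000 -u root`. If every region lists a peer on the surviving write node, the output should match what is described in the post.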

Has the node already been kicked out?

This is probably caused by a leader election failure.

I am talking about TiFlash.

Since the query runs through MPP, I recommend either taking the faulty machine offline or disabling MPP.

This issue is not related to MPP.

Have you taken the unavailable TiFlash node offline?

I manually stopped one of the TiFlash write nodes.

I haven’t used the storage-compute separation architecture. Let’s see if other experts have experience with it.

Is there an issue with one of the nodes?

The user provided detailed reproduction steps in "TiFlash disaggregated storage-compute write node high availability issue" (original title: TiFlash存算分离写节点高可用问题 · Issue #8774 · pingcap/tiflash · GitHub). We currently suspect this is a TiFlash bug. Could you provide the TiFlash write node (wn) logs?

My steps are:
Step 1: Stop one of the write nodes with tiup cluster stop xxx
At this point, all queries fail
Step 2: Change the table from two replicas to one replica
At this point, the TIKV_REGION_STATUS table still shows unavailable replicas
At this point, queries on the table still fail
Step 3: Restart the surviving write node
At this point, queries still fail
Step 4: Start the write node that was stopped in step 1
At this point, queries succeed
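
For reference, the four steps above could be written out as concrete commands roughly like this. It is only a dry-run sketch that prints the commands rather than running them. The cluster name, node addresses, and table name are hypothetical placeholders, and selecting a single node with `-N` is my assumption about how the write node was targeted.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print the commands for the four reproduction steps.
# All four arguments are hypothetical placeholders supplied by the caller.
print_repro_steps() {
  local cluster="$1" wn_stopped="$2" wn_alive="$3" table="$4"
  # Step 1: stop one write node
  echo "tiup cluster stop ${cluster} -N ${wn_stopped}"
  # Step 2: reduce the table from two TiFlash replicas to one (run in a SQL client)
  echo "ALTER TABLE ${table} SET TIFLASH REPLICA 1;"
  # Step 3: restart the surviving write node
  echo "tiup cluster restart ${cluster} -N ${wn_alive}"
  # Step 4: start the write node stopped in step 1
  echo "tiup cluster start ${cluster} -N ${wn_stopped}"
}
```

Example: `print_repro_steps mycluster 10.0.0.1:9000 10.0.0.2:9000 test.t` prints one line per step. The `tiup` lines go to a shell on the control machine and the `ALTER TABLE` line goes to a SQL client.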

Approximately when did you perform steps 1 and 2?

Additionally, the logs contain many errors from failed connections to AWS, specifically Access Denied errors, which may be caused by incorrect permission configuration.
It is recommended to check if the following configurations are fully completed:

  - host: 
The node has been removed.