On TiFlash Compute-Storage Separation

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 关于TiFlash存算分离

| username: TiDBer_Lee

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] v7.5.0

  • I encountered the following issues while testing the disaggregated storage and compute architecture of TiFlash:

    • Reference: TiFlash Disaggregated Storage and Compute Architecture and S3 Support | PingCAP Docs
    • Two write nodes
    • The table has two replicas
    • When one of my write nodes fails, the table becomes inaccessible:
      • ERROR 1105 (HY000): other error for mpp stream: Code: 0, e.displayText() = DB::Exception: EstablishDisaggTask Failed14: failed to connect to all addresses, e.what() = DB::Exception,
    • When checking the distribution of the table’s regions, I can see that all regions have replicas on the surviving write node
  • How can I fix this quickly? The original TiFlash architecture was highly available, so does the disaggregated architecture have a single point of failure?

  • Note: I am using the new disaggregated storage and compute architecture, with TiFlash data stored on S3. In this architecture, TiFlash is split into two roles: write nodes and read nodes. My issue is that when a write node fails or is manually shut down, the table becomes inaccessible. If you have tested this setup, please weigh in; the official documentation is referenced above.
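
The check described above (every region of the table still has a replica on the surviving write node) can be sketched in Python. This is only an illustration under assumptions: the function name, the store IDs, and the region-to-store mapping are hypothetical sample data, not output from a real cluster. In practice the mapping would come from the cluster's region metadata.

```python
from typing import Dict, Iterable, List, Set


def regions_without_live_replica(
    region_peers: Dict[int, Iterable[int]],
    live_stores: Set[int],
) -> List[int]:
    """Return the IDs of regions that have no replica on any live store."""
    missing = []
    for region_id, store_ids in sorted(region_peers.items()):
        # A region is covered if at least one of its replicas is on a live store.
        if not live_stores.intersection(store_ids):
            missing.append(region_id)
    return missing


# Hypothetical sample: two write nodes with store IDs 101 and 102,
# and a table whose regions each keep a replica on both nodes.
region_peers = {1: [101, 102], 2: [101, 102], 3: [101, 102]}

# Store 102 is stopped; store 101 survives.
uncovered = regions_without_live_replica(region_peers, live_stores={101})
print(uncovered)  # an empty list means every region is still covered
```

An empty result corresponds to what is reported here: the surviving write node holds a replica of every region, so missing replicas alone do not seem to explain the query failure.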

| username: 双开门变频冰箱 | Original post link

Has the node already been kicked out?

| username: TiDBer_vfJBUcxl | Original post link

[Post content not preserved: the original contained only a link and an image.]

| username: dba远航 | Original post link

This is probably caused by a leader election failure.

| username: TiDBer_Lee | Original post link

I am talking about TiFlash.

| username: lemonade010 | Original post link

Since the query uses MPP, I recommend either taking the faulty machine offline or disabling MPP.

| username: TiDBer_Lee | Original post link

This issue is not related to MPP.

| username: zhanggame1 | Original post link

Have you taken the unavailable TiFlash node offline?

| username: TiDBer_Lee | Original post link

I manually stopped one of the TiFlash write nodes.

| username: zhanggame1 | Original post link

I haven’t used the disaggregated storage and compute architecture; let’s see whether other experts have.

| username: TiDBer_5Vo9nD1u | Original post link

Is there an issue with one of the nodes?

| username: AnotherCalvinNeo | Original post link

The user provided detailed reproduction steps in TiFlash存算分离写节点高可用问题 · Issue #8774 · pingcap/tiflash · GitHub. We currently suspect this is a TiFlash bug. Could you provide the logs from the TiFlash write nodes (wn)?

| username: TiDBer_Lee | Original post link

[Post content not preserved: the original contained only an external link.]

| username: AnotherCalvinNeo | Original post link

My steps are:
Step 1: tiup cluster stop xxx (stop one of the write nodes)
At this point, all queries fail
Step 2: Change the table from two replicas to one
At this point, the TIKV_REGION_STATUS table still shows unavailable replicas
At this point, queries on the table still fail
Step 3: Restart the surviving write node
At this point, queries still fail
Step 4: Start the write node that was stopped in Step 1
At this point, queries succeed

Approximately when did you perform the first and second steps?

Additionally, the logs show many errors on connections to AWS, including Access Denied errors, which may be due to misconfigured permissions.

| username: AnotherCalvinNeo | Original post link

Please check whether the following configuration is complete:

  - host: 

| username: zhang_2023 | Original post link

The node has been removed.