What are the advantages and disadvantages of TiFlash compute-storage separation?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiFlash存算分离有哪些优劣势?

| username: 江湖故人

Starting from TiDB v7.0.0, TiFlash supports a storage-computation separation architecture. What are its advantages and disadvantages, and which business scenarios is it suitable for?

| username: 啦啦啦啦啦

The main reason is to save costs.

| username: xfworld

Theoretically, computing power and storage capacity can be unlimited…

An integrated compute-and-storage architecture ties the two together, which limits how far either can scale on its own.

| username: Jellybean

The TiFlash storage-compute separation architecture suits analytical workloads with pronounced peaks and troughs, or cases where only a fraction of a very large dataset needs to be computed on. Once separated, storage and compute can each be scaled independently to match business demand, much as tidb-server and tikv-server can be scaled out or in separately, which greatly improves flexibility.
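
To make the peaks-and-troughs point concrete, here is a minimal back-of-the-envelope sketch. The function name `compute_node_hours`, the demand profile, and the node size are all hypothetical and are not taken from TiFlash; the sketch only compares a fixed fleet sized for the peak with compute nodes that follow demand hour by hour.

```python
import math

def compute_node_hours(hourly_cores, cores_per_node, elastic):
    """Total node-hours needed to serve an hourly demand for CPU cores.

    elastic=False: a fixed fleet sized for the busiest hour runs all day.
    elastic=True:  the fleet is resized every hour to match demand.
    """
    per_hour = [math.ceil(c / cores_per_node) for c in hourly_cores]
    if elastic:
        return sum(per_hour)
    return max(per_hour) * len(per_hour)

# Hypothetical day: 20 quiet hours at 16 cores, 4 peak hours at 128 cores.
demand = [16] * 20 + [128] * 4

fixed_hours = compute_node_hours(demand, cores_per_node=16, elastic=False)
elastic_hours = compute_node_hours(demand, cores_per_node=16, elastic=True)
print(fixed_hours, elastic_hours)
```

Under these assumed numbers the fixed fleet consumes far more node-hours than the elastic one. Real savings depend on how quickly nodes can actually be added and removed and on how storage is billed, neither of which this sketch models.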

| username: zhaokede

Although I haven’t actually deployed it, in theory horizontal scaling can target compute capacity more precisely.

| username: 随便改个用户名

Cheap storage, plus remote reads of data held in S3 storage.

| username: 小龙虾爱大龙虾

Cloud users save more money :joy_cat:

| username: changpeng75

The advantage is that TiFlash uses columnar storage, which performs well for OLAP workloads, and its data is replicated asynchronously from TiKV, so performance on the TiKV side is not affected. The disadvantage is the additional storage space required.

| username: Kongdom

It should mainly be to address more scenarios and meet more customer needs.

| username: 随缘天空

Advantages: The separation of storage and computation architecture allows for independent scaling based on the needs of computation and storage, which is beneficial for compute-intensive or storage-intensive application scenarios. It also allows users to optimize the use of storage and computation resources according to specific business needs, thereby achieving higher resource utilization.

Disadvantages: Separating storage and computation may add network latency and data transfer overhead, especially during large-scale data processing, which can affect overall performance. Maintaining data synchronization and consistency can also become more complex, particularly in high-concurrency scenarios.

| username: Soysauce520

Cost reduction and efficiency improvement

| username: 哈喽沃德

Advantages:

  1. High-performance queries: Due to the characteristics of columnar storage, TiFlash can achieve higher query performance in certain scenarios, especially for operations that require scanning large amounts of data, such as aggregation and analytical queries.
  2. Better compression ratio: Columnar storage can use more efficient compression algorithms, reducing storage space usage.
  3. Independent computing resources: By separating storage and computing, independent computing resources can be allocated to TiFlash, avoiding mutual interference between storage and computing, and improving the overall system’s stability and scalability.
  4. Elastic scalability: TiFlash can be horizontally scaled according to demand, supporting the dynamic addition or removal of TiFlash nodes to meet the growing data volume and query load.

Disadvantages:

  1. Data synchronization delay: TiFlash needs to synchronize data from TiKV, which may lead to synchronization delays, making it less suitable for business scenarios with strict real-time requirements.
  2. Relatively low write performance: Compared to TiKV, TiFlash has lower write performance due to the characteristics of columnar storage. Therefore, for write-heavy workloads, TiKV is still the more suitable choice.
  3. Configuration and management complexity: The storage-compute separation architecture requires additional configuration and management, including data synchronization and query routing, which may increase system complexity and maintenance costs.

| username: 江湖故人

Thank you, everyone! I just checked the official documentation, which mentions two main advantages of TiFlash’s disaggregated storage and compute architecture:

  1. Automatic management of hot and cold data. The disaggregated storage and compute architecture automatically caches frequently used data on local SSDs of compute nodes, while cold data is stored on cheaper S3, reducing storage costs.
  2. Stateless compute nodes can scale within seconds. When demand for compute resources changes significantly, the disaggregated architecture is more elastic than integrated TiFlash: compute nodes can be scaled out or in as needed, which saves costs.
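
The hot/cold behavior in point 1 can be illustrated with a toy model. This is only a sketch: the class `ComputeNodeCache` is hypothetical, a plain dict stands in for S3, and the LRU eviction policy is an assumption, since the thread does not say which policy TiFlash actually uses.

```python
from collections import OrderedDict

class ComputeNodeCache:
    """Toy LRU cache standing in for a compute node's local SSD."""

    def __init__(self, capacity, object_store):
        self.capacity = capacity          # max number of cached entries
        self.object_store = object_store  # dict standing in for S3
        self.local = OrderedDict()        # least recently used entry first
        self.remote_reads = 0             # count of fetches from the store

    def read(self, key):
        if key in self.local:
            # Hot data: served locally, marked as most recently used.
            self.local.move_to_end(key)
            return self.local[key]
        # Cold data: fetched from the object store, then cached.
        self.remote_reads += 1
        value = self.object_store[key]
        self.local[key] = value
        if len(self.local) > self.capacity:
            self.local.popitem(last=False)  # evict least recently used
        return value

store = {f"page-{i}": f"data-{i}" for i in range(5)}
cache = ComputeNodeCache(capacity=2, object_store=store)
for key in ["page-0", "page-1", "page-0", "page-0", "page-4", "page-0"]:
    cache.read(key)
print(cache.remote_reads, list(cache.local))
```

Because the cache holds only copies of data that also lives in the object store, discarding a node loses nothing durable, which is what makes the stateless compute nodes in point 2 cheap to add and remove.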

Additionally, I thought of a potential problem: if newly written data is read immediately under the disaggregated storage and compute architecture, the read takes a detour through S3, which increases I/O cost.

If we consider TiFlash as an independent OLAP database, with the development of network transmission technology and storage network technology, as well as the prosperity of public clouds, it seems that TiFlash’s disaggregated storage and compute architecture is the trend.

| username: YuchongXU

Reduce storage costs while ensuring computational performance.
