Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: TiFlash的MPP应该怎么理解?(或者说如何更好使用TiFlash)
MPP stands for Massively Parallel Processing (literally “大规模并行处理” in Chinese), but the TiDB official documentation only touches on it in two places:
- When discussing how to enable MPP, it mentions, “TiFlash supports query execution in MPP mode, which involves cross-node data exchange (data shuffle process) in computation.” (See “Using MPP Mode”.)
- In the 5.0.0 Release Note, the introduction is also relatively vague. (See “MPP Architecture”, quoted here:)
TiDB introduces the MPP architecture through TiFlash nodes. This allows large table join queries to be shared and completed by different TiFlash nodes.
When MPP mode is enabled, TiDB will decide whether to use the MPP framework for computation based on cost. In MPP mode, table joins will redistribute the computation pressure to various TiFlash execution nodes by performing data redistribution (Exchange operation) on the JOIN Key, thereby accelerating the computation. Furthermore, with the previously supported aggregation computation by TiFlash, TiDB can push down the computation of a query to the TiFlash MPP cluster in MPP mode, leveraging the distributed environment to accelerate the entire execution process and significantly improve the speed of analytical queries.
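To check my own reading of the "Exchange operation on the JOIN Key" described above, here is a minimal Python sketch. It is only a toy simulation, not TiFlash's actual implementation: the function names (`exchange_by_key`, `local_hash_join`, `mpp_join`), the sample tables, and the use of Python's built-in `hash` as the partitioning function are all my own assumptions for illustration.

```python
from collections import defaultdict


def exchange_by_key(rows, key, num_nodes):
    """Simulated Exchange: hash-partition rows on the join key.

    Rows with the same key value always land in the same bucket,
    so each simulated node can join its bucket independently.
    """
    buckets = [[] for _ in range(num_nodes)]
    for row in rows:
        buckets[hash(row[key]) % num_nodes].append(row)
    return buckets


def local_hash_join(left, right, key):
    """Plain in-memory hash join, run by one simulated node."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def mpp_join(left, right, key, num_nodes):
    """Redistribute both inputs on the join key, then join per node."""
    left_parts = exchange_by_key(left, key, num_nodes)
    right_parts = exchange_by_key(right, key, num_nodes)
    result = []
    for node in range(num_nodes):
        # On a real cluster these per-node joins would run in parallel.
        result.extend(local_hash_join(left_parts[node], right_parts[node], key))
    return result


# Hypothetical sample data, not taken from any real schema.
orders = [
    {"order_id": 1, "customer_id": 10},
    {"order_id": 2, "customer_id": 11},
    {"order_id": 3, "customer_id": 10},
    {"order_id": 4, "customer_id": 12},
]
customers = [
    {"customer_id": 10, "name": "Ann"},
    {"customer_id": 11, "name": "Bob"},
    {"customer_id": 13, "name": "Cy"},
]

joined = mpp_join(orders, customers, "customer_id", num_nodes=3)
print(sorted(row["order_id"] for row in joined))
```

In this toy model, calling `mpp_join` with `num_nodes=1` returns the same rows. The exchange step just puts everything into one bucket, so nothing is spread out. Whether a real single-node TiFlash deployment behaves the same way is exactly what I am asking below.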
These two sections discuss how to enable MPP and what MPP is. However, they do not introduce the application scenarios targeted by MPP or how to better use MPP. If there is only one TiFlash node, does it help with query optimization?
Is the idea that MPP makes better use of TiFlash's columnar storage for tables that are queried along many different dimensions (for example, an order table might be filtered on five or six different fields)? If every one of those fields were indexed, it would significantly impact TiDB’s write performance. Since TiFlash is columnar storage, it naturally creates single-column indexes for each field, which makes it friendlier to such complex queries. Going one step further, does this mean that once a table is synchronized to TiFlash, the single-column indexes on that table can be dropped?
Additionally, what are the advantages of multiple TiFlash nodes over a single TiFlash node? Is it that with multiple TiFlash nodes, a single table’s Regions might be spread across several nodes, resulting in higher efficiency?