How should we understand TiFlash's MPP? (Or how to better use TiFlash)

translator_bot · June 23, 2024, 4:04am

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiFlash的MPP应该怎么理解？(或者说如何更好使用TiFlash)

| username: dba-kit

MPP (Massively Parallel Processing) translates literally as “大规模并行处理” (large-scale parallel processing), but the TiDB official documentation only covers it in two places:

The section on enabling MPP (Using MPP Mode) says: “TiFlash supports query execution in MPP mode, which involves cross-node data exchange (data shuffle process) in computation.”
The 5.0.0 Release Notes (MPP Architecture) are also fairly vague:

TiDB introduces the MPP architecture through TiFlash nodes. This allows large table join queries to be shared and completed by different TiFlash nodes.
When MPP mode is enabled, TiDB will decide whether to use the MPP framework for computation based on cost. In MPP mode, table joins will redistribute the computation pressure to various TiFlash execution nodes by performing data redistribution (Exchange operation) on the JOIN Key, thereby accelerating the computation. Furthermore, with the previously supported aggregation computation by TiFlash, TiDB can push down the computation of a query to the TiFlash MPP cluster in MPP mode, leveraging the distributed environment to accelerate the entire execution process and significantly improve the speed of analytical queries.

These two sections discuss how to enable MPP and what MPP is. However, they do not introduce the application scenarios targeted by MPP or how to better use MPP. If there is only one TiFlash node, does it help with query optimization?

Is the point that MPP makes better use of TiFlash’s columnar storage for tables that are queried along many dimensions (for example, an order table might be filtered on 5 or 6 different columns)? Indexing every one of those fields would significantly hurt TiDB’s write performance. Since TiFlash is columnar storage, my understanding is that it effectively has a single-column index on every field, which would make it friendlier to such queries. If so, does that mean that once a table is synchronized to TiFlash, the single-column indexes on that table can be dropped?
Additionally, what are the advantages of multiple TiFlash nodes over a single TiFlash node? Is it that with multiple TiFlash nodes, a single table’s Regions may be spread across several nodes, resulting in higher efficiency?

| username: wink

TiFlash only has coarse-grained indexes, so if you remove the indexes from TiKV, point queries on those columns can no longer be served efficiently. With only one TiFlash node, it can still help with aggregation queries that involve full table scans, but the benefit is limited.

Additionally, although TiFlash is columnar storage, it has only coarse-grained indexes, not fine-grained ones, so it cannot fetch specific rows at minimal cost and has to scan many rows. If your workload has no point queries or small-range filter queries on a column, you can drop that column’s index from TiKV. However, every subsequent query that filters on that column will then be a large scan in TiFlash, and TiFlash cannot sustain very high concurrency for such scans. It is therefore not suited to high-concurrency TP-type queries.
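To make the coarse-grained versus fine-grained distinction concrete, here is a toy Python sketch of a min/max index kept per block of rows. The function names and the block ("pack") size are hypothetical, and TiFlash's real index layout is not described in this thread, so treat this only as an illustration of why a coarse index still has to scan whole blocks.

```python
def build_minmax(values, pack_size):
    """Build a coarse index: one (start, min, max) entry per pack of rows."""
    packs = []
    for start in range(0, len(values), pack_size):
        chunk = values[start:start + pack_size]
        packs.append((start, min(chunk), max(chunk)))
    return packs


def lookup_coarse(values, packs, pack_size, target):
    """Find rows equal to target; return (row positions, rows scanned).

    A pack can be skipped only if target is outside its [min, max] range.
    Every row of a candidate pack must still be scanned.
    """
    hits = []
    scanned = 0
    for start, lo, hi in packs:
        if lo <= target <= hi:
            end = min(start + pack_size, len(values))
            for i in range(start, end):
                scanned += 1
                if values[i] == target:
                    hits.append(i)
    return hits, scanned
```

When the column is roughly sorted, most packs are skipped. When the values are scattered, every pack's range covers the target and the lookup degenerates into a full scan, which is the behaviour described above for filters on a column whose TiKV index has been dropped.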

| username: wink

As for the question that has not been answered yet: MPP takes the join computation that would otherwise run on a single TiDB node and distributes it across multiple TiFlash nodes. So if the intermediate result of the join is relatively large, using MPP is the right choice.
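The idea of redistributing rows on the join key so that each node joins only its own share can be sketched in plain Python. This is a simplified single-process model, not TiFlash code: `exchange`, `local_hash_join`, `mpp_join` and the `num_nodes` parameter are hypothetical names, and the "nodes" are just lists.

```python
from collections import defaultdict


def exchange(rows, key, num_nodes):
    """Hash-partition rows on the join key (the data shuffle step)."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        partitions[hash(row[key]) % num_nodes].append(row)
    return partitions


def local_hash_join(left, right, key):
    """Ordinary hash join, as each node would run on its own partition."""
    build = defaultdict(list)
    for r in right:
        build[r[key]].append(r)
    return [{**l, **r} for l in left for r in build.get(l[key], [])]


def mpp_join(left, right, key, num_nodes=3):
    """Shuffle both inputs on the key, join per node, then merge results."""
    left_parts = exchange(left, key, num_nodes)
    right_parts = exchange(right, key, num_nodes)
    result = []
    for node in range(num_nodes):
        result.extend(local_hash_join(left_parts[node], right_parts[node], key))
    return result
```

Because both inputs are partitioned with the same hash on the same key, matching rows always land on the same node, so each node's local join is complete on its own and the large intermediate result never has to be built in one place.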

| username: flow-PingCAP

Even with only one TiFlash node, AP analysis performance usually benefits. The reason is that TiFlash’s storage engine and computing engine are optimized for AP computation, which makes them more efficient for these workloads than TiKV and TiDB. The number of TiFlash replicas is not related to query performance; it only affects high availability. Even with a single replica, TiDB will try to distribute the table’s data across multiple TiFlash nodes so that it can use the concurrency of all of them. Adding TiFlash nodes therefore generally improves query performance.
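A rough way to picture the difference between replicas and nodes is a toy model in which one replica's Regions are spread evenly over the nodes and each node scans its share in parallel. The functions and the fixed `seconds_per_region` cost below are assumptions made purely for illustration; real scheduling and scan costs are more involved.

```python
def regions_per_node(num_regions, num_nodes):
    """Spread one replica's Regions as evenly as possible across nodes."""
    base, extra = divmod(num_regions, num_nodes)
    return [base + (1 if i < extra else 0) for i in range(num_nodes)]


def scan_wall_time(num_regions, num_nodes, seconds_per_region=1.0):
    """Nodes scan in parallel, so the busiest node sets the wall time."""
    return max(regions_per_node(num_regions, num_nodes)) * seconds_per_region
```

In this model, going from one node to three with the same single replica cuts the scan from 10 Region-units to 4, while adding a second replica changes nothing about how much data each query has to read.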
