How to Use Full Table Group Query in TiDB Partitioned Tables

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb分区表如何使用全表分组查询

| username: TiDB小萌新

[TiDB Usage Environment] Production Environment
[TiDB Version] 5.4

Two tables have the same data, each with around 90 million records. One is a partitioned table, and the other is a regular table. When using the partitioned table to perform a group by, the count cannot be obtained, but it works with the regular table. How can I specify that the partitioned table does not use partition union all and instead performs a full table query?

| username: Billmay表妹 | Original post link

There are some differences between partitioned tables and non-partitioned tables when using GROUP BY in TiDB. When using GROUP BY on a partitioned table, you need to specify the partition column; otherwise, you will encounter issues similar to what you described, where the data in the partitioned table cannot be correctly aggregated. Here is an example SQL statement that can correctly aggregate data on a partitioned table:

SELECT partition_column, COUNT(*) FROM partitioned_table GROUP BY partition_column;

Here, partition_column is the name of the partition column, and partitioned_table is the name of the partitioned table. This statement will perform a full table scan on the partitioned table but will only aggregate data for each partition, not the entire table. If you want to use GROUP BY on a non-partitioned table, you can directly use a SQL statement like the following:

SELECT column, COUNT(*) FROM table GROUP BY column;

Here, column is the column name in the non-partitioned table, and table is the name of the non-partitioned table. This statement will perform a full table scan on the entire table and aggregate data for each group. If you want to perform a full table group query on a partitioned table, you can use UNION ALL to combine the results of each partition, for example:

SELECT partition_column, COUNT(*) FROM partitioned_table WHERE partition_column = 'partition1' GROUP BY partition_column
UNION ALL
SELECT partition_column, COUNT(*) FROM partitioned_table WHERE partition_column = 'partition2' GROUP BY partition_column
UNION ALL
...

This statement will perform a separate GROUP BY for each partition and then combine the results. However, this method requires manually specifying the name of each partition, which is cumbersome. Therefore, it is recommended to use the first method on a partitioned table, i.e., specifying the partition column for GROUP BY.

| username: TiDB小萌新 | Original post link

I am using partitioning by creation time, but my groupby field does not use the partition field. Using union all to merge results in OOM, and the final result set has around 50 million rows.

| username: 有猫万事足 | Original post link

The internal implementation of partitioned tables is that each partition has an independent tableid. That is, as you mentioned, each partition is a table, and they are unioned all together.

This implementation cannot be changed.
However, your issue is not unsolvable. The key problem is that the hashagg of the partitioned table is executed in TiDB and not pushed down to TiFlash.
So, I think you can consider trying MPP. MPP should directly compute a result for you without going through TiDB hashagg.

You can try enabling MPP at the session level to see if there is any improvement.