Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: 咨询一个问题 tidb的分布式数据库用分区表。怎么使用才会比普通表快?
【TiDB Usage Environment】Production Environment / Testing / PoC
【TiDB Version】5.4.3
【Reproduction Path】What operations were performed when the issue occurred
【Encountered Issue: Issue Phenomenon and Impact】
Can you provide specific examples of quick usage methods?
【Resource Configuration】
【Attachments: Screenshots / Logs / Monitoring】
In MySQL, which is based on B+ trees, the primary function of partitioned tables is to reduce the size of a single tree, lower the total number of levels, improve query efficiency, or avoid MySQL’s single table size limit (which might be around 20TB, though I don’t remember the exact figure). However, in distributed databases like TiDB, which are based on LSM trees, data is typically scattered by dividing regions. For normal OLTP queries, adding an index theoretically would be faster and more convenient than converting to partitioned tables. The scenario I can think of might be for OLAP purposes. For example, if each processing involves all data for a month or a day, then directly reading the entire partition based on time might reduce the time for secondary table lookups compared to querying based on an index, potentially making it faster. (But I haven’t experimented with this; it’s just a guess.)
Since TiDB itself is a distributed table, can I understand it as a single partition? Then what is the significance of partitioned tables within TiDB?
Partitioned tables in TiDB are similar to those in traditional relational databases, helping to avoid performance jitter caused by large transactions such as archiving.
Additionally, it can be used in conjunction with placement rules for hot and cold data separation.
Partition types in TiDB. The currently supported types include Range Partition, List Partition, List COLUMNS Partition, and Hash Partition. Range Partition, List Partition, and List COLUMNS Partition can be used to address performance issues caused by large-scale deletions in business scenarios, supporting quick partition deletion. Hash Partition can be used to distribute data in scenarios with high write volumes.
All SQL operations that include partition keys generally perform better than those on regular tables.
I think the biggest advantage of partitioned tables is cleaning up historical data (time partition)
But in fact, for OLAP scenarios, isn’t TiDB more recommended to use TiFlash directly? TiFlash is columnar storage, and partitions don’t apply to TiFlash, right?
I have conducted performance tests on partitioned and non-partitioned tables in version v6. In actual performance, partitioned tables are slower than non-partitioned tables. During execution, there are many more cop operations. The advantage of adding partitions is the ability to quickly delete related data. Since we have a lot of table data, deleting with the delete command is very painful and cannot reclaim the occupied space. Using partitions, we can directly drop the partition. I consulted relevant technical personnel, and there was no better way to handle it. I am using the range partitioning method.
I have also been thinking about this question. What is the greatest role of partitioned tables in a distributed system?
There is also partition pruning.
If partition pruning is not possible, the performance might be worse than not partitioning at all.
This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.