Is TiDB suitable for using partitioned tables?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb适合使用分区表吗

| username: Mirror

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] v7.5
[Reproduction Path] What operations were performed when the issue occurred
[Encountered Issue: Issue Phenomenon and Impact] We have a very large table stored in TiDB with billions of records, but the application says it only needs to store data for six months. Can we use partitioned tables, and are there any usage limitations for partitioned tables?
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots / Logs / Monitoring]

| username: lemonade010 | Original post link

You can refer to the relevant documentation to see the partitioning methods and limitations that suit you: 分区表 | PingCAP 文档中心

| username: YuchongXU | Original post link

Not suitable

| username: Kongdom | Original post link

According to my understanding of the requirements, a history table should be created, and data older than six months should be moved to the history table. There is no need to use partition tables to distinguish between data older than six months and data within the last six months.

| username: DBAER | Original post link

In terms of data volume, either partitioned tables or requiring the business side to maintain a date table are both acceptable.

| username: Mirror | Original post link

It’s not easy to handle historical tables. Suppose there’s a table called Table A, and if it exceeds 1/2, you rename it to A_xxxx and rebuild Table A. However, this means that if users need to query data from 1/1, they have to query A_xxxx, and the front-end application needs to be modified, which they don’t agree with. Do you have any good solution suggestions?

| username: Mirror | Original post link

Will partitioned tables be slow to query? Does anyone have actual production experience to share?

| username: Mirror | Original post link

Is the query slow?

| username: zhimadi | Original post link

One advantage of using partitioned tables in a distributed system is the ease of management, such as deleting by partition.

| username: Mirror | Original post link

Yes, but I’m worried that the query will be slow.

| username: porpoiselxj | Original post link

Personally, I believe that apart from tables that require regular cleaning of expired data, others do not need to use partitioned tables. Since TiDB uses LSM structure to store data and indexes, a large amount of data will not have a significant impact on query efficiency. Partitioned tables will have limited impact on performance unless you can include partition fields in the conditions every time you use them, and it might even reduce performance.

| username: zhanggame1 | Original post link

There are many bugs with partitioned tables, such as issues with partition statistics collection in version 7.5.0, and merge statistics often fail.

| username: zhanggame1 | Original post link

It may slow down, and partition table statistics are prone to issues.
Refer to my previous test

| username: Mirror | Original post link

The main purpose of partitioning is still to archive and clean up data. The application says that data older than six months is not needed, and the boss requires it to be cleaned up.

| username: dba远航 | Original post link

Using partitioned tables can improve query performance, but only if all queries use the partition fields for filtering; otherwise, performance will decrease instead of increasing.

| username: zhanggame1 | Original post link

If the amount of data to be cleaned is large, it is more convenient to use partition tables. Otherwise, deleting data will be very slow, and reclaiming space will take even longer.

| username: WinterLiu | Original post link

The person above is right. The advantage of partitioned tables is that when deleting expired data, you can directly drop the partition, which is faster than delete.

| username: residentevil | Original post link

Analyzing from multiple scenarios, if the business table can be split according to a certain time field, it is recommended to create a single table in the format of year, month, and day. Advantages: Deleting a single table does not affect other tables, and the business logic is also simple.

| username: Kongdom | Original post link

What I mean is that there are two tables with the same structure, Table A and Table B. Table A only retains data for six months, and business data is written into Table A in real-time. At the end of each day, data older than six months is migrated from Table A to Table B. This way, Table B contains all the data older than six months. This ensures data retention without affecting the use of Table A.

| username: tony5413 | Original post link

Have you measured the data volume for the past six months?