What are the solutions for splitting large tables in TiDB big data storage?

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB大数据存储拆分大表有什么解决方案?

| username: Carl

The data in one table can change by tens of millions of rows every day. Over time, queries slow down and storage keeps growing. What solutions are available?

So far we have thought of range partitioning and hash partitioning. Which of these two options is better, and are there any other solutions?

| username: Carl | Original post link

If we partition the table, developers may need to adjust their query logic. Will the changes be significant?

If we don't partition, can the data instead be separated out by date, for example moving anything older than one month into cold storage as historical data?

| username: realcp1018 | Original post link

It won't necessarily be slow; as long as reads and writes are single-point, it will still be fast, and millions of changes per day is not much. If the writes are concentrated there will be hotspot issues, but solving that with hash partitioning makes data cleanup difficult, whereas date range partitioning makes cleaning up expired data convenient. As for separating out historical data later, the difficulty is the same whether you partition or not.
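A minimal sketch of the date range partitioning suggested above, assuming a hypothetical device-heartbeat table (all table, column, and partition names here are illustrative, not from the thread; the syntax follows TiDB/MySQL range partitioning):

```sql
-- Hypothetical heartbeat table, range-partitioned by day.
-- Note: in TiDB (as in MySQL) the partitioning column must be part
-- of every unique key, including the primary key.
CREATE TABLE device_heartbeat (
    device_id BIGINT NOT NULL,
    beat_time DATETIME NOT NULL,
    payload   JSON,
    PRIMARY KEY (device_id, beat_time)
)
PARTITION BY RANGE (TO_DAYS(beat_time)) (
    PARTITION p20240101 VALUES LESS THAN (TO_DAYS('2024-01-02')),
    PARTITION p20240102 VALUES LESS THAN (TO_DAYS('2024-01-03'))
);

-- Expired data can then be removed cheaply, one partition at a time,
-- instead of running large DELETE statements:
ALTER TABLE device_heartbeat DROP PARTITION p20240101;
```

Rows whose date falls outside every defined range are rejected, so with this layout future partitions have to be created before their data arrives.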

| username: tidb菜鸟一只 | Original post link

Do the frequent changes refer to inserts, or to updates and deletes?

| username: Carl | Original post link

On the transactional side, the business mainly queries and updates the current day's data. Beyond separating out the current day's data, the historical data cannot be deleted, and the analytical reports cover up to one year.
So, is there a good strategy to automatically partition and store the business data by date without affecting the analysis team's queries and reports?

| username: Carl | Original post link

REPLACE-style writes, covering both update and insert; simply put, it's device heartbeat data, plus operation logs such as switch on/off events.

| username: TiDBer_pkQ5q1l0 | Original post link

We have this business scenario too, and a latency of tens of milliseconds seems acceptable. For the analytics workload, batch updates are synchronized to a separate TiDB cluster dedicated to analysis.

| username: Carl | Original post link

Currently, this table grows by about 6 GB per day. If we don't split it, take cold backups, or move historical data out in a timely manner, performance will suffer, since the business has no need to query such a large table. And if we don't regularly delete or back up the data, it will be hard to back up or migrate later. Can partitioning solve my problem?

| username: TiDBer_pkQ5q1l0 | Original post link

It needs to be archived regularly.
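One way to archive regularly, assuming a range-partitioned table like the hypothetical `device_heartbeat` with a daily partition `p20240101` (all names here are illustrative; `EXCHANGE PARTITION` requires identical schemas on both tables and is only available in recent TiDB versions, so check your version's documentation first):

```sql
-- Non-partitioned archive table with a schema identical to the
-- hypothetical partitioned table.
CREATE TABLE device_heartbeat_archive (
    device_id BIGINT NOT NULL,
    beat_time DATETIME NOT NULL,
    payload   JSON,
    PRIMARY KEY (device_id, beat_time)
);

-- Swap one expired partition out into the standalone archive table,
-- then drop the now-empty partition from the hot table.
ALTER TABLE device_heartbeat
    EXCHANGE PARTITION p20240101 WITH TABLE device_heartbeat_archive;
ALTER TABLE device_heartbeat DROP PARTITION p20240101;
```

The archive table can then be backed up or migrated on its own without touching the hot table.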

| username: realcp1018 | Original post link

If what you care about is later data migration and the like, then splitting into separate tables (sharding) beats partitioning, although the application code becomes more cumbersome to write. Performance really isn't the deciding factor: the advantage of partitioned tables is data cleanup. Because TiDB is distributed by design, a partitioned table doesn't necessarily perform any better than an ordinary large table; TiDB is inherently distributed, essentially doing the sharding for you.

To put it plainly: if you're worried about future archiving and backup, sharding is best, and partitioning alone won't solve that problem. If you're worried about performance, it's much the same whether you shard or not. Partitioning only addresses write hotspots and data cleanup; just remember that.

If it were me, I would definitely go with date range partitioning.
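A date range scheme does need periodic maintenance, since TiDB (like MySQL) does not automatically create future range partitions. A scheduled job can add the next day's partition ahead of time and drop the one that has aged out of retention; a sketch with hypothetical names and dates:

```sql
-- Run daily from an external scheduler (cron or similar).
-- Assumes the table has no MAXVALUE catch-all partition, since new
-- range partitions cannot simply be appended after one.

-- Add tomorrow's partition before its data starts arriving...
ALTER TABLE device_heartbeat
    ADD PARTITION (
        PARTITION p20240103 VALUES LESS THAN (TO_DAYS('2024-01-04'))
    );

-- ...and drop the partition that has fallen out of the retention window.
ALTER TABLE device_heartbeat DROP PARTITION p20231204;
```

Exporting or archiving the partition's data before the `DROP PARTITION` step keeps the history recoverable.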

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.