The issue of data skew during writes with TiDB range partitioning

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb range分区时写入数据倾斜问题

| username: TiDBer_Xy7fsN7j

CREATE TABLE ServiceFlowKpiRecords (
  node_id_str varchar(100) NOT NULL COMMENT 'The hostname of the device, which is the unique identifier of the device in the network',
  subscription_id_str varchar(100) NOT NULL COMMENT 'Subscription name',
  collection_id bigint(20) unsigned NOT NULL COMMENT 'Identifies the sampling round',
  collection_start_time datetime NOT NULL COMMENT 'Identifies the start time of the sampling round',
  name varchar(100) DEFAULT NULL COMMENT '//vlan id',
  channel varchar(100) DEFAULT NULL,
  `index` int(10) unsigned DEFAULT NULL,
  downstream_flow_drop_cnt bigint(20) unsigned DEFAULT NULL COMMENT '//Total number of downstream flow packet drops',
  downstream_flow_pass_cnt bigint(20) unsigned DEFAULT NULL COMMENT '//Total number of downstream flow packets passed',
  downstream_flow_drop_max int(10) unsigned DEFAULT NULL COMMENT '//Maximum number of downstream flow packet drops per second',
  downstream_flow_drop_min int(10) unsigned DEFAULT NULL COMMENT '//Minimum number of downstream flow packet drops per second',
  downstream_flow_drop_rate_max int(10) unsigned DEFAULT NULL COMMENT '//Maximum packet drop rate per second for downstream flow, unit: 10^-5',
  downstream_flow_drop_rate_min int(10) unsigned DEFAULT NULL COMMENT '//Minimum packet drop rate per second for downstream flow, unit: 10^-5',
  downstream_flow_drop_seconds_cnt int(10) unsigned DEFAULT NULL COMMENT '//Accumulated seconds of packet drops in downstream flow',
  downstream_flow_pass_bytes bigint(20) unsigned DEFAULT NULL COMMENT '//Bytes passed in downstream flow',
  downstream_mfr_avg int(10) unsigned DEFAULT NULL COMMENT '//Average downstream rate, kbps',
  upstream_pass_bytes bigint(20) unsigned DEFAULT NULL COMMENT '//Bytes passed in upstream flow',
  upstream_pass_cnt bigint(20) unsigned DEFAULT NULL COMMENT '//Total number of packets passed in upstream flow',
  upstream_drop_cnt bigint(20) unsigned DEFAULT NULL COMMENT '//Total number of packet drops in upstream flow'
)
PARTITION BY RANGE COLUMNS (collection_start_time)
  INTERVAL (60 SECOND)
  FIRST PARTITION LESS THAN ('2024-02-21 00:00:00')
  LAST PARTITION LESS THAN ('2024-02-28 00:00:00');
According to the above table creation statement, the data is written in ascending order of collection_start_time.

| username: TiDBer_jYQINSnf | Original post link

This is quite normal. Initially, the regions being written to sit mostly on one of the TiKV nodes. As you write more data and the regions split, they will be migrated to other nodes, and the hotspot gradually disappears.
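To confirm whether the regions are actually spreading out, you can check the table's region distribution. A minimal check using the table name from the post (the partition name below is only a placeholder, and the per-partition form depends on your TiDB version):

-- List the table's regions and the store (TiKV node) that hosts each leader.
SHOW TABLE ServiceFlowKpiRecords REGIONS;
-- On versions that support it, a single partition can be inspected as well (partition name is illustrative):
-- SHOW TABLE ServiceFlowKpiRecords PARTITION (p_xxx) REGIONS;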

| username: TiDBer_Xy7fsN7j | Original post link

Is it possible to increase the number of regions in advance and let the regions split ahead of time?

| username: TiDBer_jYQINSnf | Original post link

You can try that, but is the write pressure on your TiKV really that high? Is it necessary to go that far?
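If you do want to split ahead of time, TiDB's Split Region statement can pre-split the row-data regions. A rough sketch, assuming the rows are keyed by the implicit _tidb_rowid; the bounds and the region count of 16 are only illustrative and should be sized to your data:

-- Pre-split the table's row data into 16 regions across an illustrative rowid range.
SPLIT TABLE ServiceFlowKpiRecords BETWEEN (0) AND (9223372036854775807) REGIONS 16;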

| username: 随便改个用户名 | Original post link

Take a look at this.

| username: tidb菜鸟一只 | Original post link

For a non-clustered table like yours, TiDB generates an implicit _tidb_rowid by default, and its monotonically increasing values can indeed cause a write hotspot. You can set SHARD_ROW_ID_BITS to alleviate this.
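For an already-created table, SHARD_ROW_ID_BITS can also be changed with ALTER TABLE; a minimal sketch (the value 4, i.e. 2^4 = 16 shards, is just an example):

-- Scatter newly allocated _tidb_rowid values across 16 shards; rows written before the change keep their old rowids.
ALTER TABLE ServiceFlowKpiRecords SHARD_ROW_ID_BITS = 4;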

| username: 有猫万事足 | Original post link

If a table does not have a primary key or the primary key is not an integer type, and the user does not want to generate a randomly distributed primary key ID, TiDB internally has an implicit _tidb_rowid column as the row ID. Without using SHARD_ROW_ID_BITS, the value of the _tidb_rowid column is also basically monotonically increasing, which can lead to write hotspots (refer to the detailed explanation of SHARD_ROW_ID_BITS).

To avoid write hotspot issues caused by _tidb_rowid, you can use the SHARD_ROW_ID_BITS and PRE_SPLIT_REGIONS table options when creating the table (refer to the detailed explanation of PRE_SPLIT_REGIONS).

SHARD_ROW_ID_BITS is used to randomly scatter the row IDs generated by the _tidb_rowid column. PRE_SPLIT_REGIONS is used to pre-split regions after the table is created.
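Applied to the table from the original post, both options are table options placed before the partition clause. A trimmed sketch (most columns omitted for brevity; the values 4 and 4 follow the documentation's common example):

CREATE TABLE ServiceFlowKpiRecords (
  node_id_str varchar(100) NOT NULL,
  collection_start_time datetime NOT NULL
  -- ... remaining columns as in the original DDL ...
)
SHARD_ROW_ID_BITS = 4
PRE_SPLIT_REGIONS = 4
PARTITION BY RANGE COLUMNS (collection_start_time)
  INTERVAL (60 SECOND)
  FIRST PARTITION LESS THAN ('2024-02-21 00:00:00')
  LAST PARTITION LESS THAN ('2024-02-28 00:00:00');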

| username: TiDBer_Xy7fsN7j | Original post link

I tried adding the two options mentioned above, but there is still a clear write hotspot. From the official documentation, don't shard_row_id_bits = 4 and PRE_SPLIT_REGIONS = 4 only apply to non-partitioned tables? My table is range-partitioned by time.

| username: zhanggame1 | Original post link

Does this table have no primary key? In version 7.5.0, the write speed should be significantly slower without a primary key compared to having one.

| username: Soysauce520 | Original post link

The official documentation doesn’t say that partitioned tables cannot pre-allocate regions, right? Where did you see that? Can you provide a link?

| username: TiDBer_Xy7fsN7j | Original post link

No, it’s just that the examples I saw on the official website for creating tables with shard_row_id_bits / PRE_SPLIT_REGIONS are all non-partitioned tables.

| username: TiDBer_Xy7fsN7j | Original post link

I will try adding a primary key and then verify it.

| username: zhanggame1 | Original post link

According to best practices, add an AUTO_RANDOM primary key.

| username: TiDBer_Xy7fsN7j | Original post link

If I add a primary key, it reports that the primary key must include the partition column. But for my business I only need to partition by time.

| username: TiDBer_Xy7fsN7j | Original post link

Here are the results of my data write test after adding the parameters shard_row_id_bits = 4 and PRE_SPLIT_REGIONS = 4. There is still a noticeable data hotspot issue.

| username: zhanggame1 | Original post link

Partitioned tables are a bit more troublesome: both the auto-increment column and the partition key must be included in the primary key. Did you add any other indexes?
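As a sketch of what that composite primary key could look like, following the AUTO_RANDOM suggestion above: the id column is made up, most columns are omitted, and this assumes your TiDB version allows AUTO_RANDOM as the first column of a composite clustered primary key on a partitioned table, so treat it as a shape to verify rather than drop-in DDL:

CREATE TABLE ServiceFlowKpiRecords (
  id bigint AUTO_RANDOM,
  node_id_str varchar(100) NOT NULL,
  collection_start_time datetime NOT NULL,
  -- ... remaining columns as in the original DDL ...
  PRIMARY KEY (id, collection_start_time) CLUSTERED
)
PARTITION BY RANGE COLUMNS (collection_start_time)
  INTERVAL (60 SECOND)
  FIRST PARTITION LESS THAN ('2024-02-21 00:00:00')
  LAST PARTITION LESS THAN ('2024-02-28 00:00:00');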

| username: TiDBer_Xy7fsN7j | Original post link

No, you can look at the main DDL statement in my problem description. Not a single key/index has been added.

| username: TiDBer_Xy7fsN7j | Original post link

We partitioned the data mainly because the volume was too large, so we divided it into 1-minute units. Otherwise, the queries would be very slow.

| username: dba-kit | Original post link

I suggest creating an index on collection_start_time and then changing it to one partition per month.

| username: dba-kit | Original post link

TiDB’s partitioning is quite similar to MySQL’s: each partition is essentially a separate table. The two options you set do take effect on every partition, but because you create a new partition every minute, each new partition is effectively a brand-new table.

By default, tidb_scatter_region is OFF, so writes can easily start before the pre-split regions have been scattered to other TiKV nodes, and the hotspot still ends up concentrated on a single TiKV instance.
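If you want to keep relying on pre-split regions, you can turn the scatter behavior on so the DDL waits for the new regions to be scattered before it returns. A one-line sketch, assuming a release where tidb_scatter_region is still an ON/OFF switch (newer releases change it to an enum-style value, so check your version's docs):

-- Scatter pre-split regions across TiKV stores before CREATE TABLE / new-partition DDL returns.
SET GLOBAL tidb_scatter_region = ON;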

However, I still recommend what I suggested above: use one partition per month and create a separate index on the collection_start_time column. When a TiDB table has too many partitions, cross-partition queries become very inefficient.
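A trimmed sketch of that suggestion, keeping the shard/pre-split options from earlier in the thread; the partition names and boundaries are only illustrative and most columns are omitted:

CREATE TABLE ServiceFlowKpiRecords (
  node_id_str varchar(100) NOT NULL,
  collection_start_time datetime NOT NULL,
  -- ... remaining columns as in the original DDL ...
  KEY idx_collection_start_time (collection_start_time)
)
SHARD_ROW_ID_BITS = 4
PRE_SPLIT_REGIONS = 4
PARTITION BY RANGE COLUMNS (collection_start_time) (
  PARTITION p202402 VALUES LESS THAN ('2024-03-01 00:00:00'),
  PARTITION p202403 VALUES LESS THAN ('2024-04-01 00:00:00'),
  PARTITION p202404 VALUES LESS THAN ('2024-05-01 00:00:00')
);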