Consultation on Migrating from MySQL to TiDB

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: mysql改成tidb咨询

| username: 开心大河马

Changing the architecture from MySQL to TiDB:

A stream table will be converted into a single table in TiDB:

  1. In the most extreme case, this single table might receive around 100 million rows of data per day. Will using an auto-increment primary key for the ID cause write hotspot issues? Because historical data has to be migrated, it is not possible to use random IDs directly. Is there a concrete, feasible optimization? And if we switch to TiDB later, is it feasible to change to random IDs online?
  2. If this table is also a partitioned table in TiDB, data is retained for at most 2 months, and older data is swapped out with EXCHANGE PARTITION (a rough sketch of that operation follows below), will the number of regions in this table decrease accordingly? Also, if the ID is changed to random as described in point 1, can duplicate IDs appear in the exchanged-out partitions?
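
For context, the exchange-partition operation referred to in point 2 would look roughly like the statement below; the table and partition names (`flow_log`, `flow_log_archive`, `p202401`) are made up for illustration, and on some TiDB versions the feature may first need to be enabled via the `tidb_enable_exchange_partition` variable.

-- `flow_log` is the partitioned table; `flow_log_archive` is an empty,
-- non-partitioned table with an identical schema, used here as the archive.
ALTER TABLE flow_log EXCHANGE PARTITION p202401 WITH TABLE flow_log_archive;
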
| username: tidb菜鸟一只 | Original post link

  1. As I understand it, you can use the auto_random feature so that writes of new data are scattered and do not form a hotspot. For the historical data migration, you can use the AUTO_RANDOM_BASE parameter to make sure the IDs generated for new data do not overlap with the old data (see the sketch after this list).
  2. The values generated by auto_random are unique. Once a random value has been generated, the same value will not be generated again later, even if the partition containing it is exchanged out.
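
A minimal sketch of this approach follows; the table name `flow_log`, its columns, and the base value 5000000001 are placeholders I made up, not values from this thread, so adjust them to the real schema and to the largest migrated ID.

-- New table: AUTO_RANDOM scatters the IDs of fresh inserts across regions.
-- AUTO_RANDOM requires a BIGINT clustered primary key.
CREATE TABLE flow_log (
  id BIGINT NOT NULL /*T![auto_rand] AUTO_RANDOM(5) */,
  created_at DATETIME NOT NULL,
  PRIMARY KEY (id) /*T![clustered_index] CLUSTERED */
);

-- Importing historical rows with their original IDs requires allowing
-- explicit values for the AUTO_RANDOM column in the importing session.
SET @@allow_auto_random_explicit_insert = 1;

-- After the migration, rebase the incremental part of newly generated IDs
-- so that it starts above the largest migrated ID (placeholder value here).
ALTER TABLE flow_log AUTO_RANDOM_BASE = 5000000001;
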
| username: 开心大河马 | Original post link

I’ll test it, thanks.

| username: 有猫万事足 | Original post link

  1. If you want to change the type of this field, the solution has already been provided above.
    If you don’t want to change the type of this field, you can use a non-clustered table + SHARD_ROW_ID_BITS + PRE_SPLIT_REGIONS:
    The number of pre-split regions can be adjusted as needed. From my personal experience, if only a few machines show high load during heavy writes instead of all of them, the number of splits may be too small.
    The table creation statement is similar to the one below; the /*T! ... */ commented parts are the mandatory pieces:
CREATE TABLE `UserCardLog_date` (
  `Id` int(11) NOT NULL AUTO_INCREMENT,
  -- NONCLUSTERED is required: SHARD_ROW_ID_BITS scatters the hidden _tidb_rowid,
  -- which only exists for tables without a clustered primary key.
  PRIMARY KEY (`Id`) /*T![clustered_index] NONCLUSTERED */
) ENGINE=InnoDB /*T! SHARD_ROW_ID_BITS=4 PRE_SPLIT_REGIONS=4 */;

Documentation:

  2. The IDs will not be duplicated.
| username: 开心大河马 | Original post link

Hello, I would like to ask whether you know of a concrete formula for calculating SHARD_ROW_ID_BITS, or what it is mainly related to, such as the number of underlying TiKV nodes, the insert concurrency, or something else. Currently, in our tests, we have about 4.8 billion rows and around 23,000 regions in the underlying storage.

| username: 有猫万事足 | Original post link

Prerequisites:
PRE_SPLIT_REGIONS=4 means pre-splitting into 2^4 = 16 regions.
And PRE_SPLIT_REGIONS cannot be larger than SHARD_ROW_ID_BITS.

Relationship with the number of TiKV nodes:
In my actual usage I have 4 TiKVs. With PRE_SPLIT_REGIONS=2 (pre-splitting into 4 regions), one TiKV's write load never ramps up; with PRE_SPLIT_REGIONS=4 (pre-splitting into 16 regions), the CPU and IO of all 4 TiKVs reach a fairly high level under heavy writes. If you have more TiKV instances, I think you can pre-split correspondingly more regions. In my experience, making the number of pre-split regions merely equal to the number of TiKV instances does not necessarily balance the load; somewhat more regions tend to balance it better.
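
For what it's worth, one way to check how a table's regions and their leaders are actually spread across the TiKV stores (using the `UserCardLog_date` example table from above) is:

-- Each row is one region of the table; LEADER_STORE_ID shows which TiKV
-- store holds that region's leader, i.e. where writes to that range land.
SHOW TABLE `UserCardLog_date` REGIONS;

If most leaders sit on only one or two stores during heavy writes, the pre-split count is probably too small to keep every TiKV instance busy.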

Relationship with insertion concurrency:
In my case, the number of connections is on the higher side, but the TPS/QPS is very low, so the actual concurrency on any single table is even lower. I don't have experience in this area and can't offer any suggestions.

| username: 开心大河马 | Original post link

Sure, thank you very much for your help.

| username: zhanggame1 | Original post link

The write performance is related to the number of TiKV physical nodes and the performance of NVMe drives. Both factors need to be comprehensively considered and tested.

| username: 开心大河马 | Original post link

Okay. The TiKV nodes will be physical machines, while the other components will run on virtual machines.

| username: zhanggame1 | Original post link

TiKV can run on a virtual machine, but it is recommended to pass a dedicated data disk directly through to the VM so that IO performance is not lost to disk virtualization. For example, you can configure NVMe disk passthrough in ESXi. Refer to the following:

VMware ESXi 6.7 Virtual Machine Using Host Passthrough Devices - 人走茶良 - 博客园 (cnblogs.com)

| username: redgame | Original post link

Try to use a physical machine as much as possible; virtual machines won’t be ideal…

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.