Migrating Data from Two Databases to TiDB

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 把两个数据数据库的数据迁移到tidb

| username: TiDBer_QHSxuEa1

Currently, we need to migrate the data from two source databases into TiDB. I would like to ask everyone two questions:

  1. How do we handle auto-increment primary key conflicts between the two sources? For example, table a in database A has a row with id=1, and table b in database B also has a row with id=1. How can we migrate the data from these two tables into table c in the target TiDB cluster?
  2. Because of the merge, the table structure in the downstream TiDB target will need adjustments, meaning the source and target table structures will differ. Can this kind of migration be accomplished with TiDB’s built-in tools, such as Dumpling and Lightning?
| username: 小毛毛虫 | Original post link

You need to use DM for this. DM supports operations such as merging databases and tables.
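For reference, here is a minimal sketch of what the routing section of a DM task file could look like for this scenario; the source IDs, schema names, and database names below are placeholders, not details from the original post:

```yaml
# Sketch: route table a (database A) and table b (database B)
# into the same target table c in TiDB.
mysql-instances:
  - source-id: "source-a"        # placeholder source ID for database A
    route-rules: ["route-a"]
  - source-id: "source-b"        # placeholder source ID for database B
    route-rules: ["route-b"]

routes:
  route-a:
    schema-pattern: "db_a"       # placeholder source schema
    table-pattern: "a"
    target-schema: "target_db"
    target-table: "c"
  route-b:
    schema-pattern: "db_b"
    table-pattern: "b"
    target-schema: "target_db"
    target-table: "c"
```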

| username: zhaokede | Original post link

You can try using an intermediate table for transformation and data cleansing, and then merge the result into the production table. If the data volume is not too large, this is worth a try.
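A rough sketch of this staging approach (all object names here are made up for illustration):

```sql
-- Create a staging table with the same structure as the production table c.
CREATE TABLE staging_c LIKE c;

-- ... load the raw rows from both sources into staging_c,
--     then run whatever cleanup/transformation SQL is needed ...

-- Finally, merge the processed rows into the production table.
INSERT INTO c SELECT * FROM staging_c;
```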

| username: 小龙虾爱大龙虾 | Original post link

“The data with id=1 in table a of database A” and “the data with id=1 in table b of database B” do not conflict, just go ahead and do it.
What exactly do you mean by different table structures? :rofl:

| username: TiDBer_QHSxuEa1 | Original post link

I didn’t express it clearly. The data from tables a and b is migrated together into table c in the target TiDB database. Tables a, b, and c all have the same structure, including an auto-increment primary key with the same column name. By “different table structures” I mean that, due to business adjustments and other factors, the table structure in the target database may change, so there may be differences between the source and target table structures.

| username: Kongdom | Original post link

:thinking: In cases like this, we usually do custom development on top of ETL tools ourselves. For primary key conflicts, we generally add a prefix or an extra column that identifies the data source.
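A minimal sketch of the extra-column idea, with hypothetical table and column names: once a source identifier is part of the key, id=1 from database A and id=1 from database B no longer collide.

```sql
-- Hypothetical target table: the source identifier plus the
-- original auto-increment id form a composite primary key.
CREATE TABLE c (
    source_db VARCHAR(32) NOT NULL,  -- e.g. 'db_a' or 'db_b'
    id        BIGINT      NOT NULL,  -- original auto-increment id
    PRIMARY KEY (source_db, id)
);
```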

| username: 有猫万事足 | Original post link

DM can do it, and it can also extract the source database name into an additional column; that column and the original id can then form a composite primary key.
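For reference, newer DM versions support this through the `extract-schema` / `extract-source` options of a table routing rule. A sketch with placeholder names (the extra column must already exist in the target table):

```yaml
routes:
  merge-rule:
    schema-pattern: "db_*"         # matches both source schemas (placeholder)
    table-pattern: "*"
    target-schema: "target_db"
    target-table: "c"
    extract-schema:
      schema-regexp: "(db_.*)"     # the captured source schema name...
      target-column: "source_db"   # ...is written into this column of table c
```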

However, the problem with DM is that it will be very slow if the source tables hold a large amount of data. To go fast, you still need Lightning or custom development with other ETL tools.

| username: 小龙虾爱大龙虾 | Original post link

Then what identifies a unique record in your table c?

| username: xingzhenxiang | Original post link

Take a look at DM’s shard merging, which merges sharded databases and tables into a single database.

| username: changpeng75 | Original post link

What type of database is the source? If it is a relational database, it is best to convert the data to the new table structure on the source side first, and then migrate with tools like Dumpling and Lightning. Migration tools may not adapt well to logical changes.
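A sketch of that flow, with placeholder hosts, credentials, and paths:

```shell
# Export the (already converted) source data with Dumpling.
tiup dumpling -h source-host -P 3306 -u root -p 'password' \
  -B db_a --filetype sql -o /data/export

# Import the exported files into TiDB with Lightning;
# the data source and target are set in the .toml config file.
tiup tidb-lightning -config tidb-lightning.toml
```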

| username: forever | Original post link

We always process the data at the source end and then deliver it to the target end.

| username: Jellybean | Original post link

He probably means that the data will be merged into the same table in the next step.

| username: TIDB-Learner | Original post link

DM should be able to meet the requirements.

| username: 随缘天空 | Original post link

If the primary keys of tables a and b are not referenced by other tables in the source databases, it’s fine: you can use ETL tools to extract and transform the source data first. If there are such references, though, it will be difficult to handle.

| username: Jellybean | Original post link

In this situation, since your business table structure adjustments are relatively significant, automatic synchronization tools may not be suitable. You can check if DM’s sharding and aggregation synchronization feature fits your scenario.

If it doesn’t work, you will likely need to develop some automatic synchronization tools, such as using Flink to achieve real-time automatic aggregation and synchronization.

| username: YuchongXU | Original post link

I suggest handling this with ETL.

| username: 江湖故人 | Original post link

Based on your description, it sounds like you need an ETL tool. I recommend checking out Kettle.

| username: Kongdom | Original post link

:+1: I also vote for Kettle.

| username: forever | Original post link

Personally, I feel that DataX is somewhat more lightweight than Kettle.

| username: Kongdom | Original post link

From a deployment perspective, Kettle should be a bit simpler :yum: since DataX requires an additional Python environment.