Label Data Source After Data Migration

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 数据迁移后标注数据来源

| username: TiDBer_QHSxuEa1

Hello everyone. After migrating data, I want to mark each row with its source so that it is easy to tell where the data came from. The approach I am considering is to add a column to the table and, once the migration is complete, run an update on that column, for example update table set data_source='migrated data'. This works fine for a single table, but is it still practical when there are many tables? Is there a better way?
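For many tables, the two statements described above can be generated instead of written by hand. The following is a minimal sketch, assuming MySQL-compatible syntax as used by TiDB. The table names, the `VARCHAR(64)` type, and the helper `build_label_statements` are hypothetical and only illustrate the idea; the label is assumed to be a trusted constant, not user input.

```python
def build_label_statements(tables, source_label, column="data_source"):
    """Return ALTER/UPDATE statements that tag every row of each table."""
    escaped = source_label.replace("'", "''")  # double single quotes for the SQL literal
    statements = []
    for table in tables:
        # Add the marker column (nullable, so existing rows start as NULL).
        statements.append(
            f"ALTER TABLE `{table}` ADD COLUMN `{column}` VARCHAR(64) NULL;"
        )
        # Label only the rows that have no marker yet.
        statements.append(
            f"UPDATE `{table}` SET `{column}` = '{escaped}' "
            f"WHERE `{column}` IS NULL;"
        )
    return statements


if __name__ == "__main__":
    for stmt in build_label_statements(["orders", "customers"], "migrated data"):
        print(stmt)
```

On large tables, a single UPDATE touches every row in one transaction, so it may need to be split into batches.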

| username: Miracle | Original post link

If every row needs to be marked, the only way is to add a column, right?

| username: 春风十里 | Original post link

Add a column, such as data_source, and either populate it during the migration or update it afterwards.

| username: zhanggame1 | Original post link

There is no better way.

| username: okenJiang | Original post link

Actually, there is, but it is not mentioned in the documentation (not sure why).

You can refer to this test example for usage: tiflow/dm/tests/extend_column/conf/dm-task.yaml at master · pingcap/tiflow · GitHub

Here is the issue: Distinguish data source when merge shared tables with no shared key · Issue #3340 · pingcap/tiflow · GitHub

| username: andone | Original post link

Add a field with a default value.

| username: Kongdom | Original post link

:yum: Add a column, then change its default value before each migration. I'm not sure whether this is faster than an update.
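The default-value idea could look roughly like this. It is a sketch under assumptions: MySQL-compatible DDL, a hypothetical column `data_source` of type `VARCHAR(64)`, and hypothetical helper names. It only labels rows correctly if the migration tool's INSERT statements omit the marker column, so that the default is applied.

```python
def add_source_column(table, first_source, column="data_source"):
    """One-time DDL: rows already in the table receive first_source."""
    return (
        f"ALTER TABLE `{table}` ADD COLUMN `{column}` "
        f"VARCHAR(64) NOT NULL DEFAULT '{first_source}';"
    )


def switch_default(table, next_source, column="data_source"):
    """Run before each migration so newly inserted rows get next_source."""
    return (
        f"ALTER TABLE `{table}` ALTER COLUMN `{column}` "
        f"SET DEFAULT '{next_source}';"
    )


if __name__ == "__main__":
    print(add_source_column("orders", "original"))
    print(switch_default("orders", "source_b"))
```

One caveat with this design: any rows the application writes while the migration is running pick up the same default, so it fits best when the target table receives no other writes during the migration window.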

| username: Jolyne | Original post link

We add a timestamp column to uniformly record the migration time.

| username: come_true | Original post link

Adding a timestamp is fine.

| username: Kongdom | Original post link

:yum: There’s another method: rename the table after each migration, with one table per data source, which is similar to sharding. Renaming should be particularly fast. Use views for querying.
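The rename-plus-view idea can be sketched as follows, assuming MySQL-compatible syntax and that all per-source tables share the same columns. The table names, the view name, and both helpers are hypothetical.

```python
def rename_statement(old_name, new_name):
    """Build the statement that moves a freshly migrated table aside."""
    return f"RENAME TABLE `{old_name}` TO `{new_name}`;"


def build_union_view(view_name, tables_by_source):
    """Build a view that unions per-source tables and exposes the source name."""
    selects = [
        f"SELECT '{source}' AS data_source, t.* FROM `{table}` AS t"
        for source, table in tables_by_source.items()
    ]
    body = "\nUNION ALL\n".join(selects)
    return f"CREATE OR REPLACE VIEW `{view_name}` AS\n{body};"


if __name__ == "__main__":
    print(rename_statement("orders", "orders_src_a"))
    print(build_union_view(
        "orders_all",
        {"src_a": "orders_src_a", "src_b": "orders_src_b"},
    ))
```

Here the source label lives in the view definition rather than in the rows, so no per-row update is needed; the trade-off is that the view must be rebuilt each time a new source table is added.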

| username: tidb菜鸟一只 | Original post link

Wouldn’t it be faster to add a comment to the table?

| username: 小龙虾爱大龙虾 | Original post link

Different rows in the original poster's table come from different sources, so a table-level comment cannot distinguish them.

| username: tidb菜鸟一只 | Original post link

If rows from different data sources end up in the same table, add a column to mark the source. How you populate it depends on the tool. With Kafka or Kettle, you can add an extra column and set its value during the migration. With DM or CDC, this probably isn't supported at the moment. Using update is likely the least efficient option; setting a default value would be better.
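When the migration runs through a custom pipeline, the marker can be attached to each row in flight, before it is written to the target. This is a generic sketch, not Kafka or Kettle code; the row format (dicts), the function `tag_rows`, and the label values are assumptions for illustration.

```python
def tag_rows(rows, source_label, column="data_source"):
    """Yield a copy of each row with the source marker added."""
    for row in rows:
        tagged = dict(row)  # copy so the input row is not mutated
        tagged[column] = source_label
        yield tagged


if __name__ == "__main__":
    source_rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}]
    for row in tag_rows(source_rows, "mysql-01"):
        print(row)
```

Because the value is set before the insert, no follow-up update on the target table is needed.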

| username: dba远航 | Original post link

The path during backup can include the name of the source.