Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: When TiDB runs a DDL that changes a column type, which version offers an operation that only modifies the metadata without also modifying the data?
[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version]
[Reproduction Path]
Modifying a table's column type is a reorg DDL, which changes both the metadata and the table data. Rewriting the table data causes TiCDC to lag. Starting from which version can a column type change modify only the metadata, without also rewriting the data, so that TiCDC does not fall behind? (A sketch of the kind of statement in question follows below.)
[Encountered Problem: Problem Phenomenon and Impact]
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]
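For context, a minimal sketch of the kind of reorg DDL being asked about, using a hypothetical table `orders` and column `order_no` (any real statement would of course differ):

```sql
-- Hypothetical example: lossy column type change (INT -> VARCHAR).
-- TiDB rewrites every row during the reorg, and TiCDC has to replicate
-- all of those row changes downstream, which is where the lag comes from.
ALTER TABLE orders MODIFY COLUMN order_no VARCHAR(20);
```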
Which version are you using now?
There will be a significant delay. I tested changing a column from int to varchar on a table with tens of millions of rows, and it took several minutes.
Is there any database whose DDL can change only the metadata without also rewriting the data?
TiDB has supported reorg DDL since version 5.1. Version 6.2 introduced a concurrent DDL framework that allows DDL jobs to be processed asynchronously. Version 6.5 optimizes the impact of DDL on TiCDC when replicating DML data, executing in the order DML → DDL → DML, and also supports filtering out specified types of DDL events.
If a table's column type is changed, can the existing data keep its original type unchanged?
Data backfill is required for lossy changes, but it shouldn't be necessary for lossless ones.
TiCDC is a tool for capturing real-time change logs of TiKV row data. If replication lags after you modify a data type, it means the type change generated a large volume of row changes.
So in TiDB, when does modifying a data type cause data changes? Exactly: during lossy changes, such as converting int to varchar. Typically these are changes from a larger range to a smaller range, or across different data type families. You can refer to the official documentation for the specific conversions that count as lossy changes.
Conversely, this answers your second question: when the type change is lossless, such as converting tinyint to int, the data rows are not rewritten, there is no large batch of modifications, and TiCDC does not lag. A sketch of both cases follows below.
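A minimal sketch of the two cases, assuming a hypothetical table `t` whose column `small_id` starts out as TINYINT:

```sql
-- Lossless change (per the explanation above): widening TINYINT to INT
-- only updates the metadata, so rows are not rewritten and TiCDC is unaffected.
ALTER TABLE t MODIFY COLUMN small_id INT;

-- Lossy change: INT to VARCHAR forces a reorg that rewrites every row,
-- so TiCDC sees a large burst of row changes and lags.
ALTER TABLE t MODIFY COLUMN small_id VARCHAR(20);
```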
When you modify a table's metadata you specify the column type; if you specify the same type as before, nothing changes.
Could you please specify the type of modification, from what type to what type?
I think changing the field type without triggering data modification is unlikely to be achievable.
What is your actual scenario? There must be prerequisites, right? For example, if the original value is the string "abc" and you forcibly change the column type to int, that conversion obviously cannot succeed.
Changing a timestamp column to the datetime type. I suspect there is a bug with the timestamp type: when a large amount of data is deleted under highly concurrent transactions, it can lead to data and index inconsistencies.
There will be a delay, but replication will catch up and continue afterwards, right?
If the datetime column does not specify a fractional-seconds precision, the default is 0, which keeps only whole seconds. In practice that makes the adjustment lossy, so it seems unlikely that only the metadata would change. See the sketch below for keeping the precision aligned.
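A minimal sketch, assuming the original column is a TIMESTAMP(6) named `created_at` on a hypothetical table `t`; keeping the fractional-seconds precision explicit avoids the extra truncation mentioned above, though the conversion itself may still rewrite rows as discussed earlier in the thread:

```sql
-- Specify the same fractional-seconds precision when converting, so
-- sub-second values are not truncated to whole seconds
-- (DATETIME defaults to precision 0).
ALTER TABLE t MODIFY COLUMN created_at DATETIME(6);
```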
Changing the actual type moves the actual data, and CDC will lag. When I changed the type on a 10 GB table, the volume of data flowing into downstream Kafka surged.
You probably just haven't run into the problem yet. For a 10 GB table, a downstream Kafka that can absorb one to two hundred gigabytes will still cope; if you need to modify a really large table, the downstream Kafka can be overwhelmed.
Indeed, we haven’t encountered this issue. We are using downstream MySQL and haven’t had to modify data types yet. If what you say is true, then downstream MySQL might also be unable to synchronize.
If there is no business data being written at night, you can stop CDC first.
Personally, I feel the two time types could be converted losslessly with the right configuration, i.e. timestamp converted to datetime.
Apart from the different storage requirements between DATETIME and TIMESTAMP:
The TIMESTAMP data type converts date and time values to UTC (Coordinated Universal Time) for storage and converts them back to the current session time zone when queried.
The DATETIME data type stores the exact time inputted and returns the same time when queried, without any time zone conversion.
The initial (default) value for DATETIME and TIMESTAMP columns can be set to NULL, any timestamp within the valid range, or the current timestamp.
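A small sketch (with a hypothetical table `tz_demo`) illustrating the time zone behaviour described above:

```sql
CREATE TABLE tz_demo (ts TIMESTAMP, dt DATETIME);

SET time_zone = '+08:00';
INSERT INTO tz_demo VALUES ('2024-01-01 08:00:00', '2024-01-01 08:00:00');

-- Switch the session time zone: the TIMESTAMP value is converted
-- (now shown as '2024-01-01 00:00:00'), while the DATETIME value
-- is returned exactly as stored ('2024-01-01 08:00:00').
SET time_zone = '+00:00';
SELECT ts, dt FROM tz_demo;
```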
Wouldn't changing varchar(10) to varchar(5) be equivalent to replacing the column's current value with substr(value, 1, 5)? A sketch follows below.
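A hedged sketch, assuming a hypothetical table `t` with a column `c VARCHAR(10)` and strict SQL mode; it is worth checking for values that would be truncated before shrinking the column:

```sql
-- Count values that would not fit into VARCHAR(5).
SELECT COUNT(*) FROM t WHERE CHAR_LENGTH(c) > 5;

-- If the count is 0 the shrink loses nothing; otherwise the reorg has to
-- rewrite rows, and under strict SQL mode it typically fails with a
-- data-truncation error rather than silently applying SUBSTRING(c, 1, 5).
ALTER TABLE t MODIFY COLUMN c VARCHAR(5);
```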