How to Capture Data Changes within a Unit of Time in TiDB?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB如何获取单位时间内的数据变化? (How can TiDB capture data changes within a unit of time?)

| username: Kongdom

【TiDB Usage Environment】Testing
【TiDB Version】v5.4.3
【Reproduction Path】None
【Encountered Problem: Problem Phenomenon and Impact】
Currently considering using .NET to build a scheduling tool similar to TiCDC for data synchronization. Is there an open interface or system table for checking data changes within a unit of time?

| username: xfworld | Original post link

Previously, with TiDB Binlog, the binlog could be stored in file mode and then parsed and queried by other means.

With TiCDC, you need to save all the change information yourself and then query it according to your scenario. However, you also need to consider the data scale and growth rate… which is a bit challenging.

The simplest solution:

TiCDC ---> Kafka ---> Kafka client consumer ---> another TiDB, or Doris, StarRocks, …

You can also consider using Flink as a sink service, which lets the downstream connect to more kinds of services.

| username: Kongdom | Original post link

:sweat_smile: The original motivation was that we don't have enough servers, so I wanted to skip TiCDC and save one machine~

| username: xfworld | Original post link

Then just use snapshots, but you need to set a polling cycle. It won't be as accurate as incremental capture, though.

It's definitely not perfect.

| username: Kongdom | Original post link

Please elaborate~ :astonished:

| username: xfworld | Original post link

Read the data every minute, or every X minutes, use the read time as a version field, and then save the result.

However, this approach cannot capture rows that have been deleted… those can only be detected and handled through a compensation step.

It's nowhere near as direct as TiCDC…
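
A minimal sketch of that cycle in SQL, assuming a hypothetical source table src (primary key id) and a staging table snap; the literal timestamps stand in for the previous and current poll times:

    -- staging table: one full copy of src per polling cycle (all names hypothetical)
    CREATE TABLE snap (
      snap_ts DATETIME NOT NULL,
      id      BIGINT   NOT NULL,
      col1    VARCHAR(255),
      PRIMARY KEY (snap_ts, id)
    );

    -- every X minutes: materialize the current state, stamped with the poll time
    INSERT INTO snap SELECT NOW(), id, col1 FROM src;

    -- delete compensation: ids in the previous snapshot but not in the current one
    SELECT p.id
    FROM snap p
    LEFT JOIN snap c
      ON c.id = p.id AND c.snap_ts = '2023-07-01 10:05:00'
    WHERE p.snap_ts = '2023-07-01 10:00:00'
      AND c.id IS NULL;

New and changed rows come from the analogous comparison joining the current snapshot against the previous one; the cost is storing repeated full copies, which is exactly the scale and growth problem mentioned earlier.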

| username: Kongdom | Original post link

It shouldn't be that complicated. I just need to know whether the data has changed within the last X minutes; if it has, I perform an incremental sync. I don't need to know what it changed to. I see there is a flashback feature; I wonder if it can be used.

| username: xfworld | Original post link


Flashback is a data recovery feature provided by the TiDB database that allows the database to be restored to the state at a specific historical point in time. The flashback feature is somewhat related to GC (Garbage Collection).

In TiDB, GC is used to reclaim storage space occupied by deleted data. GC periodically scans the data in TiKV, marks the deleted data as “reclaimable,” and then actually reclaims this data during subsequent compaction processes. The GC process is irreversible; once data is reclaimed by GC, it cannot be restored.

The flashback feature, on the other hand, is used to recover deleted data: it can restore the database to a specific historical point in time, including deleted data. Note, however, that flashback can only reach data that GC has not yet reclaimed.

It is important to note that the flashback feature can only recover deleted data and cannot recover modified data. Therefore, when using the flashback feature, it is crucial to choose the correct point in time to avoid restoring to an incorrect state.

In summary, the flashback feature and GC are closely related. GC reclaims the storage space occupied by deleted data, while the flashback feature recovers deleted data. Flashback can only recover data that GC has not yet reclaimed, and it cannot recover modified data.
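
As a practical aside, how far back recovery can reach is bounded by the GC safe point; both the retention window and the current safe point can be checked directly:

    -- how long old MVCC versions are retained, and how far GC has advanced
    SELECT VARIABLE_NAME, VARIABLE_VALUE
    FROM mysql.tidb
    WHERE VARIABLE_NAME IN ('tikv_gc_life_time', 'tikv_gc_safe_point');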

| username: Kongdom | Original post link

You can use TiDB’s SELECT ... AS OF TIMESTAMP statement to retrieve data that has changed over a period of time. The specific steps are as follows:

  1. First, you need to enable TiDB’s flashback feature on the table. You can enable the flashback feature using the following command:

    ALTER TABLE table_name ENABLE FLASHBACK;
    
  2. Then, you can use the following syntax to query data that has changed over a period of time:

    SELECT * FROM table_name AS OF TIMESTAMP 'end_time' WHERE update_time >= 'start_time' AND update_time < 'end_time';
    

    Here, table_name is the name of the table you want to query, start_time and end_time are the boundaries of the time range, and update_time is the column that records each row's last update time. Reading the snapshot as of end_time ensures that rows updated inside the range are still visible even if they were changed again later.

    For example, if you want to query data that has changed in the table_name table between May 1, 2021, and May 31, 2021, you can use the following statement:

    SELECT * FROM table_name AS OF TIMESTAMP '2021-06-01 00:00:00' WHERE update_time >= '2021-05-01 00:00:00' AND update_time < '2021-06-01 00:00:00';
    

    This will return all rows that were updated within that time range.

More information about TiDB’s flashback feature can be found in [1], and detailed information about the SELECT ... AS OF TIMESTAMP statement can be found in [2].

Note: The flashback feature requires TiDB Binlog to be enabled in the TiDB cluster, so before using the flashback feature, please ensure that your TiDB cluster has TiDB Binlog enabled.
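
One more caveat: AS OF TIMESTAMP can only read back as far as the GC safe point, so the GC lifetime must be longer than the synchronization interval. A minimal sketch (tidb_gc_life_time is a system variable from v5.0 onward):

    -- retain 2 hours of history so a sync window of up to 1 hour stays readable
    SET GLOBAL tidb_gc_life_time = '2h';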

| username: xfworld | Original post link

Well, it relies on binlog, so machines still need to be added~ :upside_down_face:

| username: Kongdom | Original post link

:astonished: Huh? Is this Binlog the same thing as the TiDB Binlog peripheral tool?

| username: Kongdom | Original post link

I’ll let my colleague try it.

| username: xfworld | Original post link

Support on 5.x is not great, but give it a try.

| username: zhanggame1 | Original post link

I think adding an update_time timestamp column will do. Set the column to auto-update, and during synchronization you can scan within this timestamp range:

    update_time datetime DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
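
For example, a sketch against a hypothetical orders table (v5.4 needs one ALTER statement per change):

    -- add an auto-maintained modification timestamp, then index it for range scans
    ALTER TABLE orders ADD COLUMN update_time DATETIME NOT NULL
        DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;
    ALTER TABLE orders ADD INDEX idx_update_time (update_time);

    -- each sync cycle scans only the rows touched since the last watermark
    SELECT * FROM orders
    WHERE update_time >= '2023-07-01 10:00:00'   -- last sync time
      AND update_time <  '2023-07-01 10:05:00';  -- current sync time

Like the snapshot approach earlier in the thread, this still cannot see physically deleted rows.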

| username: zhanggame1 | Original post link

When is this document from? The step “ALTER TABLE table_name ENABLE FLASHBACK;” is not needed for “SELECT * FROM table_name AS OF TIMESTAMP”; you can query directly. And the MVCC feature behind it has nothing to do with TiDB Binlog.
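
A minimal illustration of the direct query, assuming the target timestamp is still within the GC lifetime:

    -- a stale read works out of the box: TiDB keeps old MVCC versions until GC reclaims them
    SELECT * FROM table_name AS OF TIMESTAMP '2021-05-31 23:59:59';

    -- or pin the whole session to a historical snapshot
    SET @@tidb_snapshot = '2021-05-31 23:59:59';
    SELECT * FROM table_name;
    SET @@tidb_snapshot = '';   -- back to reading current data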

| username: redgame | Original post link

This is not something you should save on. It will lead to endless trouble.

| username: Kongdom | Original post link

:joy: There are too many tables; almost every table would need this change, and it also involves upgrades at customer sites~ I can't imagine doing that.

| username: Kongdom | Original post link

Oh, is that so?

| username: ealam_小羽 | Original post link

If there is no physical deletion, can DataX be considered for data synchronization?

| username: Kongdom | Original post link

Currently, the main consideration is to integrate this into our own software, without using third-party software for management and scheduling.
Actually, I'm also curious how these tools detect data changes.