Check the oldest TSO in TiDB and use TiCDC to resynchronize to minimize data loss

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 查看tidb最老tso利用ticdc重新同步尽可能减少数据丢失

| username: danghuagood

[TiDB Usage Environment]
Production Environment
[TiDB Version]
v6.1.0
[Encountered Problem: Phenomenon and Impact]
Currently using ticdc to synchronize data to Kafka, where data in Kafka is only retained for 7 days by default. The business consumption program had an abnormal stop for more than 7 days, causing the messages in Kafka to be deleted. I want to find the oldest TSO currently retained in TiDB and use ticdc to re-capture from this oldest TSO to minimize data loss as much as possible.

I would like to know how to check the information of the oldest TSO currently retained in TiDB.

| username: Billmay表妹 | Original post link

In TiDB, you can obtain the information of the oldest retained TSO by querying TiDB’s status variables. The specific steps are as follows:

  1. Log in to the TiDB console or use a MySQL client to connect to TiDB.

  2. Execute the following SQL statement to query TiDB’s status variables:

    SHOW STATUS LIKE 'tikv_gc_safe_point';
    

    This SQL statement will return a status variable named tikv_gc_safe_point, which indicates the oldest TSO currently retained in the TiDB cluster.

    For example, the returned result might be as follows:

    +------------------+---------------------+
    | Variable_name    | Value               |
    +------------------+---------------------+
    | tikv_gc_safe_point | 425355555555555555 |
    +------------------+---------------------+
    

    In this example, the value of tikv_gc_safe_point is 425355555555555555, indicating the oldest TSO currently retained in the TiDB cluster.

  3. Note down the value of tikv_gc_safe_point as the starting TSO for re-fetching data.

    When using TiCDC to re-fetch data, you can specify the starting TSO using the --start-ts parameter. For example:

    ticdc cli changefeed update --start-ts=425355555555555555
    

    In this example, the --start-ts parameter specifies the starting TSO as 425355555555555555, and TiCDC will start re-fetching data from this TSO.

I hope the above information helps you solve your problem. If you need more assistance, please provide more detailed information, and I will do my best to help you.

| username: tidb狂热爱好者 | Original post link

My cousin is really dedicated. Impressive.

| username: wfxxh | Original post link

Or

| username: danghuagood | Original post link

No data :sob: :sob: :sob:

| username: xfworld | Original post link

SELECT * FROM mysql.tidb WHERE variable_name = ‘tikv_gc_safe_point’;

What about this?

If the GC time is not found, then there is a problem, and CDC won’t be able to select TSO…

Reference:

| username: ljluestc | Original post link

  1. First, connect to the TiDB console or use a MySQL client to execute the following command to check the current status of the TiDB cluster:
SHOW VARIABLES LIKE 'tikv_gc_life_time';

This command will display the value of tikv_gc_life_time, which represents the maximum garbage collection (GC) lifetime of data in TiKV (the distributed storage engine). By default, this value is 10m (10 minutes).

  1. Based on the value of tikv_gc_life_time obtained in the previous step, calculate the timestamp of the oldest TSO. Assuming the value of tikv_gc_life_time is 10m, the timestamp of the oldest TSO can be calculated as the current time minus 10 minutes. You can use the following command to get the current time:
SELECT NOW();
  1. Calculate the timestamp of the oldest TSO and find the data changes before this timestamp. You can use the TiCDC tool to resynchronize these data changes to Kafka. The specific steps are as follows:
    a. Install the TiCDC tool: Install the corresponding version of the TiCDC tool according to the TiDB version.
    b. Configure TiCDC: Edit the TiCDC configuration file (ticdc.toml) to specify the target for data synchronization, such as Kafka.
    c. Start TiCDC: Use the TiCDC tool to start the data synchronization task.
    d. Specify the start time: When starting the TiCDC data synchronization task, you can specify the start synchronization time through parameters, setting this time to the calculated timestamp of the oldest TSO. For example, assuming the timestamp of the oldest TSO is “2023-07-25 12:00:00”, you can use the following parameters when starting TiCDC:
ticdc cli changefeed create --start-ts=1679827200000000000 --sink-uri="kafka://your_kafka_broker"

This will start synchronizing data changes to Kafka from the specified TSO timestamp.

Please note that using the TiCDC tool may involve some complex configurations and operations, so make sure you have a sufficient understanding of TiCDC operations before executing them, and test in a test environment. Additionally, to prevent data loss, it is recommended to back up TiDB data before performing operations.