Is there a way to know which tables TiCDC ignored during synchronization?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 有没有办法知道TiCDC在同步时忽略了哪些表?

| username: ddhe9527

When creating a changefeed, if the tables in the upstream TiDB do not have primary keys or unique indexes, a prompt will ask whether to ignore them. After the changefeed is created, any subsequently created incompatible tables will not be synchronized to the downstream by CDC and will be directly ignored. Is there any way to find out which tables were ignored later, other than searching the CDC logs?

| username: danghuagood | Original post link

Check the changefeed configuration to confirm which tables are being synchronized, and you will know which other tables are being ignored.

cdc cli changefeed query --server=http://10.0.10.25:8300 --changefeed-id=simple-replication-task

| username: dba远航 | Original post link

Just directly query the upstream database to see which tables do not have primary keys and unique indexes.

| username: Soysauce520 | Original post link

  1. The upstream creation will not prompt whether to ignore it; it will be automatically ignored.
  2. Query the tables without primary keys and unique non-null indexes; those are the ones not synchronized.
  3. It is recommended to enable the primary key parameter so that table creation must have a primary key to succeed, which can avoid synchronization issues.
| username: porpoiselxj | Original post link

In some business scenarios, certain tables are indeed not suitable for creating primary keys and uniqueness (such as partitioned log tables). If synchronization is needed, you can enable the parameter force-replicate=true to support these tables.

| username: ddhe9527 | Original post link

Thank you. Regarding the filtering rules here, I would like to ask for some advice. I configured ., intending to replicate the mysql database to the downstream CDC. However, I found that creating tables or users in the mysql database cannot be synchronized downstream. What could be the reason for this?

In the CDC logs, it indicates that the creation of table ‘taa’ was discarded.

| username: ddhe9527 | Original post link

Thank you, I will try this parameter.

| username: ddhe9527 | Original post link

Directly querying tables without primary keys and unique indexes upstream makes it impossible to distinguish whether the tables were intentionally ignored when creating the changefeed or if they were automatically ignored after the changefeed started running.

| username: tidb菜鸟一只 | Original post link

So, if you compare it with the tables in your configuration file, aren’t these the tables that are automatically ignored? However, CDC does indeed prompt you with similar information when creating it to confirm. Whenever I see this, I always check if ignoring them is an issue. If there’s no problem, I confirm with ‘y’.

[WARN] some tables are not eligible to replicate,
v2.TableName{
v2.TableName{Schema:“tpcc”, Table:“history”, TableID:550, IsPartition:false},
v2.TableName{Schema:“citizencard-pay”, Table:“cc_merchant_bill”, TableID:827, IsPartition:false},
v2.TableName{Schema:“citizencard-pay”, Table:“cc_scan_pay”, TableID:831, IsPartition:false},
v2.TableName{Schema:“citizencard-biz”, Table:“cc_banking_outlets”, TableID:1260, IsPartition:false},
v2.TableName{Schema:“citizencard-biz”, Table:“cc_merchant_bill”, TableID:1272, IsPartition:false},
v2.TableName{Schema:“citizencard-biz”, Table:“cc_message_set_attr”, TableID:1278, IsPartition:false},
v2.TableName{Schema:“citizencard-biz”, Table:“cc_scan_pay”, TableID:1294, IsPartition:false},
v2.TableName{Schema:“citizencard-biz”, Table:“cc_t_robot_knowledge_20230216”, TableID:1300, IsPartition:false},
v2.TableName{Schema:“citizencard-biz”, Table:“cc_t_robot_knowledge_20230316”, TableID:1302, IsPartition:false},
v2.TableName{Schema:“citizencard-biz”, Table:“cc_banking_outlets_1228”, TableID:1622, IsPartition:false}}

| username: ddhe9527 | Original post link

Yes, this method is certainly feasible, but from the perspective of monitoring and operations, the risk is quite high. For example, last year you set up a remote disaster recovery using CDC, and after running for more than a year, you suddenly need to perform a disaster recovery switch. Then you find that many tables are missing in the TiDB backup cluster. It’s hard to say whether these tables were deliberately ignored by the DBA or were created by the application or development team over the past year. Moreover, the prompt information only appears once and then disappears, and the CDC logs might also be cleaned up.

If there were an API interface or table view to query or monitor these discarded tables, it could actually be avoided in advance.

| username: zhanggame1 | Original post link

The MySQL database will not be synchronized to the downstream. Refer to the documentation here, it does not include system tables.

| username: zhanggame1 | Original post link

By default, tables without primary keys, etc., will not be synchronized. You can consider: force-replicate = true

| username: dba-kit | Original post link

There is no direct way to display the table names, but you can use tiup cdc cli --server="" changefeed query -c changefeed-test | jq '.task_status' to get the synchronized table_ids, and then write a SHELL script to get the corresponding table names based on the table_id (select table_schema, table_name from information_schema.tables where tidb_table_id in (id1, id2,....);

| username: dba-kit | Original post link

If you want to handle it with a script, you can consider using the TiCDC Open API to get task information. The following two APIs should be enough:

| username: dba-kit | Original post link

This is a script code generated by Zhiyu Qingyan AI. After a quick look, it seems feasible. You can try it out :monkey:

To use pymysql to query a MySQL database, you first need to install this library (if you haven’t already):

pip install pymysql

Here is an example of a Python script that uses pymysql to execute a constructed SQL query and print the query results:

import pymysql
import requests
# Get the list of all synchronization sub-tasks
url = "http://127.0.0.1:8300/api/v2/processors"
response = requests.get(url)
data = response.json()
# Parse the information of all synchronization sub-tasks
processors = data.get("items", [])
# Store the list of table_ids corresponding to each changefeed_id
table_id_mapping = {}
for processor in processors:
    changefeed_id = processor.get("changefeed_id")
    capture_id = processor.get("capture_id")
    # Construct the URL to query detailed information
    detail_url = f"http://127.0.0.1:8300/api/v2/processors/{changefeed_id}/{capture_id}"
    detail_response = requests.get(detail_url)
    detail_data = detail_response.json()
    # Get the table_ids information of the specified synchronization sub-task
    table_ids = detail_data.get("table_ids", [])
    # Add table_ids to the corresponding changefeed_id list
    if changefeed_id not in table_id_mapping:
        table_id_mapping[changefeed_id] = {}
    for table_id in table_ids:
        if table_id not in table_id_mapping[changefeed_id]:
            table_id_mapping[changefeed_id][table_id] = []
        table_id_mapping[changefeed_id][table_id].append(capture_id)
# Construct SQL query
sql_query = ""
for changefeed_id, table_id_list in table_id_mapping.items():
    for table_id in table_id_list:
        if sql_query:
            sql_query += ","
        sql_query += f"'{table_id}'"
# Add SQL query
sql_query = f"SELECT table_schema, table_name FROM information_schema.tables WHERE tidb_table_id IN ({sql_query});"
# Print SQL query
print(sql_query)
# Connect to MySQL database
connection = pymysql.connect(host='localhost',
                             user='your_username',
                             password='your_password',
                             database='your_database',
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)
try:
    with connection.cursor() as cursor:
        # Execute SQL query
        cursor.execute(sql_query)
        # Get query results
        result = cursor.fetchall()
        for row in result:
            print(row)
finally:
    connection.close()

In this script, you need to replace your_username, your_password, your_database with your actual MySQL database login credentials and database name. Additionally, if the column name in your information_schema.tables table is not tidb_table_id, you need to adjust the SQL query accordingly.

The script first constructs an SQL query containing all table_ids, then executes this query and prints the query results. Note that this script uses a cursor type called DictCursor, which can convert each row in the result set into a Python dictionary, making it more convenient to access column attributes.

Please ensure that before running this script, you have correctly configured the MySQL database connection credentials, and that the database server is running and accessible.