Troubleshooting TiDB Binlog Lost DML Events Issue

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb binlog丢失DML事件问题排查

| username: realcp1018

[TiDB Usage Environment] Production Environment
[TiDB Version] 5.0.2
[Issue Symptoms] DML binlog events for certain tables in the cluster are missing.
[Issue Description]
Pump component configuration:

# WARNING: This file is auto-generated. Do not edit! All your modification will be overwritten!
# You can use 'tiup cluster edit-config' and 'tiup cluster reload' to update the configuration
# All configuration items you want to change can be added to:
# server_configs:
#   pump:
#     aa.b1.c3: value
#     aa.b2.c4: value
gc = 7

Drainer component configuration:

[syncer]
db-type = "file"
[syncer.to]
retention-time = 7

Configuration file used to print SQL with reparo v5.0.2:

data-dir = "/data/drainer-8249/data"
log-level = "debug"
dest-type = "print"
txn-batch = 20
worker-count = 16
safe-mode = false
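
With the config above, a minimal sketch of the parsing step (assuming the reparo binary from the tidb-binlog 5.0.2 toolkit and the config saved as reparo.toml; with dest-type = "print" the recovered SQL goes to standard output, though the exact log format may differ by version):

./reparo -config reparo.toml > reparo.out 2>&1
# count recovered DML statements; zero here while DDL still appears matches the symptom described below
grep -cE 'INSERT|UPDATE|DELETE' reparo.out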

To reduce interference, I removed the database and table filters, and then performed the following operations in the database:

  1. Created a test database and a table test.t, then inserted some data. The result: only the DDL statements appear in the output; all DML is missing (a minimal repro sketch follows this list).
  2. Created a table t in a database that normally does output DML events. Still, only the DDL statements appear; all DML is missing.
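
A minimal repro of step 1, assuming a TiDB endpoint on the default port (host, credentials, and values are illustrative):

mysql -h <tidb-host> -P 4000 -u root -e "
  CREATE DATABASE IF NOT EXISTS test;
  CREATE TABLE test.t (id INT PRIMARY KEY, v VARCHAR(32));
  INSERT INTO test.t VALUES (1, 'a'), (2, 'b');"
# expected in the reparo output: both CREATE statements, but (per the symptom) none of the INSERTs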

Under what circumstances would all DML events go missing? I checked the logs of the three Pump instances and found no ERROR-level entries. The drainer only logs INFO entries around DDL events, and otherwise just the regular “write save point” lines.

| username: realcp1018 | Original post link

Curiously, I found a table that does output DML binlog normally. Using the same client and the same MySQL driver to update the same table as another user, my updates produce no binlog while his program’s updates do; however, his updates run once per minute and are still occasionally lost. Next, I plan to reinstall Pump and see.

| username: 像风一样的男子 | Original post link

You can try having drainer output to files, then use Reparo to parse those binlog files into SQL and check whether any DML statements are indeed lost.
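
A quick sanity check before parsing, using the data-dir from the configs above (a sketch; drainer's file-output naming may vary by version):

ls -lh /data/drainer-8249/data
# if no new binlog files appear (or existing ones do not grow) after running DML,
# the events are being lost upstream of reparo, i.e. in TiDB/Pump/Drainer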

| username: realcp1018 | Original post link

This is entering the realm of the metaphysical: a DML statement sandwiched between two DDLs, and the DML simply vanishes.

| username: realcp1018 | Original post link

Reinstallation will have to wait for an off-peak business window; for now we are still looking for a lead. There may be some orphaned TiDB instances, possibly due to cluster configuration issues. The screenshot above shows binlog as enabled, but I don’t think that can be trusted.

| username: realcp1018 | Original post link

:joy: The issue has been found. We never located an orphaned TiDB, but we did find a TiDB instance that reports binlog as enabled while it is actually not. It is one of the three instances in the screenshot whose binlog.enable is true.

We tested all three TiDB instances. Two of them can generate binlogs normally, but one cannot. Is there a more direct way (such as logs) to prove that a specific TiDB is outputting binlogs?

In such extreme abnormal scenarios, the information_schema.cluster_config table is clearly not reliable either.
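
On the question of a more direct check, a sketch under assumptions (hostnames, port, and marker values are illustrative): the TiDB status port exposes an /info endpoint whose binlog_status field reflects the server's live binlog writer state, and a per-instance marker write is the most direct proof. Note that a later post in this thread reports the 10080 port also showed binlog as enabled on the broken instance, so the marker-write test is the more trustworthy of the two.

# 1) check the reported binlog writer state on each TiDB status port
for host in tidb1 tidb2 tidb3; do
  curl -s http://$host:10080/info | grep -o '"binlog_status"[^,]*'
done

# 2) write a unique marker through each instance, then look for it in the parsed output
i=0
for host in tidb1 tidb2 tidb3; do
  i=$((i+1))
  mysql -h "$host" -P 4000 -u root -e "INSERT INTO test.t VALUES ($((100 + i)), 'marker-$host')"
done
./reparo -config reparo.toml 2>&1 | grep 'marker-'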

| username: realcp1018 | Original post link

Supplement:
After investigating over the past few days, we found that another cluster, on version 6.5.1, has the same issue. Our usual procedure for enabling binlog is to install the Pump component first, then set the binlog.enable parameter on the TiDB instances, and finally reload TiDB.
Initially I suspected the Pump installation itself and thought TiDB needed a full restart after the reload, because the common symptom in both clusters is that binlog takes effect on only one TiDB instance, which could plausibly be caused by the rolling restart performed during the reload.
On further consideration, however, there is no logical flaw in reloading after modifying the TiDB configuration; I tested this on a 7.0 cluster, and binlog took effect on all TiDB instances after installation.
For now we suspect that certain versions of tiup/tidb/binlog have some deficiency during reload. To be safe, we now restart TiDB after the reload.
This bug cannot be detected by conventional means: both the TiDB 10080 port and the system views report binlog.enable as true, yet parsing the binlog files or Kafka messages shows that certain TiDB instances generate no binlog at all.
For now we can only check all binlog clusters manually, which significantly impacts synchronization and fault recovery.
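
A sketch of the safer sequence described above, assuming a cluster named tidb-test (the extra restart is this thread's workaround, not a documented requirement):

tiup cluster edit-config tidb-test       # set binlog.enable: true (and binlog.ignore-error) under server_configs -> tidb
tiup cluster reload tidb-test -R tidb    # push the config with a rolling reload of the TiDB role
tiup cluster restart tidb-test -R tidb   # extra restart of the TiDB role, to be safe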

| username: zhanggame1 | Original post link

Did you modify the binlog parameters in the following way?

Execute the edit-config command to modify the cluster configuration:
$ tiup cluster edit-config tidb-test
Under the tidb section in server_configs, add the following configuration:
binlog.enable: true
binlog.ignore-error: true

Then reload.

| username: realcp1018 | Original post link

Similar: we read meta.yaml, modify it, and overwrite the original content. After the reload, the configuration files on all TiDB instances show that the changes were synchronized.
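
For reference, a sketch of the relevant fragment of the edited meta.yaml (structure per tiup's cluster metadata; only the binlog keys are the point here):

server_configs:
  tidb:
    binlog.enable: true
    binlog.ignore-error: true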

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.