Restarting Drainer in TiDB 4.0 Makes the TiDB Cluster Unavailable

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB4.0版本restart drainer导致TiDB集群服务不可用

| username: 玩具果冻

[TiDB Usage Environment] Production Environment
[TiDB Version] 4.0.10
[Reproduction Path] Operations that led to the issue
I modified the value of initial-commit-ts in the file /home/tidb/.tiup/storage/cluster/clusters/**********/meta.yaml and then ran tiup cluster restart tidb-name --node ********:8249. The entire TiDB cluster became unavailable and the business reported errors.
[Encountered Issue: Symptoms and Impact]


After I ran tiup cluster restart tidb-name --node ********:8249, the prompt “Cluster will be unavailable” appeared. Why would restarting a single specified node make the whole cluster unavailable? I have often restarted individual nodes before without any issues, and the cluster stayed available each time. I don’t understand why it became unavailable this time and would appreciate advice from the experts. Thank you!
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]

| username: tidb菜鸟一只 | Original post link

Could the manual edit have broken the meta file? Check whether you get the same error when you run tiup cluster display <cluster-name>.

| username: 玩具果冻 | Original post link

No errors there; the cluster was normal after the restart. The service was unavailable only while the restart was in progress.

| username: Billmay表妹 | Original post link

Refer to this configuration. When reconfiguring, you need to clear the downstream checkpoint information and the imported data to avoid primary key conflicts.

| username: 玩具果冻 | Original post link

The downstream data has not been re-imported, and the downstream checkpoint table still holds the previous ts value. According to the official documentation, Drainer reads that value first.
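
The priority described above can be sketched in a few lines of shell. This is only an illustration of the rule as stated in this thread, not Drainer's actual code: pick_start_ts is a hypothetical helper, and the ts values 100 and 300 are made-up placeholders.

```shell
# Illustration only: a saved checkpoint takes priority over initial-commit-ts.
# pick_start_ts is a hypothetical helper, not part of Drainer or tiup.
pick_start_ts() {
  checkpoint_ts="$1"      # ts found in the downstream checkpoint table, empty if none
  initial_commit_ts="$2"  # value configured for the drainer
  if [ -n "$checkpoint_ts" ]; then
    # A checkpoint exists, so the configured initial-commit-ts is ignored.
    echo "$checkpoint_ts"
  else
    # No checkpoint recorded yet, so fall back to the configured value.
    echo "$initial_commit_ts"
  fi
}

pick_start_ts "" "300"     # no checkpoint recorded
pick_start_ts "100" "300"  # checkpoint recorded
```

If this rule holds for the deployment here, changing initial-commit-ts alone would not move the replication start point while the old checkpoint row is still present downstream.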

| username: WalterWj | Original post link

Editing the meta file by hand with vi is not recommended. Use tiup cluster edit-config instead, since it validates the format and runs other checks.
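
A minimal sketch of the edit-config workflow, assuming the cluster name tidb-name from the original post and a placeholder drainer address (drainer-host:8249 is hypothetical; the real host is masked in this thread). The script only prints the commands so they can be reviewed before being run by hand.

```shell
# Placeholders: substitute your own cluster name and drainer address.
CLUSTER_NAME="tidb-name"
DRAINER_NODE="drainer-host:8249"   # hypothetical; the real host is masked above

# Step 1: open the topology in an editor; tiup checks the file when it is saved.
edit_cmd="tiup cluster edit-config ${CLUSTER_NAME}"

# Step 2: push the changed configuration to the drainer node only.
reload_cmd="tiup cluster reload ${CLUSTER_NAME} -N ${DRAINER_NODE}"

# Print the plan instead of executing it.
echo "$edit_cmd"
echo "$reload_cmd"
```

Limiting the reload with -N keeps the operation scoped to the one node whose configuration changed.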

| username: 裤衩儿飞上天 | Original post link

Could you share the modified file?

| username: 玩具果冻 | Original post link

I only changed the value of initial-commit-ts. If there were a problem with this file, would it make the cluster unavailable? The cluster was normal after the restart, though. What I don’t understand is why the “Cluster will be unavailable” prompt appeared after I ran the restart command.

| username: 玩具果冻 | Original post link

Understood, I won’t edit it by hand again. Right now I just want to know why the cluster became unavailable, which is why I’m asking here. Otherwise I won’t feel confident about any future operations.

| username: 裤衩儿飞上天 | Original post link

What is the current status? Is the cluster available? Or did only that prompt appear, without actually affecting usage?

| username: 玩具果冻 | Original post link

While the drainer node was restarting, the cluster was unavailable for about 3 minutes. Once the restart succeeded, the cluster was fully normal and available again. My question is why restarting the drainer showed the “Cluster will be unavailable” prompt and left the cluster unavailable during the restart. This is puzzling because I have restarted other nodes many times before without affecting the cluster’s availability.

| username: 裤衩儿飞上天 | Original post link

If you don’t manually edit the meta file, does the cluster still become unavailable when you simply restart the drainer?
When it becomes unavailable, what error does the client report?
Are there any relevant logs?