What is the recommended approach for handling server reboots in a TiDB cluster?

Application environment:

Development

TiDB version:

Unsure

Reproduction method:

Problem:

I have a very old legacy TiDB cluster that we inherited, spread across multiple servers. I see that there are a few different services, such as PD, TiKV, and TiDB. We started the services on each server using systemctl start <service>.

In the event one of the servers has to be rebooted for maintenance, what is the correct order in which to start the TiDB services? Is it also recommended to use some kind of startup script on each server so we can automate this?

Resource allocation:

N/A

Attachment:

N/A

You are correct to be concerned about the order of starting TiDB services after a server reboot. The order is crucial for maintaining the integrity and consistency of your cluster. Here’s the recommended order, along with why it’s important:

  1. Placement Driver (PD): PD is the brain of the TiDB cluster. It stores cluster metadata, schedules data placement across TiKV nodes, and allocates transaction timestamps. It’s the foundation of the cluster, so it starts first.

  2. TiKV Server: TiKV is the distributed key-value storage engine used by TiDB. It stores the actual data. It needs to be started after PD is up and running so it can connect to PD and receive cluster information.

  3. Pump (only if TiDB Binlog is enabled): Pump collects TiDB’s binlog (binary log) and provides it to Drainer. It’s used for replication. Starting it after PD and TiKV, but before TiDB, ensures that TiDB can connect to Pump when it starts.

  4. TiDB Server: TiDB is the SQL layer of the TiDB ecosystem. It interacts with TiKV for data access and, when binlog is enabled, relies on Pump for binlog transmission. Starting it last ensures that all other components are up and running.
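On a multi-server deployment, each host usually runs only some of these components. The following is a minimal sketch of applying the order above per host: it walks the components in order and prints only the units that exist locally. The unit names (`pd-server`, `tikv-server`, `pump`, `tidb-server`) are assumptions taken from the names used in this thread; confirm the real names on your servers with `systemctl list-units --type=service`.

```shell
#!/bin/sh
# Components in dependency order. Unit names are assumptions; adjust them.
ORDERED_UNITS="pd-server tikv-server pump tidb-server"

# Print, in start order, the units that are installed on this host.
# `systemctl cat` exits non-zero when the unit does not exist.
local_units_in_order() {
    for unit in $ORDERED_UNITS; do
        if systemctl cat "$unit" >/dev/null 2>&1; then
            echo "$unit"
        fi
    done
}
```

Calling `local_units_in_order` on a host gives the list of units to start there, already in the right order.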

It is highly recommended to use a startup script on each server to automate the process. This ensures consistent and reliable restarts. Here’s a basic example:

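#!/bin/bash
# Unit names below are examples; confirm yours with
# "systemctl list-units --type=service".

# Stop at the first command that fails
set -e

# Start PD
systemctl start pd-server

# Give PD time to come up (a fixed sleep is only a rough heuristic)
sleep 10

# Start TiKV
systemctl start tikv-server

# Give TiKV time to come up
sleep 10

# Start Pump (only if TiDB Binlog is enabled)
systemctl start pump

# Give Pump time to come up
sleep 10

# Start TiDB
systemctl start tidb-server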

Remember to tailor this script to your specific needs, including:

  • Service names: Adjust the systemd unit names if they differ from the ones shown here; "systemctl list-units --type=service" lists the names in use on a server.
  • Additional services: Include commands to start any other TiDB services you have running on the server.
  • Error handling: Implement error checks and logging to ensure that the script runs smoothly and informs you if any service fails to start.

Provided the script is set to run at boot, this automated approach helps start your TiDB cluster services in the correct order even after unplanned server reboots, reducing downtime.
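As a sketch of the error-handling point above, the fixed sleeps can be replaced by polling systemd until each unit reports active, with a timeout. This is an illustration under assumptions, not a tested procedure for this cluster: the function names `wait_active` and `start_in_order` and the `WAIT_TIMEOUT` variable are hypothetical, and "active" in systemd only means the process is running, not that the component has finished joining the cluster.

```shell
#!/bin/bash
# Poll until the unit reports active, or give up after <timeout> seconds.
wait_active() {
    local unit="$1" timeout="${2:-60}" waited=0
    until systemctl is-active --quiet "$unit"; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "ERROR: $unit not active after ${timeout}s" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}

# Start each unit in the order given; stop at the first failure.
start_in_order() {
    local unit
    for unit in "$@"; do
        echo "Starting $unit"
        systemctl start "$unit" || return 1
        wait_active "$unit" "${WAIT_TIMEOUT:-60}" || return 1
    done
}
```

A call such as `start_in_order pd-server tikv-server pump tidb-server` would then start the units one at a time and return non-zero as soon as one fails to start or does not become active in time. The unit names in that call are the ones used in this thread and may differ on your servers.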

Reference: TiDB Binlog Tutorial | PingCAP Docs

This depends on the deployment method. Was this cluster deployed with “tiup cluster” or with the TiDB Operator for Kubernetes? Or, if it is really old, maybe with Ansible?

Hi,

This is a pretty old legacy cluster that won’t be around for much longer. It was originally deployed around 6 years ago using Ansible. I believe this was the repository used: GitHub - pingcap/tidb-ansible

We want to make sure that if a server the cluster runs on needs to be rebooted for maintenance, we can stay in, or get back to, a running state. We also want to determine whether using startup scripts to restart the services on reboot is a recommended approach.