Gracefully Shut Down a TiDB Server

Background

The TiDB server is a stateless node, but active database sessions may have ongoing transactions. If the server is restarted abruptly, those transactions fail, which lowers the transaction success rate and gives end users a bad experience.

Many applications connect to the database through a load balancer and a connection pool. Before shutting down a TiDB server, we need to make sure of two things:

  1. The load balancer stops sending new connections to this TiDB server.
  2. All existing connections on this TiDB server are closed.

The TiDB server has a parameter graceful-wait-before-shutdown (TiDB Configuration File | PingCAP Docs). If the value of this parameter is not 0, when the TiDB server receives a shutdown signal it waits for the specified number of seconds before shutting down. While waiting, the TiDB server stops responding to the load balancer's health checks, allowing the load balancer to route new connections to other TiDB server nodes.
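
For a TiUP-managed cluster, one way to set this parameter is through the cluster topology. The following is a sketch; the cluster name is the same placeholder used later in this article, and the reload scope should be adjusted to your environment:

$ tiup cluster edit-config {cluster_name}

# in the editor, under server_configs:
server_configs:
  tidb:
    graceful-wait-before-shutdown: 600

$ tiup cluster reload {cluster_name} -R tidb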

Meanwhile, most connection pools have a maxlifetime parameter, which specifies the maximum lifetime of a connection. When a connection has lived longer than maxlifetime, the pool closes it once it is inactive. maxlifetime is usually a relatively long period, e.g. 10 minutes, so that new application requests can keep reusing existing connections in the pool for better performance.
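
As an illustration of the application side, here is a minimal sketch using Go's database/sql pool with the go-sql-driver/mysql driver; the DSN, pool sizes, and lifetime below are placeholders:

package main

import (
    "database/sql"
    "time"

    _ "github.com/go-sql-driver/mysql"
)

func main() {
    // Placeholder DSN; point it at the load balancer, not a single TiDB node.
    db, err := sql.Open("mysql", "app:password@tcp(lb-host:4000)/test")
    if err != nil {
        panic(err)
    }
    defer db.Close()

    // Connections older than 10 minutes are closed lazily once they are idle,
    // so graceful-wait-before-shutdown should be at least this long for the
    // pool to drain during a graceful shutdown.
    db.SetConnMaxLifetime(10 * time.Minute)
    db.SetMaxOpenConns(50)
    db.SetMaxIdleConns(50)
}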

This means we often need to set TiDB's graceful-wait-before-shutdown to a value longer than the connection pool's maxlifetime, so that all connections have a chance to close and the impact of shutting down a TiDB server is minimized.

Issues

There are two issues we need to address before increasing graceful-wait-before-shutdown:

  1. TiUP deploys the TiDB server as a systemd service, and a systemd service has a default TimeoutStopSec of 90 seconds. If the TiDB server does not shut down within 90 seconds, systemd kills it (you can check the value currently in effect with the command shown after this list).
  2. The TiDB systemd unit file /etc/systemd/system/tidb-4000.service is maintained by TiUP. Although we can modify this file, certain TiUP operations regenerate it, removing all manual changes.
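
To see the value currently in effect for the service, run the following; on a default installation the output is typically the 90-second default, which systemd displays as 1min 30s:

$ systemctl show -p TimeoutStopSec tidb-4000
TimeoutStopSec=1min 30s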

Solution

The solution is to create an override file at /etc/systemd/system/tidb-4000.service.d/override.conf. Say we want to increase graceful-wait-before-shutdown to 600 seconds; then add the following content to override.conf, allowing 30 extra seconds of overhead for the process to exit:

[Service]
TimeoutStopSec=630
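
You can create the drop-in directory and file by hand, or let systemd create them for you with systemctl edit, which opens an editor on exactly this override file:

$ sudo systemctl edit tidb-4000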

Reload systemd:

$ sudo systemctl daemon-reload

Check systemd status:

$ systemctl status tidb-4000
● tidb-4000.service - tidb service
     Loaded: loaded (/etc/systemd/system/tidb-4000.service; enabled; preset: disabled)
    Drop-In: /etc/systemd/system/tidb-4000.service.d
             └─override.conf
     Active: active (running) since Sat 2024-06-22 15:18:26 UTC; 9min ago
   Main PID: 1777 (tidb-server)
      Tasks: 10 (limit: 18909)
     Memory: 357.9M
        CPU: 24.559s
     CGroup: /system.slice/tidb-4000.service
             └─1777 bin/tidb-server -P 4000 --status=10080 --host=0.0.0.0…

override.conf is loaded as a drop-in. We have now increased TimeoutStopSec to 630 seconds, and this override.conf will not be overwritten by TiUP, which manages only the main unit file.
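
To confirm the effective value (systemd displays 630 seconds as 10min 30s):

$ systemctl show -p TimeoutStopSec tidb-4000
TimeoutStopSec=10min 30s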

To gracefully shut down the TiDB server:

$ tiup cluster stop {cluster_name} -N {tidb_ip}:4000 --wait-timeout 630

Be aware that TiUP has a default wait timeout of 120 seconds. Use the --wait-timeout command-line option to set it to a value no shorter than TimeoutStopSec (630 seconds in this example), so that TiUP does not give up before the graceful shutdown completes.
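
The same consideration applies to other cluster operations that stop TiDB servers, for example restarting a single node. A sketch; check tiup cluster restart --help for the flags available in your TiUP version:

$ tiup cluster restart {cluster_name} -N {tidb_ip}:4000 --wait-timeout 630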

To recap why this approach works:

  1. Override file: The override.conf file placed in the /etc/systemd/system/tidb-4000.service.d/ directory lets us change TimeoutStopSec without TiUP overwriting the change. Drop-in overrides are the standard and recommended way to customize a systemd service whose main unit file is managed by another tool.

  2. TimeoutStopSec value: Setting TimeoutStopSec to 630 seconds gives the TiDB server the full 600-second grace period to finish remaining requests and close existing connections, plus 30 seconds of overhead, before systemd would kill the process. Both values should be aligned with the maxlifetime setting of the connection pool so that all connections can be closed before the server stops.

  3. TiUP --wait-timeout option: The tiup cluster stop command has a built-in --wait-timeout option. Setting it to a value no smaller than TimeoutStopSec (630 seconds here) allows TiUP to wait for the server to shut down gracefully instead of reporting a failure prematurely.
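
Putting the timeouts together, they should satisfy the following ordering (example values in parentheses):

maxlifetime (connection pool) ≤ graceful-wait-before-shutdown (TiDB, 600) < TimeoutStopSec (systemd, 630) ≤ --wait-timeout (TiUP, 630)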

Additional Considerations:

  • Connection Pool Configuration: Ensure that the connection pool's maxlifetime is set to a value shorter than graceful-wait-before-shutdown (and therefore shorter than TimeoutStopSec in the override file), so that all pooled connections are retired and closed before the TiDB server is stopped.

  • Load Balancer Configuration: Make sure the load balancer health-checks the TiDB status port (10080 in this deployment) and reacts quickly, so that no new connections are directed to the TiDB server once it enters the shutdown wait. This is crucial to prevent new transactions from being started on the server while it is shutting down.

  • Monitoring: Monitor the TiDB server's shutdown process closely. Use systemctl status tidb-4000 and the service log to confirm that the server shuts down gracefully within the specified timeout (see the commands after this list).

  • TiUP Updates: TiUP manages only the main unit file, not the drop-in directory, so the override should survive TiUP operations. Still, it is worth reviewing the systemd configuration after TiUP upgrades and re-applying the override file if it has been removed.
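
For the monitoring point above, the service log and unit status are the simplest tools (unit name as used throughout this example):

$ journalctl -u tidb-4000 -f
$ systemctl status tidb-4000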

By implementing the solution and considering these additional points, you can ensure a graceful shutdown of your TiDB server, minimizing impact on your application and end users.

See also: Timeouts in TiDB | PingCAP Docs

Note that TiProxy can also help reduce the impact on applications, because it can migrate client connections to other TiDB servers when one is shut down.