PD went down at night, reason unknown

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: pd 在夜间down了,不知道什么原因

| username: 突破边界

[Test Environment for TiDB] Testing
[TiDB Version] 7.5.0
[Reproduction Path] PD crashed around 1 AM for unknown reasons. The PD logs are as follows:

[2024/05/08 01:27:00.398 +08:00] [INFO] [grpc_service.go:1948] ["update service GC safe point"] [service-id=gc_worker] [expire-at=-9223372035139672989] [safepoint=449603756461391872]
[2024/05/08 01:28:40.520 +08:00] [INFO] [grpc_service.go:1893] ["updated gc safe point"] [safe-point=449603756461391872]
[2024/05/08 01:37:00.396 +08:00] [INFO] [grpc_service.go:1948] ["update service GC safe point"] [service-id=gc_worker] [expire-at=-9223372035139672389] [safepoint=449603913747791872]
[2024/05/08 01:38:37.465 +08:00] [INFO] [lease.go:187] ["stop lease keep alive worker"] [purpose="leader election"]
[2024/05/08 01:38:37.466 +08:00] [INFO] [allocator_manager.go:772] ["exit allocator daemon"] []
[2024/05/08 01:38:37.466 +08:00] [INFO] [coordinator.go:160] ["patrol regions has been stopped"]
[2024/05/08 01:38:37.466 +08:00] [INFO] [coordinator.go:344] ["drive slow node scheduler is stopped"]
[2024/05/08 01:38:37.466 +08:00] [INFO] [coordinator.go:326] ["drive push operator has been stopped"]
[2024/05/08 01:38:37.466 +08:00] [INFO] [allocator_manager.go:316] ["exit allocator loop"] []
[2024/05/08 01:38:37.466 +08:00] [INFO] [scheduler_controller.go:364] ["scheduler has been stopped"] [scheduler-name=balance-hot-region-scheduler] [error="context canceled"]
[2024/05/08 01:38:37.466 +08:00] [INFO] [coordinator.go:374] ["coordinator is stopping"]
[2024/05/08 01:38:37.466 +08:00] [INFO] [scheduler_controller.go:364] ["scheduler has been stopped"] [scheduler-name=balance-leader-scheduler] [error="context canceled"]
[2024/05/08 01:38:37.466 +08:00] [INFO] [main.go:284] ["got signal to exit"] [signal=hangup]
[2024/05/08 01:38:37.466 +08:00] [INFO] [server.go:127] ["region syncer has been stopped"]
[2024/05/08 01:38:37.466 +08:00] [INFO] [scheduler_controller.go:364] ["scheduler has been stopped"] [scheduler-name=transfer-witness-leader-scheduler] [error="context canceled"]

The subsequent logs are all about various modules being stopped. Could it be related to these log messages?

  • stop lease keep alive worker
  • drive slow node scheduler is stopped
  • drive push operator has been stopped
[Encountered Problem: Phenomenon and Impact] PD went down in the early morning hours. How should I investigate the cause further? This doesn’t seem to be the first time it has happened.
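
For context on what the tail of this log means: the decisive line is ["got signal to exit"] [signal=hangup], i.e. the process caught a SIGHUP from outside and then shut its modules down in an orderly way. Below is a minimal sketch of the standard Go signal-handling pattern that produces this kind of shutdown sequence; it is my own illustration under that assumption, not PD's actual code.

```go
// Minimal sketch (not PD's actual code) of a Go server that logs and exits
// when it receives a signal, producing output like ["got signal to exit"].
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	sc := make(chan os.Signal, 1)
	// SIGHUP is delivered when the controlling terminal goes away or when
	// another process runs `kill -HUP <pid>`.
	signal.Notify(sc, syscall.SIGHUP, syscall.SIGINT, syscall.SIGTERM)

	// ... start schedulers, lease keep-alive workers, and so on ...

	sig := <-sc // block until a signal arrives
	log.Printf("got signal to exit, signal=%v", sig)

	// Graceful teardown follows; each module logs its own
	// "... has been stopped" line, as in the PD log above.
	os.Exit(0)
}
```

If that pattern holds, the question is not why PD crashed but what delivered a SIGHUP to it at 01:38.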

| username: Kongdom | Original post link

:thinking: Could it be a scheduled task at 1 AM?

| username: 突破边界 | Original post link

Neither the database itself nor the business side has any scheduled jobs.

| username: DBAER | Original post link

It seems necessary to look at the code. Normally, PD is very stable. Is your network functioning properly?

| username: 呢莫不爱吃鱼 | Original post link

The current information doesn’t reveal the cause. Can you provide more details?

| username: Ming | Original post link

Check if there is any information output in the pd_stderr.log file.

| username: 小龙虾爱大龙虾 | Original post link

This means that PD received an exit signal from the operating system and exited: the log shows [signal=hangup], i.e. SIGHUP. Check the operating system logs to see whether the oom-killer was triggered.
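
One caveat, as far as I know: the oom-killer kills processes with SIGKILL, which a process cannot catch or log, while the log above records a hangup. Still, ruling out OOM is cheap. Here is a hypothetical helper for the check (the dmesg -T flag and the matched phrases are standard on Linux, but adjust for your distribution):

```go
// oomscan.go — hypothetical illustration: scan the kernel ring buffer for
// oom-killer activity instead of eyeballing the whole dmesg output.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	out, err := exec.Command("dmesg", "-T").CombinedOutput()
	if err != nil {
		fmt.Println("dmesg failed (try running as root):", err)
		return
	}
	for _, line := range strings.Split(string(out), "\n") {
		// Typical oom-killer entries mention "Out of memory" or "oom-killer".
		if strings.Contains(line, "oom-killer") || strings.Contains(line, "Out of memory") {
			fmt.Println(line)
		}
	}
}
```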

| username: TiDBer_QYr0vohO | Original post link

Check the resource monitoring of the server during that period.

| username: Kongdom | Original post link

Is it a virtual machine environment? We once hit an issue where other virtual machines on the same host ran backups at night, pulling resources toward themselves and leaving the VM running TiDB starved.

| username: changpeng75 | Original post link

Could it be triggered by GC?

| username: 突破边界 | Original post link

I checked, and there doesn’t seem to be any oom-killer activity.

| username: 突破边界 | Original post link

It’s not a virtual machine. The machine’s performance is quite good, with 160 GB of memory and 64 cores.

| username: 突破边界 | Original post link

The resource usage from that period probably can’t be viewed anymore. My server configuration is quite good, with 160 GB of memory and 64 cores.

| username: 突破边界 | Original post link

I installed it on a single machine, so the network shouldn’t be a factor.

| username: 突破边界 | Original post link

There are no errors in pd_stderr.log

| username: 随缘天空 | Original post link

Check the dashboard for any slow SQL queries around 1 AM, and search the relevant log module for error-level log information around that time.

| username: DBAER | Original post link

Did you deploy this on a playground? It’s possible that multiple components are competing for resources on one machine.

| username: tidb菜鸟一只 | Original post link

You have placed multiple components on one machine, right? Check the resource usage of other components.

| username: 小龙虾爱大龙虾 | Original post link

Has the operating system rotated the logs?
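
Log rotation is worth checking from another angle as well: logrotate jobs usually run from cron in the early morning, and a postrotate hook that broadcasts kill -HUP would match the [signal=hangup] in the PD log. A hypothetical sketch to surface such hooks (the paths are assumptions about a typical Linux layout):

```go
// rotatescan.go — hypothetical illustration: list logrotate rules whose
// postrotate scripts send signals, since those run at fixed times from cron.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	paths := []string{"/etc/logrotate.conf"}
	if entries, err := os.ReadDir("/etc/logrotate.d"); err == nil {
		for _, e := range entries {
			paths = append(paths, filepath.Join("/etc/logrotate.d", e.Name()))
		}
	}
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			continue // missing or unreadable; skip
		}
		text := string(data)
		if strings.Contains(text, "postrotate") &&
			(strings.Contains(text, "HUP") || strings.Contains(text, "kill")) {
			fmt.Printf("check %s:\n%s\n", p, text)
		}
	}
}
```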

| username: TIDB-Learner | Original post link

To summarize the original poster’s situation: a single-machine, mixed-deployment TiDB test cluster in which PD goes down at a fixed time around 1 AM.

  1. There are no batch processes.
  2. In the test environment, there is no significant data processing.
  3. Checking the system logs did not reveal any abnormal information, and the info logs in the screenshot show nothing special either.

Questions:

  • Are there any other systems deployed besides TiDB?
  • Does TiDB have any special configurations, such as resource control?
  • Additionally, check for any special scheduled tasks using crontab -e (see the sketch after this list for the system-wide locations).

If the issue occurs at a fixed time, it is generally caused by a manual configuration.
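
Since crontab -e only shows the invoking user’s crontab, a sketch like the following (paths assumed for a typical Linux host, purely illustrative) dumps the system-wide cron definitions too, so anything scheduled around 01:30 is easy to spot:

```go
// cronscan.go — hypothetical helper that prints system-wide cron definitions;
// per-user crontabs still need `crontab -l` for each account.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func dump(path string) {
	data, err := os.ReadFile(path)
	if err != nil {
		return // missing or unreadable; skip
	}
	fmt.Printf("===== %s =====\n%s\n", path, data)
}

func main() {
	dump("/etc/crontab")
	for _, dir := range []string{"/etc/cron.d", "/etc/cron.hourly", "/etc/cron.daily"} {
		entries, err := os.ReadDir(dir)
		if err != nil {
			continue
		}
		for _, e := range entries {
			dump(filepath.Join(dir, e.Name()))
		}
	}
}
```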