Note:
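This topic has been translated from a Chinese forum by GPT and might contain errors. Original topic: 告警里判断tidb或pd等节点是否重启的逻辑可以优化一下 (The logic used in alerts to decide whether TiDB, PD, or other nodes have restarted could be improved)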

【TiDB Usage Environment】Production Environment / Testing / PoC
【TiDB Version】5.3.3
【Encountered Problem: Phenomenon and Impact】
For the past two days the production system has been raising unexplained alerts: both PD and TiDB report node restart alerts.
func (s ProcStat) StartTime() (float64, error) {
	fs := FS{proc: s.proc}
	stat, err := fs.Stat()
	if err != nil {
		return 0, err
	}
	return float64(stat.BootTime) + (float64(s.Starttime) / userHZ), nil
}
Looking at the code, the process start time is computed as the system boot time (stat.BootTime) plus the process's start offset, which the kernel records in clock ticks since boot (s.Starttime) and which is converted to seconds by dividing by userHZ.
0 */6 * * * /usr/sbin/ntpdate ntp.cloud.aliyuncs.com ntp7.cloud.aliyuncs.com > /dev/null 2>&1
The cron job runs ntpdate every 6 hours. At 18:00 it stepped the system time forward by 1 second, which shifts the reported boot time by 1 second but does not change the process's start tick count.
This causes the value returned by the StartTime function to increase by 1 second, which is what the Grafana graph shows:
The official rule for PD or TiDB restart is:
I suggest changing this logic so that the alert fires only when the start time moves forward by at least 10 seconds, which filters out small clock adjustments like this one:
process_start_time_seconds{job="tidb"} - (process_start_time_seconds{job="tidb"} offset 1m) >= 10
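The comparison this expression performs can be sketched in Go as follows. This only illustrates the threshold idea and is not how Prometheus evaluates rules; `looksLikeRestart`, `restartThreshold`, and the sample timestamps are hypothetical names and values.

```go
package main

import "fmt"

// restartThreshold is the minimum jump in start time, in seconds, that is
// treated as a restart (the ">= 10" in the proposed expression).
const restartThreshold = 10.0

// looksLikeRestart compares the current start time with the value seen one
// minute earlier, as the expression does with "offset 1m".
func looksLikeRestart(current, oneMinuteAgo float64) bool {
	return current-oneMinuteAgo >= restartThreshold
}

func main() {
	before := 1700003600.0 // hypothetical start time before the clock step

	fmt.Println(looksLikeRestart(before+1, before))    // 1 s clock step: false
	fmt.Println(looksLikeRestart(before+7200, before)) // restart 2 h later: true
}
```

Note that the difference stays non-zero only while the `offset 1m` sample still holds the old start time, so the expression is true for roughly one minute after a real restart. A backward clock step produces a negative difference and is ignored as well.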