The logic for determining whether TiDB or PD nodes have restarted in the alerts can be optimized

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 告警里判断tidb或pd等节点是否重启的逻辑可以优化一下

| username: Csong

【TiDB Usage Environment】Production Environment
【TiDB Version】5.3.3
【Encountered Problem: Phenomenon and Impact】
The production system has been raising unexplained alerts for the past two days, with both PD and TiDB reporting node restart alerts.

image

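// StartTime returns the process start time as a Unix timestamp in seconds.
func (s ProcStat) StartTime() (float64, error) {
    fs := FS{proc: s.proc}
    // Stat reads /proc/stat, which includes the system boot time (btime).
    stat, err := fs.Stat()
    if err != nil {
        return 0, err
    }
    // Starttime is in clock ticks after boot; dividing by userHZ converts it to seconds.
    return float64(stat.BootTime) + (float64(s.Starttime) / userHZ), nil
}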

Looking at the code, the value is the system boot time plus the process's start offset, which the kernel records in clock ticks since boot and the code converts to seconds by dividing by userHZ. The result is the timestamp at which the process started.

0 */6 * * * /usr/sbin/ntpdate ntp.cloud.aliyuncs.com ntp7.cloud.aliyuncs.com > /dev/null 2>&1

ntpdate runs every 6 hours. At 18:00 it forcibly set the system time, moving it forward by 1 second. That shifts the reported system boot time by the same amount, but it does not change the clock ticks recorded for the process.

This causes the value returned by the StartTime function to increase by 1 second, which matches the step seen in the Grafana graph.

The official rule for PD or TiDB restart is:

image

We need to modify the logic here so that the alert fires only when the start time changes by 10 seconds or more, which filters out these small clock corrections:

process_start_time_seconds{job="tidb"} - (process_start_time_seconds{job="tidb"} offset 1m) >= 10
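
The intent of that expression can be sketched in Go as a simple threshold check. This is only an illustration of the proposed logic, not code from TiDB or Prometheus; the function isRestart, the constant restartThreshold, and the sample timestamps are hypothetical.

```go
package main

import "fmt"

// restartThreshold is the minimum forward jump, in seconds, that is
// treated as a real restart. The value 10 follows the proposal above.
const restartThreshold = 10.0

// isRestart is a hypothetical helper: it compares the start time seen
// one scrape window ago with the current one and reports a restart
// only when the difference reaches the threshold.
func isRestart(prevStart, curStart float64) bool {
	return curStart-prevStart >= restartThreshold
}

func main() {
	// A 1s clock step from ntpdate: should not alert.
	fmt.Println(isRestart(1700001234.56, 1700001235.56)) // false

	// The process really restarted about an hour later: should alert.
	fmt.Println(isRestart(1700001234.56, 1700005000.00)) // true
}
```

One trade-off of this approach is that a clock correction of 10 seconds or more would still be reported as a restart, so the threshold needs to be larger than the biggest step the time sync is expected to make.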
| username: Csong | Original post link

Personally, I think this part needs some optimization, just a small detail.

| username: ohammer | Original post link

Professional, starting directly with source code level troubleshooting.

| username: 会飞的土拨鼠 | Original post link

I haven’t looked at this source code; the expert’s analysis is professional.

| username: chaojiwudidashuaige | Original post link

Your analysis is well-founded. Thumbs up.

| username: Csong | Original post link

Replying earns points.

| username: Csong | Original post link

Indeed, you need to look carefully; just looking at the uptime won’t reveal the issue.

| username: Csong | Original post link

Actually, that code is from Prometheus.

| username: Csong | Original post link

:smile: Haha, thanks a lot.

| username: Hacker_Yv76YjBL | Original post link

Professional, the analysis by the expert is well-founded.

| username: tidb菜鸟一只 | Original post link

Got it.

| username: Billmay表妹 | Original post link

If this is a product requirement, please provide feedback according to the following requirements:

[Problem Scenario Involved in the Requirement]

[Expected Requirement Behavior]

[Alternative Solutions for the Requirement]

[Background Information]
Such as which users will benefit from it, and some usage scenarios. Any API design, models, or diagrams would be more helpful.

| username: ddhe9527 | Original post link

Is there a 1-second time jump every 6 hours when synchronizing time? That clock is too inaccurate!

| username: Csong | Original post link

It happens a few times, but not frequently.