The logic for determining whether TiDB or PD nodes have restarted in the alerts can be optimized

translator_bot · June 22, 2024, 8:59pm

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 告警里判断tidb或pd等节点是否重启的逻辑可以优化一下

| username: Csong

【TiDB Usage Environment】Production Environment
【TiDB Version】5.3.3
【Encountered Problem: Phenomenon and Impact】
For the past two days the production system has been raising alerts for no apparent reason, with both PD and TiDB reporting node restart alerts.

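// StartTime returns the process start time as a Unix timestamp in seconds.
func (s ProcStat) StartTime() (float64, error) {
    fs := FS{proc: s.proc}
    stat, err := fs.Stat()
    if err != nil {
        return 0, err
    }
    // System boot time plus the process start offset, converted from clock ticks to seconds.
    return float64(stat.BootTime) + (float64(s.Starttime) / userHZ), nil
}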

Looking at the code, the process start time is computed by adding the system boot time (BootTime) to the process's start offset, that is, the number of clock ticks since boot at which the process started (Starttime) divided by userHZ.

0 */6 * * * /usr/sbin/ntpdate ntp.cloud.aliyuncs.com ntp7.cloud.aliyuncs.com > /dev/null 2>&1

ntpdate runs every 6 hours. At 18:00 it forcibly stepped the system time forward by 1 second, but the process's start tick count was not modified.

This causes the value returned by the StartTime function to increase by 1 second, which is what the Grafana graph shows.
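To make the effect concrete, here is a minimal sketch of the same arithmetic. The helper name startTime, the userHZ value of 100, and all the timestamps are assumptions made up for illustration; they are not taken from the original post or from the Prometheus code.

```go
package main

import "fmt"

// userHZ is the number of clock ticks per second. 100 is the usual Linux
// value, but it is an assumption here.
const userHZ = 100

// startTime mirrors the arithmetic in StartTime: boot time (Unix seconds)
// plus the process start offset, converted from ticks to seconds.
func startTime(bootTime uint64, startTicks uint64) float64 {
	return float64(bootTime) + float64(startTicks)/userHZ
}

func main() {
	const startTicks = 360000 // hypothetical: process started 3600 s after boot

	before := startTime(1700000000, startTicks) // boot time before the clock step
	after := startTime(1700000001, startTicks)  // boot time after a +1 s clock step

	// The tick count is identical in both calls, so the whole difference
	// comes from the shifted boot time.
	fmt.Printf("before=%.0f after=%.0f delta=%.0f\n", before, after, after-before)
}
```

The process never restarted, yet the computed start time moves by exactly the size of the clock step, which is enough to trip any rule that fires on any change of this value.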

The official rule for PD or TiDB restart is:

The logic here should be changed so that the alert fires only when the start time has moved by at least 10 seconds, which filters out small clock adjustments:

process_start_time_seconds{job="tidb"} - (process_start_time_seconds{job="tidb"} offset 1m) >= 10
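The comparison in that expression can be sketched in Go as follows. The function name restarted and the sample values are hypothetical and only illustrate the threshold; this is not how Prometheus evaluates the rule internally.

```go
package main

import "fmt"

// restarted reports whether the start time moved forward by at least
// minJump seconds between two samples, mirroring the ">= 10" comparison.
func restarted(prev, curr, minJump float64) bool {
	return curr-prev >= minJump
}

func main() {
	const minJump = 10.0 // threshold in seconds, as proposed in the post

	// A 1 s clock step: start time shifts, but the process did not restart.
	fmt.Println(restarted(1700003600, 1700003601, minJump))

	// A real restart an hour later: start time jumps by 3600 s.
	fmt.Println(restarted(1700003600, 1700007200, minJump))
}
```

With a 10-second threshold the 1-second step is ignored while a genuine restart still triggers. The trade-off is that a clock step of 10 seconds or more would still look like a restart.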

translator_bot · June 22, 2024, 8:59pm

| username: Csong | Original post link

Personally, I think this part needs some optimization, just a small detail.

translator_bot · June 22, 2024, 8:59pm

| username: ohammer | Original post link

Very professional, going straight to source-level troubleshooting.

translator_bot · June 22, 2024, 8:59pm

| username: 会飞的土拨鼠 | Original post link

I haven’t looked at this source code; the expert’s analysis is professional.

translator_bot · June 22, 2024, 8:59pm

| username: chaojiwudidashuaige | Original post link

Your analysis is well-founded. Thumbs up.

translator_bot · June 22, 2024, 8:59pm

| username: Csong | Original post link

Replying earns points.

translator_bot · June 22, 2024, 8:59pm

| username: Csong | Original post link

Indeed, you need to look carefully; just looking at the uptime won’t reveal the issue.

translator_bot · June 22, 2024, 8:59pm

| username: Csong | Original post link

Actually, this code is from Prometheus.

translator_bot · June 22, 2024, 8:59pm

| username: Csong | Original post link

Haha, thanks a lot.

translator_bot · June 22, 2024, 8:59pm

| username: Hacker_Yv76YjBL | Original post link

Professional. The expert's analysis is well-founded.

translator_bot · June 22, 2024, 8:59pm

| username: tidb菜鸟一只 | Original post link

Got it.

translator_bot · June 22, 2024, 8:59pm

| username: Billmay表妹 | Original post link

If this is a product requirement, please provide feedback according to the following requirements:

[Problem Scenario Involved in the Requirement]

[Expected Requirement Behavior]

[Alternative Solutions for the Requirement]

[Background Information]
Such as which users will benefit from it, and some usage scenarios. Any API design, models, or diagrams would be more helpful.

translator_bot · June 22, 2024, 8:59pm

| username: ddhe9527 | Original post link

Is there a 1-second time jump every 6 hours when the time is synchronized? That clock is far too inaccurate!

translator_bot · June 22, 2024, 8:59pm

| username: Csong | Original post link

It happens occasionally, but not often.