tidb-server Restarts Intermittently

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb-server不定时重启

| username: EDG-给我冲

【TiDB Usage Environment】Production Environment
【TiDB Version】v6.1.0
【Encountered Problem: Phenomenon and Impact】
Multiple tidb-server nodes restart automatically at irregular intervals.
【Known Information】
Around the time of the failure, there were no anomalies in CPU, memory, or load on the server.
Below are excerpts from tidb.log and tidb_stderr.log from just before the restart.
【Attachments: Screenshots/Logs/Monitoring】


a.log (66.1 KB)
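One way to narrow down a log like tidb_stderr.log is to search it for the messages the Go runtime prints when a process dies. The sketch below assumes the crash was written to stderr as a line starting with `panic:` or `fatal error:`, which are the usual prefixes for Go crashes. The function name and the number of context lines are illustrative and do not come from the thread.

```python
from typing import List

# Prefixes the Go runtime prints at the start of a crash report.
# Assumption: the tidb-server crash (if any) was written to stderr in this form.
CRASH_PREFIXES = ("panic:", "fatal error:")


def find_crash_reports(stderr_text: str, context_lines: int = 5) -> List[List[str]]:
    """Return each crash marker line together with the lines that follow it."""
    lines = stderr_text.splitlines()
    reports = []
    i = 0
    while i < len(lines):
        if lines[i].startswith(CRASH_PREFIXES):
            # Keep the marker plus a few following lines (goroutine header, first frames).
            reports.append(lines[i:i + 1 + context_lines])
            i += 1 + context_lines
        else:
            i += 1
    return reports
```

To use it, read the file and print each report, for example `find_crash_reports(open("tidb_stderr.log").read(), context_lines=20)`. If the result is empty, the process may have exited without a runtime crash report, and tidb.log is the next place to look.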

| username: h5n1 | Original post link

In TiDB Dashboard, check whether Continuous Profiling is enabled (under Advanced Debugging).

| username: EDG-给我冲 | Original post link

This feature is not enabled.

| username: h5n1 | Original post link

Check TiDB’s memory monitoring to see if it has OOMed, and also check dmesg.
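The dmesg check can be sketched as a small filter over the kernel log text. The patterns below are assumptions based on common Linux OOM-killer messages; the exact wording differs between kernel versions, so an empty result is a hint rather than proof that no OOM kill happened.

```python
import re
from typing import List

# Phrases the Linux kernel commonly logs when the OOM killer acts.
# Assumption: wording varies by kernel version, so this list may be incomplete.
OOM_PATTERN = re.compile(
    r"out of memory|invoked oom-killer|oom-kill|killed process \d+",
    re.IGNORECASE,
)


def find_oom_lines(kernel_log: str) -> List[str]:
    """Return the kernel log lines that look like OOM-killer activity."""
    return [line for line in kernel_log.splitlines() if OOM_PATTERN.search(line)]
```

The input can come from `dmesg -T`, for example `subprocess.run(["dmesg", "-T"], capture_output=True, text=True).stdout`, and any matching line that names tidb-server is worth comparing against the restart times.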

| username: h5n1 | Original post link

It feels like this bug.

| username: EDG-给我冲 | Original post link

dmesg shows no recent OOM, and /var/log/messages has the following records from that time:
Apr 28 08:02:19 TIDB_2 auditd[599]: Audit daemon rotating log files
Apr 28 08:10:01 TIDB_2 systemd: Started Session 50725 of user root.
Apr 28 08:20:01 TIDB_2 systemd: Started Session 50726 of user root.
Apr 28 08:30:01 TIDB_2 systemd: Started Session 50727 of user root.
Apr 28 08:36:42 TIDB_2 systemd: tidb-4000.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 28 08:36:42 TIDB_2 systemd: Unit tidb-4000.service entered failed state.
Apr 28 08:36:42 TIDB_2 systemd: tidb-4000.service failed.
Apr 28 08:36:57 TIDB_2 systemd: tidb-4000.service holdoff time over, scheduling restart.
Apr 28 08:36:57 TIDB_2 systemd: Stopped tidb service.
Apr 28 08:36:57 TIDB_2 systemd: Started tidb service.
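To line up restarts with application events, the exit records can be pulled out of the syslog text. The sketch below assumes the line format shown in this post. The regular expression and function name are illustrative. Note that `INVALIDARGUMENT` is only the name systemd prints for exit status 2. A Go program such as tidb-server also exits with status 2 after an unrecovered panic or a fatal runtime error, so this record by itself does not point to a bad command-line argument.

```python
import re
from typing import Dict, List

# Matches systemd lines such as:
#   Apr 28 08:36:42 TIDB_2 systemd: tidb-4000.service: main process exited, code=exited, status=2/INVALIDARGUMENT
# Assumption: the syslog format is the one shown in the post above.
EXIT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+systemd(?:\[\d+\])?:\s+"
    r"(?P<unit>\S+\.service): main process exited, "
    r"code=(?P<code>\w+), status=(?P<status>\d+)(?:/(?P<label>\w+))?"
)


def find_service_exits(syslog_text: str, unit: str = "tidb-4000.service") -> List[Dict[str, str]]:
    """Return one dict per 'main process exited' record for the given unit."""
    exits = []
    for line in syslog_text.splitlines():
        match = EXIT_RE.match(line.strip())
        if match and match.group("unit") == unit:
            exits.append(match.groupdict())
    return exits
```

Running this over /var/log/messages gives a list of timestamps and exit statuses that can be compared with tidb.log, tidb_stderr.log, and the application's own deployment or traffic changes.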

| username: h5n1 | Original post link

I recommend upgrading to v6.1.7, the latest release in the 6.1 series.

| username: Jack-li | Original post link

Try upgrading to a newer version and see whether it improves things.

| username: DBAER | Original post link

It looks like a bug. Does it happen consistently? Can you try upgrading?

| username: EDG-给我冲 | Original post link

From my perspective, it may not be a bug. We made some adjustments to our application architecture, and then this problem suddenly appeared. If it were a bug, it would not show up out of nowhere, since we have been running this version for a long time. Is there anything else we can investigate? We would rather avoid upgrading if we can. How significant is the impact of a hot upgrade?

| username: EDG-给我冲 | Original post link

It happens quite consistently. Since the day before yesterday, it has occurred a few times every day.

| username: EDG-给我冲 | Original post link

The impact is too significant. I don’t dare to upgrade.

| username: 呢莫不爱吃鱼 | Original post link

Could the application changes have caused an OOM (out of memory)?

| username: EDG-给我冲 | Original post link

No, it’s not an OOM (Out of Memory).

| username: Jellybean | Original post link

The original poster should first check the cluster’s Dashboard and TiDB-related Grafana panels, which should provide some useful information.

| username: xiaoqiao | Original post link

If it’s not an OOM issue, try upgrading in a different environment to see if it can be reproduced?

| username: EDG-给我冲 | Original post link

:sob: Sigh, I didn’t see any slow SQL in the dashboard, and I didn’t find any anomalies when looking at the curves in Grafana. It’s too difficult.

| username: h5n1 | Original post link

Upgrade to the latest version; it will surely fix the issue. In general, the .0 releases have a lot of bugs.

| username: EDG-给我冲 | Original post link

How much impact does an online hot upgrade have on the business? How can we upgrade with minimal disruption?

| username: shigp_TIDBER | Original post link

It doesn’t seem like a bug; after all, it was quite stable before. Monitor disk and memory around the restart times, then analyze from there.