After TiDB OOM, restarting the virtual machine causes cluster startup errors

translator_bot · June 23, 2024, 12:21am

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB oom之后重启虚拟机启动集群出错

| username: TiDBer_5M9L07sN

[TiDB Usage Environment] Test environment
[TiDB Version]
[Encountered Problem] After TiDB ran out of memory (OOM), the machine became sluggish, so the server was restarted. After the reboot, starting the cluster produced the error shown in the image below:

[Reproduction Path]
[Problem Phenomenon and Impact]

| username: TiDBer_5M9L07sN

I think the main reason is that the execution plan is different. You can use EXPLAIN ANALYZE to see the execution plan and the time spent in each operator.

| username: xfworld

How did you restart it? The logs show that PD crashed.

If restarting still can’t restore the PD state, data was likely lost, which is why the service is malfunctioning.

In that case, the only option is PD’s unsafe recovery method, which will probably result in data loss.

| username: TiDBer_5M9L07sN

Perform the related recovery operations.

| username: tidb狂热爱好者

I can’t imagine restarting the machine just because of an OOM (out of memory).

| username: Hacker007

Shouldn’t you avoid restarting all the servers at once? Can’t you manually start each instance now?

| username: TiDBer_5M9L07sN

Yes, is there any way to fix it?

| username: Hacker007

What is your startup sequence? Try starting PD first, then the TiDB server, and finally start TiKV.
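
The per-component start suggested here could be scripted roughly as follows. This is only a sketch: it assumes the cluster is managed with TiUP, and the cluster name "my-cluster" is a placeholder that does not come from this thread. The script only builds and prints the commands; it does not run them.

```python
import shlex

# Role order suggested in the reply above. Pass a different tuple to
# build_start_commands() to try another order.
SUGGESTED_ORDER = ("pd", "tidb", "tikv")


def build_start_commands(cluster_name, roles=SUGGESTED_ORDER):
    """Return one `tiup cluster start ... -R <role>` command per role, in order."""
    return [
        ["tiup", "cluster", "start", cluster_name, "-R", role]
        for role in roles
    ]


if __name__ == "__main__":
    # "my-cluster" is a placeholder; `tiup cluster list` shows the real name.
    for cmd in build_start_commands("my-cluster"):
        print(shlex.join(cmd))
```

When TiUP starts a whole cluster by itself, it brings up TiKV before TiDB, so calling build_start_commands("my-cluster", ("pd", "tikv", "tidb")) may be worth trying as well.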

| username: TiDBer_5M9L07sN

On this single node, PD, TiKV, and TiDB were started in that order and all of them started successfully; however, the cluster is still inaccessible.

| username: Hacker007

Check the TiDB logs.

| username: TiDBer_5M9L07sN

I started the components directly from the command line and saw a FATAL message in TiKV.

I didn’t see one in PD or TiDB.
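
To check each component's log file for such messages instead of watching the console, a small filter like the one below may help. It is a minimal sketch that assumes the logs use the bracketed TiDB log format, where each entry starts with a timestamp followed by the level, for example "[time] [FATAL] ...". The function name and the default levels are illustrative choices, not something taken from this thread.

```python
import re

# Matches the start of an entry such as: [<timestamp>] [FATAL] [...] ...
LEVEL_RE = re.compile(r"^\[[^\]]+\]\s+\[(?P<level>[A-Z]+)\]")


def find_lines_at_levels(lines, levels=("FATAL", "ERROR")):
    """Return the log lines whose level field is one of `levels`."""
    wanted = set(levels)
    hits = []
    for line in lines:
        match = LEVEL_RE.match(line)
        # Continuation lines (for example stack traces) have no level
        # field and are skipped.
        if match and match.group("level") in wanted:
            hits.append(line.rstrip("\n"))
    return hits
```

Any iterable of lines works, so an open log file can be passed directly, for example find_lines_at_levels(open(path, encoding="utf-8", errors="replace")), where path points at the tikv.log, pd.log, or tidb.log of the deployment.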

| username: TiDBer_CEVsub

It is recommended to perform data recovery.
