Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: TiKV莫名重启,TiKV leader掉底,时间段内查询失败 (TiKV restarts for no apparent reason, TiKV leaders drop to zero, queries fail during that period)
【TiDB Usage Environment】Production Environment
【TiDB Version】v5.1.1
【Reproduction Path】None
【Encountered Problem: Phenomenon and Impact】
The leader count of one TiKV node in the cluster periodically drops to zero, with no fixed interval. Checking the TiKV logs shows that the TiKV process restarted at those times.
Log context as follows:
【Resource Configuration】
【Attachments: Screenshots/Logs/Monitoring】
Go to the TiKV GitHub repository and file an issue; it looks like you have hit a bug. If it is a known issue, consider upgrading. According to the logs you are on v5.1.1, and a newer version may already include a fix.
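If you do go the upgrade route, a rolling upgrade with TiUP is the usual approach. A minimal sketch (the cluster name and target version below are placeholders, not taken from this thread):

# confirm the topology and current version first
tiup cluster display <cluster-name>
# rolling upgrade to whichever newer version you validate
tiup cluster upgrade <cluster-name> <target-version>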
IO error, Cannot allocate memory.
Check the OS logs?
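For example, on a systemd-based host you could look for OOM-killer activity with something like this (just an illustration, log locations vary by distro):

# OOM-killer activity in the kernel ring buffer
dmesg -T | grep -i -E 'out of memory|oom-killer'
# or search kernel messages in the journal
journalctl -k | grep -i oom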
There are no OOM-related entries in the system logs.
The machine has 128 GB of memory, and memory usage did not rise before the restart. Recently the restarts have become more frequent.
I see there is a similar issue that is still open.
Thank you.
Strangely, the restarts only happen on this particular TiKV node. I'm not sure whether taking this TiKV offline would solve the issue.
Not enough memory? Is there a memory failure? Check the OS logs.
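If a memory hardware fault is a concern, the kernel usually reports it as EDAC or machine-check (MCE) events; a quick sketch:

# kernel-reported memory errors (EDAC / machine-check events)
dmesg -T | grep -i -E 'edac|mce|hardware error'
# if rasdaemon is installed, it keeps per-DIMM error counts
ras-mc-ctl --error-count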
There are no issues in the system logs or with memory. However, the remaining lifetime of the data disk that stores TiKV is relatively low. Could that be the cause? We will replace the disk and keep observing.
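For anyone checking the same thing, disk wear can usually be read from SMART data with smartmontools; the device paths below are only examples:

# SATA SSD: look at wear attributes such as Media_Wearout_Indicator / Wear_Leveling_Count
smartctl -a /dev/sdb
# NVMe SSD: look at the "Percentage Used" field in the SMART/Health log
smartctl -a /dev/nvme0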
I want to know if there is any IO jitter when an issue occurs.
The detailed errors look like IO errors, one during flush and the other during compaction. Check whether the IO on that TiKV was saturated when its leaders dropped. A friend of mine also saw TiKV leaders drop, come back up, and then drop again.
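To see whether the disk was saturated when the leaders dropped, something like iostat helps for live observation (assuming the sysstat package is installed; watch %util and await on the TiKV data disk), and the disk-related panels in the cluster's Grafana monitoring show the same thing historically:

# extended device statistics refreshed every second
iostat -x 1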
It looks like the spike occurs after the restart and then gradually decreases.
Based on this chart and the error message, insufficient memory is very likely the cause of the exception. Memory usage in the chart already exceeds 80%, so even a slight fluctuation could lead to an out-of-memory (OOM) condition.
Execute journalctl -S '2022-11-29 00:20:00' -U '2022-11-29 01:00:00' as the root user to check for any clues before and after that time window.
After replacing the disk, the issue has not reoccurred. The error was most likely caused by the disk nearing the end of its life.
This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.