Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: tikv和tidb最近一直出现重启现象 (TiKV and TiDB have been restarting frequently recently)
【TiDB Usage Environment】Production Environment
【TiDB Version】v5.2.1
【Encountered Problem】Frequent restarts of TiKV and TiDB recently
【Reproduction Path】Cannot reproduce
【Problem Phenomenon and Impact】
Impact: Leader election after a restart is very sluggish, and SQL query latency is very high.
Problem Phenomenon: The TiKV log shows "Welcome to TiKV" (indicating a restart), with a gap of several seconds in the log before and after it.
Monitoring files:
b2b-Overview_2022-06-30T07_27_39.411Z.json (2.2 MB)
b2b-TiKV-Details_2022-06-30T06_11_35.859Z.json (25.5 MB)
b2b-TiDB_2022-06-30T07_26_34.011Z.json (5.1 MB)
Take a look at the logs of TiKV and TiDB.
One moment, I’ll export the monitoring data shortly.
There is a log folder in the deployment directory of each component of your cluster. You can check the logs there, which is the most straightforward way.
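If the cluster is still reachable over SQL, you can also search all components' logs in one place through information_schema.CLUSTER_LOG. A minimal sketch, with a placeholder time window and pattern that you would tighten to the actual incident:

```sql
-- Search every component's logs for TiKV startup lines around the restart.
-- The time window and pattern are placeholders; narrow them to the incident.
SELECT time, type, instance, level, message
FROM information_schema.cluster_log
WHERE time BETWEEN '2022-06-30 06:00:00' AND '2022-06-30 08:00:00'
  AND message LIKE '%Welcome to TiKV%';
```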
Here are the abnormal logs from around the time of the restart.
Link: Baidu Netdisk (enter the extraction code to access). Extraction code: lmnl
Check /var/log/messages to see the reason for the restart. Could it be due to an OOM (Out of Memory) issue?
It’s possible. Check the memory usage monitoring at that time.
From the monitoring, this looks like a typical OOM: the machine ending in 56 ran out of memory.
Okay, thank you very much.
The issue was finally resolved. The root cause was a memory leak after TiDB GC failures: memory only grew and was never released. Because TiDB and TiKV were deployed on the same node, when the OS OOM killer fired it killed TiKV; after TiKV restarted, TiDB was then killed as well due to insufficient memory. By investigating the GC anomalies (the errors were mainly concentrated on analyze taking too long), we adjusted the tidb_gc_life_time parameter. The remaining issue was a column that was too short, which caused statistics collection to fail (see the thread "统计信息收集报错 Data too long for column 'upper_bound'" ["Statistics collection error: Data too long for column 'upper_bound'"] on the TiDB Q&A community).
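For anyone hitting the same thing, a minimal sketch of how to check GC progress and adjust tidb_gc_life_time on v5.x (the '10m' below is only an example value, not a recommendation):

```sql
-- See when GC last ran and where the safe point currently sits.
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM mysql.tidb
WHERE VARIABLE_NAME IN ('tikv_gc_life_time', 'tikv_gc_last_run_time', 'tikv_gc_safe_point');

-- Since v5.0 the GC retention window is also a global system variable.
SHOW GLOBAL VARIABLES LIKE 'tidb_gc_life_time';

-- Adjust how long old MVCC versions are kept; '10m' is an example value.
SET GLOBAL tidb_gc_life_time = '10m';
```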
This is a mixed deployment. PD, TiKV, and TiDB must be deployed separately.
If you do run a mixed deployment, you must limit the memory usage of TiKV and TiDB.
Otherwise, a single SQL query that uses 1 GB of memory can trigger an OOM.
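As a stopgap while the mixed deployment remains, memory can be capped from SQL. A rough sketch assuming v5.2 behavior (the sizes are placeholders; leave headroom for the OS and the co-located components):

```sql
-- Shrink TiKV's block cache online; this is not persisted, so also set
-- storage.block-cache.capacity in the topology for a permanent change.
SET CONFIG tikv `storage.block-cache.capacity` = '4GiB';

-- In v5.2 tidb_mem_quota_query is session-scoped; the instance-wide default
-- comes from mem-quota-query in the TiDB configuration file.
SET SESSION tidb_mem_quota_query = 1073741824;  -- 1 GiB per query
```

For permanent settings, edit the topology with tiup cluster edit-config and reload the affected instances.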
You can’t control developers writing bad SQL, so it’s best to have TiDB on a separate machine.
Yes, mainly because we don't have enough resources right now; we'll have to wait a while.