Troubleshooting TiDB Server OOM Issues

translator_bot · June 20, 2024, 10:21pm

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb-server oom问题排查

| username: jaybing926

Version: v5.4.3
Issue: Recently, I have been receiving the TIDB_node_restart alert frequently. Upon checking the monitoring, it appears that high memory usage is the cause. Due to the diverse and extensive usage by various business operations, I want to pinpoint the specific problematic SQL. How can I locate it?

Here are some screenshots of the issue:

I have tried searching for slow query SQL on the Dashboard, but it seems I couldn’t find any SQL that consumes such a large amount of memory. I wonder if it’s because the SQL hasn’t completed execution, so the logs haven’t recorded it. How can I specifically locate it?

translator_bot · June 20, 2024, 10:21pm

| username: tidb狂热爱好者 | Original post link

You have already located it, it’s these SQLs.

translator_bot · June 20, 2024, 10:21pm

| username: wluckdog | Original post link

When checking TiDB logs, there will be expensive SQL statements.

translator_bot · June 20, 2024, 10:21pm

| username: TiDBer_小阿飞 | Original post link

translator_bot · June 20, 2024, 10:21pm

| username: tidb狂热爱好者 | Original post link

Scold the person who wrote the SQL. Then lock them in a small dark room. Deploy their group’s application separately in a TiDB server frontend component. Upgrade their dedicated line. If any TiDB instance crashes, their own business will be unavailable without affecting others.

translator_bot · June 20, 2024, 10:21pm

| username: tidb菜鸟一只 | Original post link

Please share the execution plan for the first SQL statement you mentioned above.

translator_bot · June 20, 2024, 10:21pm

| username: kelvin | Original post link

This is possible.

translator_bot · June 20, 2024, 10:21pm

| username: zhimadi | Original post link

It’s only 4 entries totaling over 400MB, it shouldn’t be an issue. Our cluster used to have just 16GB*3 of memory, and at night, the statistics application often handled multiple entries with a maximum of 1GB without any problems. Could it be due to some other reason?

translator_bot · June 20, 2024, 10:21pm

| username: jaybing926 | Original post link

Look at my monitoring data. When I encountered an OOM, the memory usage reached over 60GB, but in the dashboard, it only shows a few hundred MB. It doesn’t match up.

translator_bot · June 20, 2024, 10:21pm

| username: 源de爸 | Original post link

First optimize the top 10 SQL queries and then observe the results. Often, it is not a single SQL query causing the issue.

translator_bot · June 20, 2024, 10:21pm

| username: Inkjade | Original post link

Conduct a performance analysis when there are machine issues. Analyze the flame graph and combine it with slow SQL to identify the specific reasons.

translator_bot · June 20, 2024, 10:21pm

| username: 小于同学 | Original post link

This is possible.

translator_bot · June 20, 2024, 10:21pm

| username: 饭光小团 | Original post link

Generally, we look at the expensive logs in the TiDB logs.

translator_bot · June 20, 2024, 10:21pm

| username: 小龙虾爱大龙虾 | Original post link

The slow log is recorded when the SQL execution ends. When an OOM occurs, the process is directly killed, so it is indeed possible that it is not recorded. For the logs that are recorded, you can first check why there is excessive memory usage. Then, use the log search function on the Dashboard to search the logs of the TiDB server by using the keyword “expensive” to find some SQL statements.

translator_bot · June 20, 2024, 10:21pm

| username: Hacker_QGgM2nks | Original post link

Directly set up monitoring to check for the keyword “expensive” in the logs, but some might not be captured.

translator_bot · June 20, 2024, 10:21pm

| username: DBAER | Original post link

Optimize it yourself based on the dashboard.