Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: Tidb频繁oom大佬帮我分析下SQL (TiDB OOMs frequently, please help me analyze the SQL)
[TiDB Usage Environment] Production Environment
[TiDB Version]
Cluster version: v5.4.3
[Encountered Problem: Phenomenon and Impact]
Recently, due to business changes, queries have caused frequent OOM on TiDB nodes, with severe memory usage.
I don’t understand SQL optimization very well, can someone help me analyze if there’s any issue with the SQL?
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]
The following is a SQL statement, found in TiDB Dashboard, that consumes a lot of memory. Is its 778M memory consumption a problem? Could this SQL be what is causing the frequent OOMs by exhausting memory?
For troubleshooting, check Top SQL at the time of the OOM to see which statements were consuming the most resources. To keep the system from becoming unavailable due to OOM, you also need to limit the maximum memory each query can use.
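For reference, this is roughly what the per-query cap looks like in v5.4 (a minimal sketch; the 2 GiB value is only an illustration, not a recommendation):

-- Cap the memory a single query may use in this session (the value is in bytes)
SET SESSION tidb_mem_quota_query = 2147483648;  -- 2 GiB
-- The instance-wide default and the behavior on breach come from the tidb.toml items
-- mem-quota-query and oom-action ("cancel" aborts the offending query, "log" only logs it)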
The statistics are outdated; collect them first and take another look. Then pull the execution plan back and you can see the actual actRows.
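A minimal sketch of those two steps (the table and query here are placeholders, not the actual schema from the post):

-- Re-collect statistics for the table(s) used by the problem query
ANALYZE TABLE orders;
-- Then run the query under EXPLAIN ANALYZE and compare estRows against the actual actRows
EXPLAIN ANALYZE SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id;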
Why has your PD been picked by the OOM killer more often than TiDB? The most ridiculous part is that even zabbix_agentd has been killed. As I recall, that zabbix process uses no more than 100M of memory; I can't see how the OOM killer would ever pick it.
It's most likely just overall memory exhaustion; all of the processes above have been killed at one point or another~
It seems that there might not be enough memory, causing various processes to compete for memory, and some processes triggered the OOM killer. 
What are the specifications of your servers?
Given the memory constraints, first identify the heaviest statements, then optimize the slow SQL queries one by one.
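A possible way to do that from the slow query log (a sketch, assuming the default slow-query logging is still enabled; the one-day window is arbitrary):

-- Statements with the highest peak memory usage across all TiDB instances in the last day
SELECT time, query_time, mem_max, query
FROM information_schema.cluster_slow_query
WHERE time > NOW() - INTERVAL 1 DAY
ORDER BY mem_max DESC
LIMIT 10;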
It really shouldn’t be.
The oom_score of TiDB
[root@tidb1 ~]# cat /proc/20428/oom_score
559
The oom_score of PD
[root@tidb1 ~]# cat /proc/17744/oom_score
52
I also have TiDB and PD on the same server. Normally, the score of PD is much lower than that of TiDB. It shouldn’t be selected by the oom-killer so frequently.
I checked, and it's all this same SQL. Many of these queries are each taking 1.6 GiB, which is even more than the 778M I mentioned above.
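For reference, the live check looks roughly like this (a sketch; MEM is in bytes and nothing here is specific to this cluster):

-- Running statements ranked by current memory usage across all TiDB instances
SELECT instance, id, time, mem, info
FROM information_schema.cluster_processlist
ORDER BY mem DESC
LIMIT 10;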
I had already changed the memory limit for a single SQL to 3G before, and I’m afraid reducing it further might affect normal business operations.
If a single SQL query uses 1.6G, no amount of memory will ever be enough, so the SQL still needs to be optimized. Is there still room to optimize the SQL statement I posted above? Can it be made to use less memory?
With 1.6G per SQL query and that volume of requests, OOM priority no longer means much; everything just gets killed, and the single-instance Redis on that machine gets taken down with it.
Indeed. I thought it was a hardware issue.
You deployed everything on one machine.
Given limited resources, I run TiDB and PD on the same node.
What I’m thinking now is to try executing this SQL with TiFlash and see what happens.
The main reason for the high memory usage is that the aggregation is happening on the TiDB node. It would be best to push that aggregation down; if TiFlash does the aggregation, TiDB just receives the results.
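For reference, a rough way to try that (a sketch; orders and the query are placeholders, and whether the aggregation actually gets pushed down still depends on the plan the optimizer produces):

-- Add a TiFlash replica for the table involved
ALTER TABLE orders SET TIFLASH REPLICA 1;
-- Wait until the replica reports AVAILABLE = 1
SELECT table_schema, table_name, available, progress FROM information_schema.tiflash_replica;
-- Let this session read from TiFlash (excluding tikv steers table scans to TiFlash)
SET SESSION tidb_isolation_read_engines = 'tidb,tiflash';
-- Or force it per statement with a hint
SELECT /*+ READ_FROM_STORAGE(TIFLASH[orders]) */ customer_id, COUNT(*)
FROM orders GROUP BY customer_id;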
Redis shouldn't be mixed into this at all; just move it to another machine…