After TiDB-Server OOM, all SQL executions become very slow

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB-Server OOM起来后,所有SQL执行都很慢

| username: 脚本小王子

[TiDB Usage Environment] Production Environment
[TiDB Version] v5.3.0
[Reproduction Path] Cannot be manually reproduced in the production environment
[Encountered Problem: Problem Phenomenon and Impact]
One of the TiDB-Servers experienced an OOM. After the service restarted, executing SQL was slow (or even failed to execute), with a large number of SQL statements piling up. A particular table had very high traffic. Executing show stats_healthy produced no output. Even after stopping the business and killing all SQL, querying small tables still yielded no results. Executing ANALYZE TABLE on a table with only a few records had no response. Only after waiting for about ten minutes, when show stats_healthy produced results, did all SQL queries return to normal.
This problem first occurred on Monday this week and has since happened 1-2 times daily. Previously, OOMs also occurred frequently, but the TiDB-Server was normal after the service restarted.

[Resource Configuration] Enter TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page

[Attachments: Screenshots/Logs/Monitoring]


log.rar (347.8 KB)

| username: 连连看db | Original post link

Read and write slowness might be due to TiKV restarting and re-electing. Additionally, check if the block-cache is configured too large.

| username: zxgaa | Original post link

If the given cache is relatively large, it is relatively slow. Now, when I restart our cluster’s TiDB server, it takes at least one or two hours to load completely.

| username: 脚本小王子 | Original post link

TiKV did not restart, it was the TiDB-Server that restarted.

| username: 脚本小王子 | Original post link

Previously, it often experienced OOM, but it would be fine after restarting. It only started behaving like this since Monday.

| username: yiduoyunQ | Original post link

v5.3.0 has no solution; you can only manually control the frontend LB traffic and let it in after TiDB is ready.

| username: 小龙虾爱大龙虾 | Original post link

Check if the SQL is running with incorrect execution plans, look at the startup logs, and pay attention to init stats.

| username: 像风一样的男子 | Original post link

Why are the configurations of your TiDB nodes so different?

| username: tidb菜鸟一只 | Original post link

When the tidb-server restarts, it will reload the statistics. If the loading time is too long, it will forcibly stop loading and start directly. At this time, some SQLs may go to the wrong execution plan, causing a large amount of data scanning on TiKV and resulting in a large number of slow SQLs in the database. I think you should first analyze the cause of your OOM. Although tidb-server is stateless, frequent OOMs can still affect the business.

| username: 脚本小王子 | Original post link

Thank you for your reply.
The execution plan is not wrong. I took out the SQL that was stuck for a long time and checked the explain, and it was fine. Additionally, when TiDB-Server starts, the logs report “[ERROR] [client.go:845] [“[pd] update connection contexts failed”] [dc=global] [error=“rpc error: code = Canceled desc = context canceled”]” and quite a few “get timestamp too slow”. I wonder if these have any impact.

| username: 脚本小王子 | Original post link

Thank you for the reply.
The method should be feasible, but if there are other solutions, I would still prefer not to solve it by upgrading the database, as upgrading may involve many unknown risks.

| username: 小龙虾爱大龙虾 | Original post link

Take a look at the slow queries right after startup.

| username: 脚本小王子 | Original post link

Different TiDB-Servers correspond to different services, so the configurations vary quite a bit.

| username: 像风一样的男子 | Original post link

You can view which SQL statements are using the most resources in TopSQL for a single TiKV CPU.

| username: 脚本小王子 | Original post link

Yes, when it just starts, almost no SQL can be executed, but switching to another TiDB allows normal execution.

| username: 脚本小王子 | Original post link

A single TiKV CPU being high is likely due to a table experiencing a large number of reads, but this doesn’t explain why SQL execution fails in a newly started TiDB server. The same SQL can be executed quickly on another TiDB server.

| username: 脚本小王子 | Original post link

Thank you for the reply.
What’s strange is that the SQL that cannot be executed shows a normal execution plan when using EXPLAIN on the same TiDB server.
OOM needs to be resolved, but it’s very difficult for TiDB to completely avoid OOM, so this issue also needs to be addressed.

| username: wluckdog | Original post link

On the TiDB instance that experienced OOM, you can find the SQL related to the OOM event. The keyword for resource-intensive SQL is “expensive_query”. Check the execution plan to see if there is a full table scan to avoid OOM. Your TiKV instance only has 32GB of memory, which is relatively small. A full table scan can significantly impact the cluster’s IO. The “show stats_healthy” command can only check whether the table’s statistics are healthy.

| username: Sunward | Original post link

Is there a write hotspot on that TiKV with such high read and write traffic?

| username: wangccsy | Original post link

It’s already OOM, so everything is definitely slow.