Out of Memory Quota Occurred in Stream Query

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 流式查询出现Out Of Memory Quota

| username: TiDBer_gU2A9xkv

[TiDB Usage Environment] Production Environment
[TiDB Version] 5.3.0
[Reproduction Path] Querying a large table (100 million+ rows) using the following streaming query


However, the database still threw an Out Of Memory Quota![conn_id=94979] error.
mem-quota-query is set to 1G.
I'd like to ask the experts: is it that a streaming query only reduces how many records are returned to the client at a time, while on the server side all the data is still read at once, so a result set exceeding 1G triggers this error?

| username: Raymond | Original post link

Post your query and execution plan. The server definitely won’t fetch all the data at once.

| username: TiDBer_gU2A9xkv | Original post link

Executed SQL and plan:

By the way, here is the stream writing method in the code:

| username: 张雨齐0720 | Original post link

Is there a way to enable cursor query in the JDBC URL?
Add useServerPrepStmts=true&useCursorFetch=true
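For reference, cursor fetch is enabled on the JDBC connection URL; the host, port, and database name below are placeholders, and with MySQL Connector/J the statement must also set a positive fetch size for the cursor to take effect:

```
jdbc:mysql://127.0.0.1:4000/test?useServerPrepStmts=true&useCursorFetch=true
```

On the Java side this is paired with something like `stmt.setFetchSize(1000)` before executing the query.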

| username: TiDBer_gU2A9xkv | Original post link

Adding it doesn't help either.

| username: TiDBer_gU2A9xkv | Original post link

Still Out Of Memory Quota

| username: Raymond | Original post link

  1. Add an index.
  2. Run `SET tidb_distsql_scan_concurrency = 2;` to reduce the memory usage of the statement.

That said, adding an index is the recommended fix.
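A sketch of the two suggestions as SQL. The table name `t` is a placeholder, and the index columns are taken from the sort columns mentioned later in the thread; adjust both to the real schema:

```sql
-- Lower per-statement scan concurrency for the current session
SET tidb_distsql_scan_concurrency = 2;

-- Add an index covering the sort columns
CREATE INDEX idx_patientid_visitid ON t (patientid, visitid);
```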

| username: 张雨齐0720 | Original post link

For a full table scan, the answer is to add an index and go through secondary index lookups; there doesn't seem to be a much better option.
Check the monitoring to see how much memory this SQL is occupying.

| username: 我是咖啡哥 | Original post link

The SQL sorts by two ORDER BY columns and performs a full table scan, pulling all of the table's data to the TiDB server for sorting, which easily causes an OOM. First check whether project_id filters out a lot of data, and if so create an index on it.

| username: TiDBer_gU2A9xkv | Original post link

In our actual scenarios, project_id barely filters any data. So my understanding is that because the data volume is too large, it all gets sorted on the TiDB server, causing the OOM, right?

| username: TiDBer_gU2A9xkv | Original post link

I checked the monitoring, and this SQL does consume a lot of memory. Because it is a big-data query, we used this cursor approach, but I didn't expect it to still hit an OOM error.

| username: TiDBer_gU2A9xkv | Original post link

Is the memory usage the same for streaming queries and regular queries in TiDB?

| username: TiDBer_gU2A9xkv | Original post link

My understanding is that a streaming query should fetch results in batches, unlike a regular query that retrieves the whole result set at once, right?

| username: 我是咖啡哥 | Original post link

Try removing the ORDER BY.
My understanding is that the ORDER BY is what makes streaming unavailable: until all the data is sorted, the server cannot know which row to return first.

| username: Raymond | Original post link

In theory, streaming queries will use less memory, but the performance will be slightly worse.

| username: Kongdom | Original post link

Another option is to increase the value of mem-quota-query. This configuration mainly protects the latency of TP workloads; your query is clearly AP rather than TP, so you can raise this parameter appropriately.
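In TiDB 5.3 the same limit can also be raised per session through the `tidb_mem_quota_query` system variable (value in bytes), without editing the server config; a sketch, assuming 4 GiB is an acceptable ceiling for this AP query:

```sql
-- Raise the per-query memory quota to 4 GiB for the current session only
SET tidb_mem_quota_query = 4 << 30;
```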

| username: TiDBer_gU2A9xkv | Original post link

I think so too. It was this sort that pulled all the data into the TiDB server, pushing a single query over 1GB. I tried adding an index on patientid and visitid, and it worked.

| username: tidb菜鸟一只 | Original post link

I think a composite index on project_id, patientid, and visitid is needed. Is adding only the latter two columns really enough?

| username: TiDBer_gU2A9xkv | Original post link

We didn't add projectId because, in the current data, projectId doesn't filter out much.

| username: 我是咖啡哥 | Original post link

I agree. Even though project_id doesn't filter much data, including it lets the query exploit the ordering of the index and avoid a re-sort.
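To illustrate the point: if the WHERE clause fixes the leading index column, rows can be returned in index order and the TiDB-side Sort operator disappears from the plan. A hypothetical sketch (table `t` and the literal filter value are assumptions, not the original schema):

```sql
CREATE INDEX idx_proj_patient_visit ON t (project_id, patientid, visitid);

-- With project_id pinned by the filter, index order already matches
-- ORDER BY (patientid, visitid), so no extra sort is needed.
EXPLAIN SELECT * FROM t
WHERE project_id = 1
ORDER BY patientid, visitid;
```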