Questions about OOM issues during TiDB data migration

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb 迁移数据OOM问题的疑惑 (Doubts about OOM during TiDB data migration)

| username: Hacker_9LYnzJhP

【TiDB Usage Environment】Production Environment

【TiDB Version】5.4

【Encountered Problem: Problem Phenomenon and Impact】
Migrating full table data from TiDB to Hive: why does migrating the full 3.9 billion rows take 5 hours without any issues, while migrating 30 million rows of incremental data takes 40 minutes and ends in OOM?
Full migration SQL: select * from table
Incremental migration SQL: select * from table where time > '2023-02-15 00:00:00'

Does TiDB have any optimization strategies for simple single-table queries like select * from table, such as fetching data from TiKV without occupying TiDB’s memory and directly sending it to the client?
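TiDB does stream result chunks back to the client rather than materializing the whole result set, but each query is still subject to the per-session memory quota, and some plan operators buffer intermediate data. A minimal sketch of raising that ceiling for one session (the variable `tidb_mem_quota_query` exists in v5.4 and is set in bytes; the table and column names below are illustrative):

```sql
-- Raise the per-query memory quota for this session to 4 GiB
-- (the v5.4 default is 1 GiB). This only raises the ceiling;
-- it does not reduce the memory the plan actually needs.
SET SESSION tidb_mem_quota_query = 4294967296;
SELECT * FROM `table` WHERE time > '2023-02-15 00:00:00';
```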

TiDB error log is as follows:
[2023/02/16 02:18:12.207 +00:00] [INFO] [conn.go:1115] ["command dispatched failed"] [conn=2127961] [connInfo="id:2127961, addr: status:10, collation:utf8_general_ci, user:lbs_computer"] [command=Query] [status="inTxn:0, autocommit:1"] [sql="select * from table where time > '2023-02-15 00:00:00'"] [txn_mode=PESSIMISTIC] [err="Out Of Memory Quota![conn_id=2127961]

Program error log is as follows:
[2023-02-15T15:05:12.729+08:00] [INFO] plumber.reader.mysql-reader.invoke(logging.clj 272) [clojure-agent-send-off-pool-2] : MySQL database exception information is as follows:
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure

The last packet successfully received from the server was 1,799,921 milliseconds ago. The last packet sent successfully to the server was 2,482,983 milliseconds ago.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
at java.lang.reflect.Constructor.newInstance(
at com.mysql.jdbc.Util.handleNewInstance(
at com.mysql.jdbc.SQLError.createCommunicationsException(
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(
at com.mysql.jdbc.MysqlIO.checkErrorPacket(
at com.mysql.jdbc.MysqlIO.checkErrorPacket(
at com.mysql.jdbc.MysqlIO.nextRow(
at com.mysql.jdbc.RowDataDynamic.nextRecord(
at com.mysql.jdbc.RowDataDynamic.close(
at com.mysql.jdbc.ResultSetImpl.realClose(
at com.mysql.jdbc.ResultSetImpl.close(
at jdbc.jdbc$db_with_query_results_STAR_.invoke(jdbc.clj:670)
at plumber.reader.db_reader$fetch_from_db$fn__11826.invoke(db_reader.clj:12)
at jdbc.jdbc$with_connection_STAR_.invoke(jdbc.clj:847)
at plumber.reader.db_reader$fetch_from_db.invoke(db_reader.clj:8)
at plumber.reader.mysql_reader$fetch.invoke(mysql_reader.clj:14)
at plumber.reader.reader$eval13347$fn__13348.invoke(reader.clj:38)
at clojure.lang.MultiFn.invoke(
at plumber.reader.reader$reader$fn__13472.invoke(reader.clj:183)
at plumber.reader.reader$reader.invoke(reader.clj:182)
at plumber.job$sub_job$reify__15622$fn__15624.invoke(job.clj:69)
at clojure.core$binding_conveyor_fn$fn__369.invoke(core.clj:1910)
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
Caused by: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
at com.mysql.jdbc.MysqlIO.readFully(
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(
… 25 more
[2023-02-15T15:05:12.730+08:00] [INFO] plumber.monitor.LogErrorMatch.matchLogError( 33) [clojure-agent-send-off-pool-2] : dbType:mysql
[2023-02-15T15:05:12.730+08:00] [INFO] plumber.monitor.LogErrorMatch.matchLogError( 34) [clojure-agent-send-off-pool-2] : errorMessage:Communications link failure

| username: tidb菜鸟一只 | Original post link

Is there an index on the time field?

| username: Raymond | Original post link

Does TiDB have any optimization strategies for simple single-table queries like select * from table? For example, when executing this SQL, can the data be fetched from TiKV and sent directly to the client without occupying TiDB’s memory?

If such queries cause TiDB to run out of memory (OOM), you can try the following adjustments:

  1. Set tidb_distsql_scan_concurrency to 2 or 3.
  2. Set tidb_enable_chunk_rpc to off.

Both of these adjustments will slow down the execution of the query, but they should help control memory usage. You can try both.
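The two adjustments above can be applied at the session level before running the export query, for example (table name illustrative):

```sql
-- Lower the concurrency of distsql scan requests sent to TiKV
SET SESSION tidb_distsql_scan_concurrency = 2;
-- Disable the Chunk format for coprocessor RPCs
SET SESSION tidb_enable_chunk_rpc = OFF;
SELECT * FROM `table` WHERE time > '2023-02-15 00:00:00';
```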

| username: Hacker_9LYnzJhP | Original post link

My question is:
Migrating the full SQL: select * from table was successful.
Migrating the incremental SQL: select * from table where time > '2023-02-15 00:00:00' resulted in OOM.
The full-table query exported all 3.9 billion rows successfully, but the index-based query of 30 million rows resulted in OOM. Why is this happening?

| username: tidb菜鸟一只 | Original post link

Can you check the execution plans of the two SQLs? Is there an index on the “time” column, causing the second SQL to use an index scan?
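To compare the two plans as suggested, `EXPLAIN` can be run on both statements (table name illustrative). If the second plan shows `IndexRangeScan`/`IndexLookUp` operators instead of `TableFullScan`, the `time` index is being used:

```sql
EXPLAIN SELECT * FROM `table`;
EXPLAIN SELECT * FROM `table` WHERE time > '2023-02-15 00:00:00';
```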

| username: Hacker_9LYnzJhP | Original post link

The index was used, yet it still hit OOM?

| username: tidb菜鸟一只 | Original post link

Using an index is not necessarily faster or more reasonable than a full table scan…
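If the index scan is what drives memory up (an `IndexLookUp` must fetch and buffer rows from both the index and the table), one hedged option is to force a full table scan with TiDB's `IGNORE_INDEX` hint. The index name `idx_time` below is hypothetical; check `SHOW INDEX FROM` your table for the real name:

```sql
-- Force the optimizer to skip the (hypothetical) idx_time index
SELECT /*+ IGNORE_INDEX(`table`, idx_time) */ *
FROM `table`
WHERE time > '2023-02-15 00:00:00';
```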

| username: xingzhenxiang | Original post link

If it really doesn’t work, just add a limit to restrict the number of data rows.
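A common way to apply that limit without re-reading skipped rows is keyset pagination, exporting one bounded batch at a time. A sketch, assuming the table has an integer primary key `id` (hypothetical):

```sql
-- Export in batches; after each batch, substitute the largest id
-- seen so far for the 0 in the id predicate and run again.
SELECT * FROM `table`
WHERE time > '2023-02-15 00:00:00'
  AND id > 0          -- replace 0 with the last id from the previous batch
ORDER BY id
LIMIT 100000;
```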

| username: TI表弟 | Original post link

For migration, it is recommended to use a dump tool, or to add nodes and gradually decommission the old ones. Querying all of the data in one shot will very likely fail.

| username: liuis | Original post link

Add a limit restriction?

| username: ealam_小羽 | Original post link

Try DataX? We use DataX for heterogeneous incremental synchronization based on update_time. The update_time has an index, and DataX will batch insert the data.

| username: Kongdom | Original post link

Is it possible that some memory was occupied during the full migration and not immediately released, causing insufficient memory during the incremental migration?

| username: buddyyuan | Original post link

Please upload the record files from the directory that TiDB generates when an OOM occurs.

| username: tracy0984 | Original post link

Consider checking the memory usage of the TiDB nodes before executing the incremental migration SQL. For multiple large SQL statements, avoid running them back-to-back on the same TiDB server node; distributing them across multiple TiDB servers is better.