Issues Regarding Single Table Insert Performance

translator_bot · June 22, 2024, 6:34pm

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 关于单表插入性能的问题

| username: Qiuchi

【TiDB Usage Environment】Testing
【TiDB Version】6.1.0
【Reproduction Path】Insert data into a single table using JDBC, comparing the performance of single connection and multiple connections
【Encountered Problem】
When inserting with a single connection, the speed is about 5k/s; when inserting with 3 connections simultaneously, the speed is about 15k/s; after adding more connections, the overall speed does not increase, and the speed of a single connection slows down. Currently, innodb_thread_concurrency=0 is set. Are there any parameters in TiDB or KV that might affect the insertion performance in this scenario?

translator_bot · June 22, 2024, 6:34pm

| username: WalterWj | Original post link

translator_bot · June 22, 2024, 6:34pm

| username: Qiuchi | Original post link

Currently, batch update is already being used during testing.

translator_bot · June 22, 2024, 6:34pm

| username: WalterWj | Original post link

Let’s check the cluster pressure:

Are there any hotspots?
CPU and memory usage of each component
Is TiKV’s I/O already a bottleneck?

translator_bot · June 22, 2024, 6:34pm

| username: Qiuchi | Original post link

The same test data is used in both cases, there are hotspot issues, but it should not be the variables.
When multiple connections are inserting, the CPU usage of both TiDB and TiKV components is higher than when a single connection is used.
Given that the overall speed is faster with multiple connections, can we disregard the TiKV IO bottleneck?

Can I assume that, in terms of mechanism, there should be no significant difference in single-table insertion performance between single connection and multiple connections?

translator_bot · June 22, 2024, 6:34pm

| username: TiDBer_jYQINSnf | Original post link

Only three links are maxed out? If adding more links doesn’t increase the overall throughput, then you need to check whether the CPU of TiDB or TiKV is maxed out, and then either scale up accordingly or balance the hotspots.

translator_bot · June 22, 2024, 6:34pm

| username: Qiuchi | Original post link

Sorry, I may not have described it clearly. This is a scenario of inserting a large amount of data at once. During testing, I found that the performance of single-connection insertion was not good, so I tried multi-connection simultaneous insertion and found that the overall speed (combined) improved a lot. Therefore, I wanted to know if there is some kind of limitation on the efficiency of single-connection insertion. When we previously used Oracle, even when using multiple connections to insert, the combined speed was basically the same as that of a single connection, and this speed was much faster than the single-connection speed we tested with TiDB.

translator_bot · June 22, 2024, 6:34pm

| username: TiDBer_jYQINSnf | Original post link

That is for sure. For example, if the latency of a single link and a single query is 10 milliseconds, then the QPS (queries per second) would be at most 100 because it is serial. However, this does not mean that TiDB cannot handle more. TiDB has more CPUs to handle more requests because it is a distributed database. Each query requires many network interactions, so the latency is relatively high. But while waiting for network I/O, the CPUs of TiDB and TiKV nodes are still idle, and they can handle other requests during this time. Therefore, the more connections there are, the better the performance, until the CPU is fully utilized.

translator_bot · June 22, 2024, 6:34pm

| username: Qiuchi | Original post link

I just found that even with preparedStmt, batchUpdate, and rewriteBatchedStatements (size around 2000-10000), this insert SQL still takes a lot of time to parse and compile during execution, even taking up more than half of the total time (first row in groups of 2000, second row in groups of 10000).

translator_bot · June 22, 2024, 6:34pm

| username: TiDBer_jYQINSnf | Original post link

parse refers to parsing SQL; if a large number of values are sent at once, the parsing time will be longer.
compile refers to the query optimization time.

translator_bot · June 22, 2024, 6:34pm

| username: gcworkerishungry | Original post link

I have encountered the issue where the batch size is too large, and reducing it allows the use of the plan cache.

translator_bot · June 22, 2024, 6:34pm

| username: dba-kit | Original post link

You can refer to an official optimization practice, where the parameter settings are general and can significantly reduce the number of useless SQLs.

translator_bot · June 22, 2024, 6:34pm

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.