Issues with Batch Insert into Select

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: batch insert into select 的问题 ("issues with batch insert into select")

| username: zhanggame1

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] 7.1
[Reproduction Path] Operations performed that led to the issue
[Encountered Issue: Problem Phenomenon and Impact]
The ticket table has an auto-increment column ticket_id as the primary key, and there is also a unique constraint. Performing the following operation:
create table ticket1 like ticket;
batch on ticket_id limit 10000 insert into ticket1 select * from ticket;
During execution, an error like "1062 - Duplicate entry '358881-024919679577841035240089' for key 'ticket1.unique_ticket'" reliably occurs.
Although the error is reported, inspection shows that all the data was actually inserted successfully. This looks like a bug.

| username: redgame | Original post link

One possibility is that a small amount of duplicate data appears during insertion and is rejected by the existing unique index. Another possibility is that at some point the amount of duplication crossed a threshold, which triggered this error.

| username: zhanggame1 | Original post link

Theoretically, the original table and the new table are exactly the same, so duplicate data should be impossible. The batch insertion is sharded on ticket_id, and ticket_id is an auto-increment ID that cannot contain duplicates.

| username: 有猫万事足 | Original post link

I didn’t encounter a Duplicate entry error with this statement. However, if the source table’s structure is not designed for high-concurrency writes and the table has around 300,000 to 400,000 rows, a write hotspot occurs when importing into the cloned table. :joy:

| username: zhanggame1 | Original post link

The speed is okay, faster than a plain insert into ... select, and it uses less memory. A direct insert into ... select over millions of rows consumes a lot of memory. I use a limit of 10,000.

| username: jansu-dev | Original post link

  1. Ensure that the statement does not require atomicity, meaning that it allows some rows to be modified while others are not. → Non-Transactional DML Statements | PingCAP Documentation Center
  2. Ensure that the statement does not modify the content that the statement itself will read; otherwise, subsequent batches may read the content written by previous batches, leading to unexpected situations.
  3. I did a preliminary check, and there are still no known issues. Can you simulate a desensitized reproduction process in the test environment based on the existing data? It’s hard to investigate :thinking:
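Point 2 in the checklist above can be illustrated with a short sketch. The single-table statement below is hypothetical (table and column names are not from the thread); the two-table form at the end is the pattern actually used in this thread:

```sql
-- Risky: the SELECT reads the same table the INSERT writes, so later
-- batches may re-read rows that earlier batches already inserted.
BATCH ON id LIMIT 1000
INSERT INTO t (v) SELECT v FROM t;

-- The pattern in this thread avoids that: source and target are
-- different tables, so batches never see each other's writes.
BATCH ON ticket_id LIMIT 10000
INSERT INTO ticket1 SELECT * FROM ticket;
```
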

| username: zhanggame1 | Original post link

I ran into the issue even on a single-machine test database, with no concurrent reads or writes.

| username: jansu-dev | Original post link

Could you please share the test steps so that everyone can reproduce it?

| username: zhanggame1 | Original post link

The source table has an auto-increment primary key and approximately 1 million rows. Create a new table with create table ... like the original, then run batch on the primary key limit 10000 insert into the new table select * from the original. The new table starts empty, and the original table does not change during the test.
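Written out as statements, the steps above look like this (using the table names from the original post; the source table is assumed to already exist with ~1 million rows):

```sql
-- Sketch of the reproduction described above.
CREATE TABLE ticket1 LIKE ticket;

-- Non-transactional DML: TiDB splits the insert into batches of up to
-- 10,000 rows, sharded on the auto-increment column ticket_id.
BATCH ON ticket_id LIMIT 10000
INSERT INTO ticket1 SELECT * FROM ticket;
```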

| username: Anna | Original post link

It could be a concurrency issue.

| username: zhanggame1 | Original post link

There is no concurrency; I am the only one using the test database.

| username: ljluestc | Original post link

In a TiDB cluster with TLS encryption enabled, three files are used for secure communication: the CA certificate, the client certificate, and the client private key.

CA Certificate:
The CA certificate (ca.crt) is the root certificate used to sign and verify other certificates within the cluster. It is used to establish trust between different components of the cluster. All components (TiDB, TiKV, PD) require the CA certificate to verify the authenticity of other certificates.

Client Certificate:
The client certificate (client.crt) is used by client applications (such as the TiDB client) to authenticate themselves to the TiDB cluster. When a client connects to the cluster, it provides its client certificate, which is verified against the CA certificate to ensure the client’s identity.

Client Private Key:
The client private key (client.pem) corresponds to the client certificate. It is used to sign messages and establish a secure connection with the TiDB cluster.

When configuring the cert-allowed-cn option for TiKV and PD, you need to specify the Common Name (CN) of the client certificates allowed to connect to TiKV and PD. The CN is a field in the client certificate that identifies the entity to which the certificate was issued. You can configure cert-allowed-cn with the CN values of the client certificates that should be allowed to access the respective components.

For example, if the CN of the TiKV client certificate is set to “tikv-client” and the CN of the PD client certificate is set to “pd-client,” you can configure cert-allowed-cn as follows:

For TiKV: cert-allowed-cn = ["tikv-client"]
For PD: cert-allowed-cn = ["pd-client"]

This ensures that only clients providing the specified CN in their certificates are allowed to connect to the TiKV and PD components, respectively.

Be sure to adjust the CN values and configuration according to your actual setup and certificate configuration.
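As a sketch, the setting above lives in the component's security section; the CN value here is the hypothetical "tikv-client" from the example above, the file paths are placeholders, and PD has a similar [security] section of its own:

```toml
# tikv.toml -- security section with the certificate paths and the
# allowed client Common Names (paths are placeholders).
[security]
ca-path = "/path/to/ca.crt"
cert-path = "/path/to/server.crt"
key-path = "/path/to/server.key"
cert-allowed-cn = ["tikv-client"]
```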

| username: zhanggame1 | Original post link

You answered the question incorrectly.

| username: xfworld | Original post link

Does limit 10000 mean the data is sliced into batches by primary key for insertion?
Is the primary key auto_random?

| username: zhanggame1 | Original post link

The primary key is composite, and the sharding column is an auto-increment field with no duplicates.
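For context, a minimal schema matching the shape described in this thread might look like the following. The column names other than ticket_id are hypothetical; only the key layout matters:

```sql
-- Composite primary key whose first column is auto-increment, plus a
-- separate unique index (matching the `unique_ticket` key in the error).
CREATE TABLE ticket (
    ticket_id BIGINT AUTO_INCREMENT,
    order_no  VARCHAR(64) NOT NULL,
    PRIMARY KEY (ticket_id, order_no),
    UNIQUE KEY unique_ticket (order_no)
);
```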