Data insertion into TiDB cluster takes longer and longer, eventually causing database connection interruption and program crash

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 数据插入tidb集群耗时越来越长,最终导致数据库连接中断,程序崩溃

| username: TiDBer_Xy7fsN7j

The data insertion into the TiDB cluster is taking longer and longer, eventually leading to database connection interruptions and program crashes.
Currently, the cluster configuration uses 3 virtual machines, each with 8-core CPU, 16GB memory, and 100GB disk. The data being written is 5000 rows per write, 10 times per second.
Monitoring shows that the longest insertion time has reached 10 minutes:


Some error logs were queried:

| username: Jellybean | Original post link

Analyze the execution process of slow queries on the TiDB Dashboard, first identify which step is the slowest, and then proceed with the corresponding analysis and handling.

| username: 托马斯滑板鞋 | Original post link

:joy: On the CPU and storage monitoring graph, does each insert follow a single value or 5000 values?

| username: forever | Original post link

Please also share the table structure, it might be a hotspot.

| username: TiDBer_Xy7fsN7j | Original post link

Each insert is followed by only 5000 values.

| username: 托马斯滑板鞋 | Original post link

Does the table have a primary key, and are the values continuous? For example, (1, xx), (2, xx) like this.

| username: TiDBer_Xy7fsN7j | Original post link

CREATE TABLE GponPmOnuLocalInfos (
node_id_str varchar(100) NOT NULL COMMENT ‘Device hostname, unique identifier in the network’,
subscription_id_str varchar(100) NOT NULL COMMENT ‘Subscription name’,
collection_id bigint(20) unsigned NOT NULL COMMENT ‘Identifier for sampling round’,
collection_start_time datetime DEFAULT NULL COMMENT ‘Start time of the sampling round’,
name varchar(100) DEFAULT NULL COMMENT ‘//ONU name, format: v_ani.f.s.p.onuid’,
channel varchar(100) DEFAULT NULL,
index int(10) unsigned DEFAULT NULL,
olt_rx_power int(10) DEFAULT NULL COMMENT ‘//Optical power received by OLT from ONU, unit: 0.01dBm’,
online_duration int(10) unsigned DEFAULT NULL COMMENT ‘//Online duration, unit: seconds’,
last_down_time int(10) unsigned DEFAULT NULL COMMENT ‘//Last offline time’,
last_down_cause int(10) unsigned DEFAULT NULL COMMENT ‘//Last offline cause’,
onu_status int(10) unsigned DEFAULT NULL COMMENT ‘//ONU status, 1-online, 2-offline, 255-invalid’,
PRIMARY KEY (node_id_str,subscription_id_str,collection_id)
) SHARD_ROW_ID_BITS = 4 PRE_SPLIT_REGIONS=3;

| username: TiDBer_Xy7fsN7j | Original post link

CPU/Memory Monitoring:

| username: 托马斯滑板鞋 | Original post link

:joy: Do you have TiDB, PD, and TiKV on all three nodes, and the TiDB server for insert operations is only connected to the blue one?

| username: TiDBer_Xy7fsN7j | Original post link

Here is the table structure with the primary key set:

CREATE TABLE `GponPmOnuLocalInfos` (
  `node_id_str` varchar(100) NOT NULL COMMENT 'Device hostname, unique identifier for the device in the network',
  `subscription_id_str` varchar(100) NOT NULL COMMENT 'Subscription name',
  `collection_id` bigint(20) unsigned NOT NULL COMMENT 'Identifies the sampling round',
  `collection_start_time` datetime DEFAULT NULL COMMENT 'Identifies the start time of the sampling round',
  `name` varchar(100) DEFAULT NULL COMMENT 'ONU name, format: v_ani.f.s.p.onuid',
  `channel` varchar(100) DEFAULT NULL,
  `index` int(10) unsigned DEFAULT NULL,
  `olt_rx_power` int(10) DEFAULT NULL COMMENT 'Optical power received by the OLT optical module from the ONU, unit: 0.01dBm',
  `online_duration` int(10) unsigned DEFAULT NULL COMMENT 'Online duration, unit: seconds',
  `last_down_time` int(10) unsigned DEFAULT NULL COMMENT 'Last offline time',
  `last_down_cause` int(10) unsigned DEFAULT NULL COMMENT 'Last offline cause',
  `onu_status` int(10) unsigned DEFAULT NULL COMMENT 'ONU status, 1-online, 2-offline, 255-invalid',
  PRIMARY KEY (`node_id_str`,`subscription_id_str`,`collection_id`)
) SHARD_ROW_ID_BITS = 4 PRE_SPLIT_REGIONS=3;
| username: 托马斯滑板鞋 | Original post link

Monitor the upper and lower disks, or restart the insert program, then use dstat 1 or vmstat 1 to output the host status, and take a screenshot.

| username: TiDBer_Xy7fsN7j | Original post link

All three TiDBs have connections, and each has exceeded 10 minutes.

| username: zhanggame1 | Original post link

It’s not a clustered table, right? Try changing it to a clustered table and see if it works.

| username: TiDBer_Xy7fsN7j | Original post link

Take a look at the disk monitoring:

| username: 托马斯滑板鞋 | Original post link

:joy: Why are there still intermittent lines? Did TiKV crash?

| username: TiDBer_Xy7fsN7j | Original post link

There have been crashes after around 9 PM.

| username: 托马斯滑板鞋 | Original post link

What is the storage? SSD? Mechanical? I still suggest running the insert again and taking a screenshot of the system status (dstat 1 or vmstat 1), or checking the disk monitoring in Prometheus.

| username: TiDBer_Xy7fsN7j | Original post link

Take a look at the table structure, how should it be modified:

CREATE TABLE `GponPmOnuLocalInfos` (
  `node_id_str` varchar(100) NOT NULL COMMENT 'Device hostname, unique identifier for the device in the network',
  `subscription_id_str` varchar(100) NOT NULL COMMENT 'Subscription name',
  `collection_id` bigint(20) unsigned NOT NULL COMMENT 'Identifies the sampling round',
  `collection_start_time` datetime DEFAULT NULL COMMENT 'Identifies the start time of the sampling round',
  `name` varchar(100) DEFAULT NULL COMMENT 'ONU name, format: v_ani.f.s.p.onuid',
  `channel` varchar(100) DEFAULT NULL,
  `index` int(10) unsigned DEFAULT NULL,
  `olt_rx_power` int(10) DEFAULT NULL COMMENT 'Optical power received by the OLT optical module from the ONU, unit: 0.01dBm',
  `online_duration` int(10) unsigned DEFAULT NULL COMMENT 'Online duration, unit: seconds',
  `last_down_time` int(10) unsigned DEFAULT NULL COMMENT 'Last offline time',
  `last_down_cause` int(10) unsigned DEFAULT NULL COMMENT 'Last offline reason',
  `onu_status` int(10) unsigned DEFAULT NULL COMMENT 'ONU status, 1-online, 2-offline, 255-invalid',
  PRIMARY KEY (`node_id_str`,`subscription_id_str`,`collection_id`)
) SHARD_ROW_ID_BITS = 4 PRE_SPLIT_REGIONS=3;
| username: 托马斯滑板鞋 | Original post link

Add the keyword after the primary key: CLUSTERED

| username: 像风一样的男子 | Original post link

You didn’t set AUTO_RANDOM when creating the table, so there is a hotspot issue with the writes.