Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: 数据插入tidb集群耗时越来越长,最终导致数据库连接中断,程序崩溃
The data insertion into the TiDB cluster is taking longer and longer, eventually leading to database connection interruptions and program crashes.
Currently, the cluster configuration uses 3 virtual machines, each with 8-core CPU, 16GB memory, and 100GB disk. The data being written is 5000 rows per write, 10 times per second.
Monitoring shows that the longest insertion time has reached 10 minutes:
Some error logs were queried:
Analyze the execution process of slow queries on the TiDB Dashboard, first identify which step is the slowest, and then proceed with the corresponding analysis and handling.
On the CPU and storage monitoring graph, does each insert follow a single value or 5000 values?
Please also share the table structure, it might be a hotspot.
Each insert is followed by only 5000 values.
Does the table have a primary key, and are the values continuous? For example, (1, xx), (2, xx) like this.
CREATE TABLE GponPmOnuLocalInfos
(
node_id_str
varchar(100) NOT NULL COMMENT ‘Device hostname, unique identifier in the network’,
subscription_id_str
varchar(100) NOT NULL COMMENT ‘Subscription name’,
collection_id
bigint(20) unsigned NOT NULL COMMENT ‘Identifier for sampling round’,
collection_start_time
datetime DEFAULT NULL COMMENT ‘Start time of the sampling round’,
name
varchar(100) DEFAULT NULL COMMENT ‘//ONU name, format: v_ani.f.s.p.onuid’,
channel
varchar(100) DEFAULT NULL,
index
int(10) unsigned DEFAULT NULL,
olt_rx_power
int(10) DEFAULT NULL COMMENT ‘//Optical power received by OLT from ONU, unit: 0.01dBm’,
online_duration
int(10) unsigned DEFAULT NULL COMMENT ‘//Online duration, unit: seconds’,
last_down_time
int(10) unsigned DEFAULT NULL COMMENT ‘//Last offline time’,
last_down_cause
int(10) unsigned DEFAULT NULL COMMENT ‘//Last offline cause’,
onu_status
int(10) unsigned DEFAULT NULL COMMENT ‘//ONU status, 1-online, 2-offline, 255-invalid’,
PRIMARY KEY (node_id_str
,subscription_id_str
,collection_id
)
) SHARD_ROW_ID_BITS = 4 PRE_SPLIT_REGIONS=3;
Do you have TiDB, PD, and TiKV on all three nodes, and the TiDB server for insert operations is only connected to the blue one?
Here is the table structure with the primary key set:
CREATE TABLE `GponPmOnuLocalInfos` (
`node_id_str` varchar(100) NOT NULL COMMENT 'Device hostname, unique identifier for the device in the network',
`subscription_id_str` varchar(100) NOT NULL COMMENT 'Subscription name',
`collection_id` bigint(20) unsigned NOT NULL COMMENT 'Identifies the sampling round',
`collection_start_time` datetime DEFAULT NULL COMMENT 'Identifies the start time of the sampling round',
`name` varchar(100) DEFAULT NULL COMMENT 'ONU name, format: v_ani.f.s.p.onuid',
`channel` varchar(100) DEFAULT NULL,
`index` int(10) unsigned DEFAULT NULL,
`olt_rx_power` int(10) DEFAULT NULL COMMENT 'Optical power received by the OLT optical module from the ONU, unit: 0.01dBm',
`online_duration` int(10) unsigned DEFAULT NULL COMMENT 'Online duration, unit: seconds',
`last_down_time` int(10) unsigned DEFAULT NULL COMMENT 'Last offline time',
`last_down_cause` int(10) unsigned DEFAULT NULL COMMENT 'Last offline cause',
`onu_status` int(10) unsigned DEFAULT NULL COMMENT 'ONU status, 1-online, 2-offline, 255-invalid',
PRIMARY KEY (`node_id_str`,`subscription_id_str`,`collection_id`)
) SHARD_ROW_ID_BITS = 4 PRE_SPLIT_REGIONS=3;
Monitor the upper and lower disks, or restart the insert program, then use dstat 1 or vmstat 1 to output the host status, and take a screenshot.
All three TiDBs have connections, and each has exceeded 10 minutes.
It’s not a clustered table, right? Try changing it to a clustered table and see if it works.
Take a look at the disk monitoring:
Why are there still intermittent lines? Did TiKV crash?
There have been crashes after around 9 PM.
What is the storage? SSD? Mechanical? I still suggest running the insert again and taking a screenshot of the system status (dstat 1 or vmstat 1), or checking the disk monitoring in Prometheus.
Take a look at the table structure, how should it be modified:
CREATE TABLE `GponPmOnuLocalInfos` (
`node_id_str` varchar(100) NOT NULL COMMENT 'Device hostname, unique identifier for the device in the network',
`subscription_id_str` varchar(100) NOT NULL COMMENT 'Subscription name',
`collection_id` bigint(20) unsigned NOT NULL COMMENT 'Identifies the sampling round',
`collection_start_time` datetime DEFAULT NULL COMMENT 'Identifies the start time of the sampling round',
`name` varchar(100) DEFAULT NULL COMMENT 'ONU name, format: v_ani.f.s.p.onuid',
`channel` varchar(100) DEFAULT NULL,
`index` int(10) unsigned DEFAULT NULL,
`olt_rx_power` int(10) DEFAULT NULL COMMENT 'Optical power received by the OLT optical module from the ONU, unit: 0.01dBm',
`online_duration` int(10) unsigned DEFAULT NULL COMMENT 'Online duration, unit: seconds',
`last_down_time` int(10) unsigned DEFAULT NULL COMMENT 'Last offline time',
`last_down_cause` int(10) unsigned DEFAULT NULL COMMENT 'Last offline reason',
`onu_status` int(10) unsigned DEFAULT NULL COMMENT 'ONU status, 1-online, 2-offline, 255-invalid',
PRIMARY KEY (`node_id_str`,`subscription_id_str`,`collection_id`)
) SHARD_ROW_ID_BITS = 4 PRE_SPLIT_REGIONS=3;
Add the keyword after the primary key: CLUSTERED
You didn’t set AUTO_RANDOM when creating the table, so there is a hotspot issue with the writes.