Update Slowing Down, WHERE Condition Includes Primary Key and Others

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: update 更新变慢,where条件是主键+其他

| username: robert233

[TiDB Usage Environment]
Production environment, version v5.1.4

[Overview] Scenario + Problem Overview

  1. The business side does dual writes, starting on July 11; the average client RT was 15 ms.
  2. At 20:00 on the evening of July 16, the average client RT started to soar, peaking at 340 ms; consumption slowed down and a backlog built up.
  3. From the client’s perspective, all clients slowed down, but there was no change in TiDB QPS on the server side.

[Background] Operations performed

  1. None

[Phenomenon] Business and database phenomena

  1. The table uses range partitioning, id is AUTO_INCREMENT, it is a nonclustered-index table, and SHARD_ROW_ID_BITS is used to scatter the row IDs. The table structure is as follows (other tables are similar):

    CREATE TABLE `Music_GorillaGatewayRecord` (
    `id` bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'Primary key id',
    `processType` varchar(32) NOT NULL COMMENT '',
    `appName` varchar(64) NOT NULL COMMENT '',
    `sdkVersion` varchar(32) NOT NULL COMMENT '',
    `businessType` varchar(32) NOT NULL COMMENT '',
    `ip` varchar(64) DEFAULT NULL COMMENT '',
    `status` int(4) NOT NULL COMMENT '',
    `result` text DEFAULT NULL COMMENT '',
    `callback` text DEFAULT NULL COMMENT '',
    `batch` tinyint(2) NOT NULL DEFAULT '0' COMMENT '',
    `count` int(4) NOT NULL DEFAULT '1' COMMENT '',
    `retry` int(4) NOT NULL DEFAULT '0' COMMENT '',
    `callback_Encrypt2013` longtext DEFAULT NULL COMMENT '',
    `createTime` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT '',
    `updateTime` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '',
    PRIMARY KEY (`id`,`createTime`) /*T![clustered_index] NONCLUSTERED */,
    KEY `idx_appName_bType_status` (`appName`,`businessType`,`status`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin /*!90000 SHARD_ROW_ID_BITS=5 */ COMMENT=''
    PARTITION BY RANGE ( TO_DAYS(`createTime`) ) (
    PARTITION `p20220501` VALUES LESS THAN (738641),
    PARTITION `p20220601` VALUES LESS THAN (738672),
    PARTITION `p20220701` VALUES LESS THAN (738702),
    PARTITION `p20220801` VALUES LESS THAN (738733),
    PARTITION `p20220901` VALUES LESS THAN (738764),
    PARTITION `p20221001` VALUES LESS THAN (738794),
    PARTITION `p20221101` VALUES LESS THAN (738825),
    PARTITION `p20221201` VALUES LESS THAN (738855),
    PARTITION `p20230101` VALUES LESS THAN (738886));
    
  2. Dashboard and slow log (100 ms threshold):



  3. Some monitoring charts of the server and client:
    TiDB Server monitoring [2022-07-16 12:00:00-2022-07-17 20:00:00]:

    Client monitoring [2022-07-16 12:00:00-2022-07-17 21:00:00]:

In conclusion, I would like to ask the community experts for troubleshooting ideas.

| username: ablewang_xiaobo | Original post link

I previously tested an update in TiDB and it took around 100 ms. An update by primary key like yours should come back almost instantly.

| username: ablewang_xiaobo | Original post link

Is it possible that the analyze during this time period affected this update? Or was the overall system load relatively high at that time, causing the response time of all queries to be relatively long?
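
A hedged way to check that is the analyze job history on the TiDB side, for example:

    -- Hedged sketch: list recent (auto) analyze jobs and their start times, to compare
    -- against the 2022-07-16 20:00 slowdown window.
    SHOW ANALYZE STATUS;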

| username: robert233 | Original post link

The cluster load is not a significant issue, and the analyze window does not overlap with the period when business access was slow.
7 TiDB + 12 TiKV, on 48-core / 256 GB physical machines; TiKV runs multiple instances per machine with disk isolation.

| username: ddhe9527 | Original post link

For the update that takes about 6 seconds, can you take a look and see which part is taking the time?
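
One hedged way to answer this from the server side (the time window and filter below are illustrative) is the slow-query table, whose fields mirror the slow log and split a statement's time into coprocessor, prewrite, commit, and backoff:

    -- Hedged sketch: find the slow updates in the incident window and see where the
    -- time went; column names follow the TiDB slow log fields.
    SELECT time, query_time, cop_time, process_time, prewrite_time, commit_time, backoff_time, query
    FROM information_schema.cluster_slow_query
    WHERE time BETWEEN '2022-07-16 20:00:00' AND '2022-07-16 21:00:00'
      AND query LIKE '%Music_GorillaGatewayRecord%'
    ORDER BY query_time DESC
    LIMIT 10;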

| username: ablewang_xiaobo | Original post link

There is no need for so many TiDB instances, 2-3 are enough.

| username: songxuecheng | Original post link

It looks like there is a backoff. You can check the monitoring and logs at that time to see if there is a lock situation.
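
Since the cluster is on v5.1, the Lock View tables can also be queried directly while the backlog is happening; a minimal sketch (not from the original thread):

    -- Hedged sketch: live lock waits and recent deadlock history (Lock View, v5.1+).
    SELECT * FROM information_schema.data_lock_waits;  -- transactions currently blocked on a key
    SELECT * FROM information_schema.deadlocks;        -- recent deadlock records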

| username: ablewang_xiaobo | Original post link

Alternatively, you can use the trace command to check. The usage is as follows:
    trace format='row' select * from mysql.user;
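
Applied to the statement in question, this would look roughly like the following (all values are made up):

    -- Hedged example: trace the slow UPDATE itself; note that TRACE actually
    -- executes the statement it traces.
    trace format='row'
    update Music_GorillaGatewayRecord
       set status = 1
     where id = 123456789 and createTime = '2022-07-16 20:00:00';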

| username: robert233 | Original post link

There is slowness in both the Coprocessor and the prewrite phase, and most of the time is spent in the Coprocessor.

| username: robert233 | Original post link

From the monitoring, the pressure on TiDB was quite high, but it improved significantly after scaling out and tuning the JDBC parameters.
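
For reference, the Connector/J settings usually tuned for this kind of workload are along these lines (a hedged example, not the poster's actual parameters; host and database are placeholders):

    # Hedged example of a JDBC URL; all values are placeholders.
    jdbc:mysql://tidb-host:4000/your_db?useServerPrepStmts=true&cachePrepStmts=true&prepStmtCacheSqlLimit=2048&prepStmtCacheSize=1000&rewriteBatchedStatements=true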

| username: ddhe9527 | Original post link

Check the heatmap on the Dashboard to see if there are any hotspot regions.
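
Besides the heatmap, the hot regions can also be listed with SQL; a minimal sketch (assuming the table name alone is enough to filter on):

    -- Hedged sketch: current hot regions of this table; TYPE is 'read' or 'write',
    -- and a non-empty INDEX_NAME points at an index hotspot rather than a row-data one.
    SELECT db_name, table_name, index_name, type, flow_bytes, max_hot_degree
    FROM information_schema.tidb_hot_regions
    WHERE table_name = 'Music_GorillaGatewayRecord';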

| username: robert233 | Original post link

In the partition table design, the composite primary key consists of id and createTime, and SHARD_ROW_ID_BITS is used to scatter the row IDs. Does the heatmap show that the primary key is a continuous write hotspot?
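
For context: on a nonclustered table, SHARD_ROW_ID_BITS only scatters the hidden _tidb_rowid under which the row data is stored; the PRIMARY KEY (id, createTime) is kept as a separate index whose keys grow with the auto-increment id, so writes to that index can still be sequential. A hedged way to compare the two (partition name taken from the DDL above; the syntax may need adjusting on this version):

    -- Hedged sketch: row-data regions vs. PRIMARY KEY index regions of one partition.
    SHOW TABLE Music_GorillaGatewayRecord PARTITION (p20220701) REGIONS;
    SHOW TABLE Music_GorillaGatewayRecord PARTITION (p20220701) INDEX `PRIMARY` REGIONS;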

| username: ddhe9527 | Original post link

Looking at that diagonal line, there does seem to be a primary-key hotspot. From the graph the traffic isn't high; check a few more points to confirm. If the traffic really isn't high, then it is likely the QPS that is high.

| username: robert233 | Original post link

I checked a few buckets; the peak traffic is about 150 MB.

| username: robert233 | Original post link

The diagonal lines on the heatmap are basically all index hotspots.

| username: robert233 | Original post link

KV errors

| username: ddhe9527 | Original post link

You can check if there is a correlation between the client’s RT fluctuations and the heatmap traffic values on the timeline.

| username: songxuecheng | Original post link

Refer to this, I feel like your issue is caused by lock conflicts.

| username: tidb狂热爱好者 | Original post link

It's probably that the statistics are inaccurate.
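
If stale statistics are suspected, a minimal check-and-refresh sketch (ANALYZE on a large partitioned table is expensive, so run it off-peak):

    -- Hedged sketch: a low Healthy value suggests outdated statistics.
    SHOW STATS_HEALTHY WHERE Table_name = 'Music_GorillaGatewayRecord';
    ANALYZE TABLE Music_GorillaGatewayRecord;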