Update Slowing Down, WHERE Condition Includes Primary Key and Others

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: update 更新变慢,where条件是主键+其他

| username: robert233

[TiDB Usage Environment]
Production environment, version v5.1.4

[Overview] Scenario + Problem Overview

  1. The business side does dual writes, starting on July 11; the average client RT was 15 ms.
  2. At 20:00 on the evening of July 16, the average client RT started to soar, peaking at 340 ms; consumption slowed down and a backlog built up.
  3. From the client’s perspective, all clients slowed down, but there was no change in TiDB QPS on the server side.

[Background] Operations performed

  1. None

[Phenomenon] Business and database phenomena

  1. The table uses range partitioning, id is AUTO_INCREMENT, it is a nonclustered-index table, and SHARD_ROW_ID_BITS is used to scatter the row IDs. The table structure is as follows (other tables are similar):

    CREATE TABLE `Music_GorillaGatewayRecord` (
    `id` bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'Primary key id',
    `processType` varchar(32) NOT NULL COMMENT '',
    `appName` varchar(64) NOT NULL COMMENT '',
    `sdkVersion` varchar(32) NOT NULL COMMENT '',
    `businessType` varchar(32) NOT NULL COMMENT '',
    `ip` varchar(64) DEFAULT NULL COMMENT '',
    `status` int(4) NOT NULL COMMENT '',
    `result` text DEFAULT NULL COMMENT '',
    `callback` text DEFAULT NULL COMMENT '',
    `batch` tinyint(2) NOT NULL DEFAULT '0' COMMENT '',
    `count` int(4) NOT NULL DEFAULT '1' COMMENT '',
    `retry` int(4) NOT NULL DEFAULT '0' COMMENT '',
    `callback_Encrypt2013` longtext DEFAULT NULL COMMENT '',
    `createTime` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT '',
    `updateTime` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '',
    PRIMARY KEY (`id`,`createTime`) /*T![clustered_index] NONCLUSTERED */,
    KEY `idx_appName_bType_status` (`appName`,`businessType`,`status`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin /*!90000 SHARD_ROW_ID_BITS=5 */ COMMENT=''
    PARTITION BY RANGE ( TO_DAYS(`createTime`) ) (
    PARTITION `p20220501` VALUES LESS THAN (738641),
    PARTITION `p20220601` VALUES LESS THAN (738672),
    PARTITION `p20220701` VALUES LESS THAN (738702),
    PARTITION `p20220801` VALUES LESS THAN (738733),
    PARTITION `p20220901` VALUES LESS THAN (738764),
    PARTITION `p20221001` VALUES LESS THAN (738794),
    PARTITION `p20221101` VALUES LESS THAN (738825),
    PARTITION `p20221201` VALUES LESS THAN (738855),
    PARTITION `p20230101` VALUES LESS THAN (738886));
    
  2. Dashboard and slow log (100 ms threshold):



  3. Some monitoring charts of the server and client:
    TiDB Server monitoring [2022-07-16 12:00:00-2022-07-17 20:00:00]:

    Client monitoring [2022-07-16 12:00:00-2022-07-17 21:00:00]:

In conclusion, I would like to ask the community experts for troubleshooting ideas.

| username: ablewang_xiaobo | Original post link

I previously tested an update in TiDB and it took around 100 ms. An update by primary key like yours should come back almost instantly.

| username: ablewang_xiaobo | Original post link

Is it possible that the analyze during this time period affected this update? Or was the overall system load relatively high at that time, causing the response time of all queries to be relatively long?
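
A hedged way to check that is the analyze job history on the TiDB side, for example:

    -- Hedged sketch: list recent (auto) analyze jobs and their start times, to compare
    -- against the 2022-07-16 20:00 slowdown window.
    SHOW ANALYZE STATUS;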

| username: robert233 | Original post link

The cluster load is not a significant issue, and the analyze window does not overlap with the period when business access was slow.
7 TiDB + 12 TiKV, on 48-core / 256 GB physical machines; TiKV runs multiple instances per machine with disk isolation.

| username: ddhe9527 | Original post link

For the update that takes about 6 seconds, can you take a look and see which part is taking the time?
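
One hedged way to answer this from the server side (the time window and filter below are illustrative) is the slow-query table, whose fields mirror the slow log and split a statement's time into coprocessor, prewrite, commit, and backoff:

    -- Hedged sketch: find the slow updates in the incident window and see where the
    -- time went; column names follow the TiDB slow log fields.
    SELECT time, query_time, cop_time, process_time, prewrite_time, commit_time, backoff_time, query
    FROM information_schema.cluster_slow_query
    WHERE time BETWEEN '2022-07-16 20:00:00' AND '2022-07-16 21:00:00'
      AND query LIKE '%Music_GorillaGatewayRecord%'
    ORDER BY query_time DESC
    LIMIT 10;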

| username: ablewang_xiaobo | Original post link

There is no need for so many TiDB instances, 2-3 are enough.

| username: songxuecheng | Original post link

It looks like there is a backoff. You can check the monitoring and logs at that time to see if there is a lock situation.
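
Since the cluster is on v5.1, the Lock View tables can also be queried directly while the backlog is happening; a minimal sketch (not from the original thread):

    -- Hedged sketch: live lock waits and recent deadlock history (Lock View, v5.1+).
    SELECT * FROM information_schema.data_lock_waits;  -- transactions currently blocked on a key
    SELECT * FROM information_schema.deadlocks;        -- recent deadlock records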

| username: ablewang_xiaobo | Original post link

Alternatively, you can use the trace command to check. The usage is as follows:
    trace format='row' select * from mysql.user;
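
Applied to the statement in question, this would look roughly like the following (all values are made up):

    -- Hedged example: trace the slow UPDATE itself; note that TRACE actually
    -- executes the statement it traces.
    trace format='row'
    update Music_GorillaGatewayRecord
       set status = 1
     where id = 123456789 and createTime = '2022-07-16 20:00:00';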

| username: robert233 | Original post link

There is slowness in both the Coprocessor and the prewrite phase, and most of the time is spent in the Coprocessor.

| username: robert233 | Original post link

From the monitoring, the pressure on TiDB was quite high, but it improved significantly after scaling out and tuning the JDBC parameters.
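
For reference, the Connector/J settings usually tuned for this kind of workload are along these lines (a hedged example, not the poster's actual parameters; host and database are placeholders):

    # Hedged example of a JDBC URL; all values are placeholders.
    jdbc:mysql://tidb-host:4000/your_db?useServerPrepStmts=true&cachePrepStmts=true&prepStmtCacheSqlLimit=2048&prepStmtCacheSize=1000&rewriteBatchedStatements=true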

| username: ddhe9527 | Original post link

Check the heatmap on the Dashboard to see if there are any hotspot regions.
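
Besides the heatmap, the hot regions can also be listed with SQL; a minimal sketch (assuming the table name alone is enough to filter on):

    -- Hedged sketch: current hot regions of this table; TYPE is 'read' or 'write',
    -- and a non-empty INDEX_NAME points at an index hotspot rather than a row-data one.
    SELECT db_name, table_name, index_name, type, flow_bytes, max_hot_degree
    FROM information_schema.tidb_hot_regions
    WHERE table_name = 'Music_GorillaGatewayRecord';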

| username: robert233 | Original post link

In the partition table design, the composite primary key consists of id and createTime, and SHARD_ROW_ID_BITS is used to scatter the row IDs. Does the heatmap show that the primary key is a continuous write hotspot?
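
For context: on a nonclustered table, SHARD_ROW_ID_BITS only scatters the hidden _tidb_rowid under which the row data is stored; the PRIMARY KEY (id, createTime) is kept as a separate index whose keys grow with the auto-increment id, so writes to that index can still be sequential. A hedged way to compare the two (partition name taken from the DDL above; the syntax may need adjusting on this version):

    -- Hedged sketch: row-data regions vs. PRIMARY KEY index regions of one partition.
    SHOW TABLE Music_GorillaGatewayRecord PARTITION (p20220701) REGIONS;
    SHOW TABLE Music_GorillaGatewayRecord PARTITION (p20220701) INDEX `PRIMARY` REGIONS;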

| username: ddhe9527 | Original post link

Looking at that diagonal line, there does seem to be a primary-key hotspot. From the graph the traffic isn't high; check a few more points to confirm. If the traffic really isn't high, then it is likely the QPS that is high.

| username: robert233 | Original post link

I checked a few buckets; the peak traffic is about 150 MB.

| username: robert233 | Original post link

The diagonal lines on the heatmap are basically all index hotspots.

| username: robert233 | Original post link

KV errors

| username: ddhe9527 | Original post link

You can check if there is a correlation between the client’s RT fluctuations and the heatmap traffic values on the timeline.

| username: songxuecheng | Original post link

Refer to this, I feel like your issue is caused by lock conflicts.

| username: tidb狂热爱好者 | Original post link

It's probably that the statistics are inaccurate.
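
If stale statistics are suspected, a minimal check-and-refresh sketch (ANALYZE on a large partitioned table is expensive, so run it off-peak):

    -- Hedged sketch: a low Healthy value suggests outdated statistics.
    SHOW STATS_HEALTHY WHERE Table_name = 'Music_GorillaGatewayRecord';
    ANALYZE TABLE Music_GorillaGatewayRecord;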