Read and Write Performance Degradation

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 读写时间变差

| username: 月明星稀

After running for a period of time, TiKV read and write latency suddenly deteriorates to about 150ms, even for calls on the same machine. After restarting TiKV, the response time recovers to around 10ms. Could the experts please share some troubleshooting ideas? To start, here are the logs from around the point when the performance begins to degrade.

Logs:

[2023/11/29 03:48:07.487 +08:00] [INFO] [apply.rs:1612] ["execute admin command"] [command="cmd_type: ChangePeerV2 change_peer_v2 { changes { change_type: AddLearnerNode peer { id: 1110 store_id: 4 role: Learner } } }"] [index=505260] [term=7] [peer_id=1105] [region_id=1102]
[2023/11/29 03:48:07.487 +08:00] [INFO] [apply.rs:2204] ["exec ConfChangeV2"] [epoch="conf_ver: 71 version: 3"] [kind=Simple] [peer_id=1105] [region_id=1102]
[2023/11/29 03:48:07.487 +08:00] [INFO] [apply.rs:2385] ["conf change successfully"] ["current region"="id: 1102 start_key: 780000006D2F6563FF612D746573742F6DFF6F6E69746F725F67FF6C6F62616C2F3132FF302E3233322E3937FF2E3132315F737461FF7475730000000000FA end_key: 78000000732F6563FF612D746573742F6DFF6F6E69746F725F67FF6C6F62616C2F3132FF302E3233322E3937FF2E3132315F737461FF7475730000000000FA region_epoch { conf_ver: 72 version: 3 } peers { id: 1103 store_id: 5 } peers { id: 1104 store_id: 9 } peers { id: 1105 store_id: 6 } peers { id: 1110 store_id: 4 role: Learner }"] ["original region"="id: 1102 start_key: 780000006D2F6563FF612D746573742F6DFF6F6E69746F725F67FF6C6F62616C2F3132FF302E3233322E3937FF2E3132315F737461FF7475730000000000FA end_key: 78000000732F6563FF612D746573742F6DFF6F6E69746F725F67FF6C6F62616C2F3132FF302E3233322E3937FF2E3132315F737461FF7475730000000000FA region_epoch { conf_ver: 71 version: 3 } peers { id: 1103 store_id: 5 } peers { id: 1104 store_id: 9 } peers { id: 1105 store_id: 6 }"] [changes="[change_type: AddLearnerNode peer { id: 1110 store_id: 4 role: Learner }]"] [peer_id=1105] [region_id=1102]
[2023/11/29 03:48:07.488 +08:00] [INFO] [raft.rs:2646] ["switched to configuration"] [config="Configuration { voters: Configuration { incoming: Configuration { voters: {1104, 1105, 1103} }, outgoing: Configuration { voters: {} } }, learners: {1110}, learners_next: {}, auto_leave: false }"] [raft_id=1105] [region_id=1102]
[2023/11/29 03:48:09.299 +08:00] [INFO] [apply.rs:1612] ["execute admin command"] [command="cmd_type: ChangePeerV2 change_peer_v2 { changes { peer { id: 1110 store_id: 4 } } changes { change_type: AddLearnerNode peer { id: 1103 store_id: 5 role: Learner } } }"] [index=505267] [term=7] [peer_id=1105] [region_id=1102]
| username: 月明星稀 | Original post link

It suddenly becomes worse after running for about 2 days.

| username: Billmay表妹 | Original post link

When the read and write response time of TiKV suddenly deteriorates, you can follow the troubleshooting steps below:

  1. Check TiKV logs: Review the TiKV logs, especially around the time when the read and write response time worsened. The logs may contain some errors or abnormal information that can help identify the root cause of the problem. You can use the command tail -n 1000 <tikv_log_file> to view the latest logs.

  2. Check hardware resources: Examine the hardware resource usage of the server where TiKV is located, including CPU, memory, disk, and network. Ensure that resource usage has not reached a bottleneck, such as high CPU usage or insufficient memory. You can use system monitoring tools (like top, htop, etc.) to check resource usage.

  3. Check TiKV configuration: Review the TiKV configuration file to ensure that the configuration parameters are set reasonably. Pay special attention to performance-related parameters, such as raftstore.store-pool-size, raftstore.apply-pool-size, rocksdb.max-background-jobs, etc. You can refer to the official TiKV documentation [1] to understand the meaning and recommended values of these parameters.

  4. Check TiKV’s storage engine: If you are using TiKV’s default storage engine RocksDB, check whether the RocksDB configuration parameters are reasonable. Pay special attention to performance-related parameters, such as rocksdb.write-buffer-size, rocksdb.max-write-buffer-number, rocksdb.max-background-compactions, etc. You can refer to the official RocksDB documentation [2] to understand the meaning and recommended values of these parameters.

  5. Check the status of the TiKV cluster: Use TiUP or PD-CTL tools to check the status of the TiKV cluster, ensuring that all nodes in the cluster are running normally and that there are no abnormal Region distributions or Leader distributions. You can use the commands tiup ctl:v5.1.1 pd -u <pd_address> store and tiup ctl:v5.1.1 pd -u <pd_address> region to view the status information of the TiKV cluster.

  6. Check network connections: Verify whether the network connections between TiKV nodes are normal. You can use the ping command or other network diagnostic tools to test the connectivity and latency between nodes. Ensure that the network connection is stable and that there are no packet losses or excessively high latency.
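The log-scanning part of step 1 can be sketched as follows. The file path and the log entries below are synthetic placeholders (the real log path depends on your deployment, typically under the deploy directory's log/ subfolder), so the commands can be demonstrated end to end:

```shell
# Step 1 sketch: pull WARN/ERROR entries from the TiKV log around the time
# the latency degraded. LOG and the sample entries are placeholders, not
# real TiKV output.
LOG=./tikv-sample.log

cat > "$LOG" <<'EOF'
[2023/11/29 03:48:07.487 +08:00] [INFO] [apply.rs:1612] ["execute admin command"]
[2023/11/29 03:50:12.001 +08:00] [WARN] [example.rs:1] ["synthetic warning entry"]
EOF

# Latest lines, as suggested above:
tail -n 1000 "$LOG" > /dev/null

# Keep only WARN/ERROR lines from the suspect window:
grep -E '\[(WARN|ERROR)\]' "$LOG"
```

With a real cluster you would point LOG at the actual TiKV log file and narrow the window with an extra time filter, e.g. grep '2023/11/29 03:4'.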

| username: 芮芮是产品 | Original post link

Take a look at the slow SQL.

| username: 春风十里 | Original post link

First, give a thumbs up to the tech newbie who doesn’t understand technology.

| username: 随缘天空 | Original post link

A response time of around 150ms should be fine, not considered slow.

| username: 月明星稀 | Original post link

Local read and write calls with very little data shouldn’t be that slow, right?

| username: 月明星稀 | Original post link

It seems not — there are no individual slow SQL queries; everything is slow across the board.

| username: 月明星稀 | Original post link

Could someone please explain what these log entries mean? Thanks!!

| username: Jellybean | Original post link

The above logs are normal region scheduling logs at the INFO level. They are probably not the cause of the issue you mentioned.

You can refer to the following ideas for analysis and first locate the problem.

| username: 月明星稀 | Original post link

Since a simple restart fixes it, could it be memory fragmentation?

| username: 月明星稀 | Original post link

Could this log be related to the issue?

| username: Jellybean | Original post link

After picking a problem time window, it is easier to locate the issue through the TiDB Dashboard and Grafana monitoring charts. Follow the troubleshooting steps outlined in the analysis article above.

Using logs for troubleshooting is not as intuitive and convenient.

| username: 随缘天空 | Original post link

That is a normal response level; 10ms is considered high performance.

| username: dba远航 | Original post link

Try increasing the TiKV memory configuration a bit.

| username: andone | Original post link

Optimize the slow SQL and take a look.

| username: 春风十里 | Original post link

Is it possible that it’s an operating system resource issue? Did the memory usage increase after 2 days?
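One quick way to answer that question is to sample the TiKV process's resident memory from /proc (Linux) over those 2 days. A minimal sketch — the process name tikv-server is the usual binary name, and the fallback to the current shell's PID is only so the commands run anywhere:

```shell
# Sketch: sample the resident set size (VmRSS) of the TiKV process.
# Run this periodically (cron or a loop) and compare values across the
# 2-day window; steady growth before the latency degrades points at a
# memory problem.
PID=$(pgrep -o tikv-server || echo $$)   # oldest tikv-server, else this shell

grep VmRSS /proc/"$PID"/status
```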

| username: Billmay表妹 | Original post link

[Resource Allocation] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page

Let’s take a look at your machine configuration. I guess the machine configuration is not high~

| username: come_true | Original post link

It feels like she is a system administrator, as she provides timely and comprehensive answers to every question on the forum.

| username: kkpeter | Original post link

Check the disk I/O.
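For checking disk I/O, iostat from sysstat is the usual tool; when it is not installed, the raw counters in /proc/diskstats work without extra packages. A sketch, assuming a Linux host:

```shell
# Sketch: read per-device I/O counters from /proc/diskstats (Linux).
# Field 3 is the device name and field 13 the total milliseconds the
# device spent doing I/O; sample twice, and a delta close to the
# sampling interval means the disk is near 100% utilized.
awk '{print $3, $13}' /proc/diskstats

# With sysstat installed, the one-shot equivalent is:
#   iostat -x 1 3    # watch %util and await for the TiKV data disk
```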