Reasons for a large number of ["kv rpc failed"] [err=RemoteStopped] logs in TiKV?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv的日志大量[“kv rpc failed”] [err=RemoteStopped]的原因?

| username: TiDBer_fancy

[TiDB Usage Environment] Production Environment
[TiDB Version] v5.3
A large number of TiKV logs show ["kv rpc failed"] [err=RemoteStopped]:

[2023/12/01 15:59:48.895 +08:00] [INFO] [kv.rs:1023] ["kv rpc failed"] [err=RemoteStopped] [request=batch_commands]
[2023/12/01 15:59:48.895 +08:00] [INFO] [kv.rs:1023] ["kv rpc failed"] [err=RemoteStopped] [request=batch_commands]
[2023/12/01 15:59:48.895 +08:00] [INFO] [kv.rs:1023] ["kv rpc failed"] [err=RemoteStopped] [request=batch_commands]
[2023/12/01 15:59:48.954 +08:00] [INFO] [kv.rs:1023] ["kv rpc failed"] [err=RemoteStopped] [request=batch_commands]
[2023/12/01 15:59:48.954 +08:00] [INFO] [kv.rs:1023] ["kv rpc failed"] [err=RemoteStopped] [request=batch_commands]
[2023/12/01 15:59:48.954 +08:00] [INFO] [kv.rs:1023] ["kv rpc failed"] [err=RemoteStopped] [request=batch_commands]
[2023/12/01 15:59:48.954 +08:00] [INFO] [kv.rs:1023] ["kv rpc failed"] [err=RemoteStopped] [request=batch_commands]
[2023/12/01 15:59:49.023 +08:00] [INFO] [kv.rs:1023] ["kv rpc failed"] [err=RemoteStopped] [request=batch_commands]
[2023/12/01 15:59:49.023 +08:00] [INFO] [kv.rs:1023] ["kv rpc failed"] [err=RemoteStopped] [request=batch_commands]

What is the reason for this?

| username: zhanggame1 | Original post link

Logs at the [INFO] level can generally be ignored.

| username: TiDBer_fancy | Original post link

But it keeps logging this, many times per second. Why does it keep appearing, and can it be suppressed?

| username: xfworld | Original post link

  1. Can you give the exact version?
  2. Are all TiKV nodes like this?
  3. What is the current cluster configuration? What is its status?
  4. What issues are affecting current usage?

| username: TiDBer_fancy | Original post link

  1. Version: v5.3.0
  2. All TiKV nodes show this: up to 10 log files per day, each containing 2,660,418 entries of this message (see the counting sketch after this list).
  3. Cluster configuration as shown in the image:
    • Memory: 256GB
    • 2 NVMe drives
  4. Not sure whether it has any impact; I just want to know why the log volume is so large and whether there are any hidden risks.
  5. The cluster is currently used mainly as the metadata store for JuiceFS, so the workload is relatively simple. Number of regions: 213,955; total TiKV QPS is around 1 million.
  6. PD follower nodes are co-located with TiKV, while the PD leader is deployed on its own 128-core host whose CPU usage is currently around 50%. Is having 7 PD nodes a bit too many?
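
For item 2, before treating this purely as noise, it can help to bucket the entries per second and compare the spikes with JuiceFS client reconnects, since err=RemoteStopped on batch_commands generally just means the client end of the gRPC stream went away. A throwaway sketch, assuming the default TiKV log timestamp format shown above (the file path is a placeholder):

```python
import re
from collections import Counter

# Count "kv rpc failed ... RemoteStopped" entries per second in one TiKV log file.
LOG_FILE = "tikv.log"  # placeholder; point this at one of the rotated log files
ts_re = re.compile(r"^\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})")

per_second = Counter()
with open(LOG_FILE, encoding="utf-8") as f:
    for line in f:
        if "kv rpc failed" in line and "RemoteStopped" in line:
            m = ts_re.match(line)
            if m:
                per_second[m.group(1)] += 1

# Busiest seconds first, to compare against client-side reconnect/disconnect logs.
for ts, count in per_second.most_common(10):
    print(ts, count)
```
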
| username: xfworld | Original post link

  1. If PD and TiKV nodes can be deployed separately, do not mix them.

  2. For the number of PD nodes and their high availability, refer to etcd's fault-tolerance model (size it for the failure scenario you want to tolerate).

  3. Check whether region replicas are evenly distributed across all TiKV nodes (a small script for this is sketched after this list).

  4. Check whether every region has reached its configured replica count (usually three replicas).

  5. There is some risk: this message describes a failed KV RPC request… (please make sure network communication between all nodes is normal).

  6. Check whether the cluster’s network bandwidth is approaching its ceiling (on the order of 10,000 Mbps)…

  7. Did this situation occur suddenly? Were there any operations performed before this happened?
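
For points 3 and 4, the per-store leader and region counts that pd-ctl shows are also available from PD's HTTP API. A minimal sketch, assuming PD serves on port 2379 and that this version's /pd/api/v1/stores response carries region_count / leader_count under status (verify the field names on your version):

```python
import json
from urllib.request import urlopen

PD = "http://127.0.0.1:2379"  # placeholder PD address; adjust for your cluster

# The endpoint behind `pd-ctl store`: one entry per TiKV store.
with urlopen(f"{PD}/pd/api/v1/stores") as resp:
    stores = json.load(resp)["stores"]

for s in stores:
    meta, status = s["store"], s["status"]
    print(
        f"store {meta['id']:>4} {meta.get('address', '?'):<21} "
        f"state={meta.get('state_name', '?'):<7} "
        f"regions={status.get('region_count', '?'):>8} "
        f"leaders={status.get('leader_count', '?'):>8}"
    )
```

Large gaps between stores would indicate the uneven distribution point 3 asks about.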

| username: TiDBer_fancy | Original post link

  1. Looking at the network traffic on the PD leader, the traffic between it and the other PD nodes is relatively high. I understand that more PD nodes means more resource consumption, but does it also significantly consume the leader’s CPU? Its CPU usage is currently around 50%; if we removed two PD nodes, would that bring it down, and how many PD nodes do most companies typically deploy? Another concern: we currently have 213,955 regions and the 128-core PD leader sits at 50% CPU. As the region count grows, does the risk on this host grow with it, and how should we optimize for that?

  2. This “kv rpc failed” issue seems to have persisted for a long time. The NIC speed is 25,000 Mb/s and the network between nodes is reachable. During this period we only performed one scale-in and one scale-out operation.

  3. The region distribution is fairly uneven. To reduce backoff, we enable PD scheduling only when QPS is not very high.

  4. How can we check whether every region has reached its configured replica count?

| username: dba远航 | Original post link

Is there a connection limit for remote machines?

| username: xfworld | Original post link

  1. The more regions there are, the more resources will be consumed:
    • CPU resources
    • Memory resources
    • Network resources
      PD has to keep processing the heartbeats for all of these regions.
  2. There is a reference document for clusters with a massive number of regions (see “Best Practices for TiKV Performance Tuning with Massive Regions” in the PingCAP docs).
  3. First, consider checking the current status of regions to determine the current replica information. This can be done through Grafana by checking the following parameters:
    • miss-peer: Regions with missing replicas
    • extra-peer: Regions with extra replicas
    • down-peer: Regions with replicas in Down status
    • pending-peer: Regions with replicas in Pending status
      You can also scan for these with pd-ctl (see “PD Control User Guide” in the PingCAP docs); a scripted version is sketched below this list.
  4. Uneven regions can lead to read/write skew, where some KV nodes are particularly busy while others are very idle.
  5. It can even lead to a large number of hotspot issues, which can be checked through the PD dashboard interface.
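
The four states above can also be pulled without Grafana: pd-ctl has `region check miss-peer` (and extra-peer / down-peer / pending-peer), which wraps PD's HTTP API. A sketch of the same scan; treat the exact /pd/api/v1/regions/check/<state> paths as an assumption to verify against your PD version:

```python
import json
from urllib.request import urlopen

PD = "http://127.0.0.1:2379"  # placeholder PD address

# Same checks as `pd-ctl region check <state>`.
for state in ("miss-peer", "extra-peer", "down-peer", "pending-peer"):
    with urlopen(f"{PD}/pd/api/v1/regions/check/{state}") as resp:
        data = json.load(resp)
    regions = data.get("regions") or []
    print(f"{state:<12} {data.get('count', len(regions)):>8} regions")
```

If miss-peer and down-peer both come back as 0, every region has its configured replica count, which answers the earlier question about verifying the replica number.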

Prioritize checking the regions and try to merge some of them to reduce the number of empty regions and the overall region count. That frees up resources and improves efficiency.

For your reference!

| username: TiDBer_fancy | Original post link

Where is this set?

| username: 随缘天空 | Original post link

Check the cluster status to see if there are any node failures.

| username: TiDBer_fancy | Original post link

Have you checked it?

| username: TiDBer_fancy | Original post link

  1. Is 213,955 regions considered a massive amount? The total data volume is under 14TB, yet the PD leader is using a full 64 cores of CPU. Is that reasonable?
  2. All region statuses are OK, and there are no obvious read/write hotspots.
  3. We still have not found the reason for the large number of “kv rpc failed” logs on all TiKV instances.

| username: xfworld | Original post link

  1. The region count is not directly a PD problem, but every region’s heartbeat is reported to PD, which has to track and schedule them globally.
  2. When the PD host’s 64 cores are fully used, is it pegged continuously or only at peaks? Judge this against the client access pattern (the monitoring graph would help; a query sketch follows this list).
  3. For the kv rpc failed messages, also check the PD logs for any other anomalies.
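
For point 2, one way to separate sustained load from short peaks is to query the cluster's Prometheus for PD process CPU over a window. A sketch assuming the default Prometheus port 9090 and the standard process_cpu_seconds_total metric with a job="pd" label (the label values depend on how the monitoring was deployed):

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

PROM = "http://127.0.0.1:9090"  # placeholder Prometheus address

# CPU cores used by each PD instance (1-minute averages) over the last hour.
query = 'rate(process_cpu_seconds_total{job="pd"}[1m])'
end = int(time.time())
params = urlencode({"query": query, "start": end - 3600, "end": end, "step": 60})

with urlopen(f"{PROM}/api/v1/query_range?{params}") as resp:
    series_list = json.load(resp)["data"]["result"]

for series in series_list:
    instance = series["metric"].get("instance", "?")
    values = [float(v) for _, v in series["values"]]
    print(f"{instance}: avg={sum(values) / len(values):.1f} cores, peak={max(values):.1f} cores")
```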

With only the information provided, I can just help you make some rough judgments and suggestions, please understand… :rofl:

| username: h5n1 | Original post link

  1. A region count of 213,955 is not considered high. Are you sure the PD process is using 64 CPUs? This doesn’t seem normal. You can check the PD monitoring for CPU usage.
  2. If the PD leader’s CPU usage is high, that is normal consumption. In v5.3 there is a tidb_enable_tso_follower_proxy variable that lets PD followers proxy TSO requests, which can reduce the PD leader’s CPU usage at the cost of some extra latency (a usage sketch follows this list).
  3. I see that many people use JuiceFS with raw KV, but you also have a TiDB node. Are you using transaction mode? Which metric are you using to measure TiKV QPS? From your description, it seems like you are using raw KV.
  4. Version 5.3.0 is the initial release of the 5.3 series; upgrading to the newer 5.3.4 patch release is recommended.
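
If the PD leader's TSO handling ever does become the bottleneck for traffic that goes through TiDB, the variable from point 2 can be turned on from any MySQL client. A minimal sketch using PyMySQL with placeholder connection details (note it only affects TSO requests issued via TiDB, not clients that talk to PD directly):

```python
import pymysql  # pip install pymysql; any MySQL-compatible client works

# Placeholder connection details for the TiDB node kept around for GC.
conn = pymysql.connect(host="127.0.0.1", port=4000, user="root", password="")
try:
    with conn.cursor() as cur:
        # Let PD followers proxy TSO requests to offload the PD leader;
        # this can add a little latency to TSO waits.
        cur.execute("SET GLOBAL tidb_enable_tso_follower_proxy = ON")
        cur.execute("SHOW VARIABLES LIKE 'tidb_enable_tso_follower_proxy'")
        print(cur.fetchone())
finally:
    conn.close()
```
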
| username: TiDBer_fancy | Original post link

  1. Our read and write volume is fairly high; I am not sure whether this CPU usage is normal.
  2. We do not need to adjust that parameter for now; the host configuration still has headroom.
  3. Only one TiDB node is kept, mainly for GC; actual reads and writes all go through raw KV (see the metric sketch after this list).
  4. Are there any pitfalls in upgrading to 5.3.4?
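
On point 3, the split between raw KV and transactional traffic is visible in the monitoring: TiKV's gRPC request counters are labeled by message type, so raw_* versus kv_*/coprocessor types show where the QPS comes from. A sketch assuming the tikv_grpc_msg_duration_seconds_count counter that the Grafana TiKV dashboards use (metric and label names may differ slightly between versions):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PROM = "http://127.0.0.1:9090"  # placeholder Prometheus address

# Current gRPC QPS per message type, summed across all TiKV instances.
query = "sum(rate(tikv_grpc_msg_duration_seconds_count[1m])) by (type)"
with urlopen(f"{PROM}/api/v1/query?{urlencode({'query': query})}") as resp:
    result = json.load(resp)["data"]["result"]

# raw_* types (raw_get, raw_put, ...) indicate raw KV; kv_* types indicate transactional KV.
for series in sorted(result, key=lambda s: -float(s["value"][1])):
    print(f"{series['metric'].get('type', '?'):<22} {float(series['value'][1]):>12.0f} qps")
```
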
| username: TiDBer_fancy | Original post link

  1. CPU usage peaks at 64 cores and generally stays around 50 cores.

  2. PD is logging a large number of “read: connection reset by peer” errors.
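
A quick way to cross-check those connection resets is to ask PD itself whether all members are present and healthy, which is what `pd-ctl member` and `pd-ctl health` report. A sketch over the HTTP API; treat the exact /pd/api/v1/members and /pd/api/v1/health paths as assumptions to confirm on your version:

```python
import json
from urllib.request import urlopen

PD = "http://127.0.0.1:2379"  # placeholder address of any reachable PD

# Member list plus the current PD leader (what `pd-ctl member` shows).
with urlopen(f"{PD}/pd/api/v1/members") as resp:
    members = json.load(resp)
print("leader:", members.get("leader", {}).get("name"))

# Per-member health flags (what `pd-ctl health` shows).
with urlopen(f"{PD}/pd/api/v1/health") as resp:
    for m in json.load(resp):
        print(m.get("name"), "healthy" if m.get("health") else "UNHEALTHY")
```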

| username: Jellybean | Original post link

Is network access between the nodes normal? Have you checked that the nodes in different roles are running consistent versions? Is there anything abnormal in how the business accesses the cluster?

| username: TiDBer_fancy | Original post link

Access between nodes is normal, and the versions installed with tiup should be consistent, right? There are no obvious anomalies on the business side.

| username: WalterWj | Original post link

I think it is a false alarm, since the log level is INFO.
Still, ask the R&D team to take a look in the feedback area: if it has a real impact it should not be at the INFO level, and if it has no impact, logging this much also seems like a problem.