TiKV Node Error: "dispatch raft msg from gRPC to raftstore fail"

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiKV节点报错"dispatch raft msg from gRPC to raftstore fail"

| username: Leox

When I was using sysbench to insert data, the TiKV node reported the error shown in the picture (according to the log, the TiKV node crashed around 09:34). The command used to insert the data is:

sysbench --config-file=config1 oltp_common --tables=5 --table-size=100000000 prepare
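
For reference, a sysbench option file like config1 normally just holds default command-line options, one per line; the block below is a hypothetical sketch with placeholder connection values, not the poster's actual file:

mysql-host=10.10.12.78
mysql-port=4000
mysql-user=root
mysql-password=
mysql-db=sbtest
db-driver=mysql
threads=32
report-interval=10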

Here are some screenshots from Grafana:

The disk space is definitely sufficient. Is there any solution without changing the scale of data insertion?

| username: Leox | Original post link

Try increasing TiKV’s raftstore.store-pool-size to see the result (8 → 32).
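
As a rough sketch, one way to apply that on a tiup-managed cluster (the cluster name "test" is a placeholder):

# edit the tikv section and set raftstore.store-pool-size to the new value
tiup cluster edit-config test
# rolling-restart the TiKV nodes so the new value takes effect
tiup cluster reload test -R tikv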

| username: songxuecheng | Original post link

Check the memory.

| username: Leox | Original post link

The memory is sufficient. The node that crashed has 256 GB of memory, and I set storage.block-cache.capacity to 80 GB, but Grafana shows that it crashed when memory usage reached only 68.3 GB.

| username: Leox | Original post link

Increasing the raftstore.store-pool-size parameter did not solve the problem.

| username: Leox | Original post link

I'd like to ask whether any of my TiKV settings need to be modified.

| username: songxuecheng | Original post link

It’s a mixed deployment. Please also send the PD and TiDB logs and monitoring of the problematic node.

| username: Leox | Original post link

This machine has only two NUMA nodes, and I deployed one TiDB instance and one TiKV instance on it. The image above shows the TiDB log content, and it doesn't seem to show any issues.

What specific information is needed for monitoring?

| username: songxuecheng | Original post link

Memory section

| username: Leox | Original post link

The memory is normal; in the entire cluster, only the TiKV node on 10.10.12.71 is down.

And this downed node cannot be started.

| username: songxuecheng | Original post link

Use pd-ctl to check the store information and confirm whether the TiKV instances in the mixed deployment have been labeled.
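
For example (the PD address is a placeholder):

# print all stores, including their labels and state
pd-ctl -u http://<pd-host>:2379 store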

| username: songxuecheng | Original post link

Try adjusting the value of server.max-grpc-send-msg-len and then test again.
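
As a rough sketch (the cluster name and target value are placeholders; TiKV's default for this item is 10MB):

# in the tikv section, set e.g. server.max-grpc-send-msg-len: "32MB"
tiup cluster edit-config test
# rolling-restart TiKV so the change takes effect
tiup cluster reload test -R tikv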

| username: Leox | Original post link

The store information found by pd-ctl is as follows:

{
  "count": 3,
  "stores": [
    {
      "store": {
        "id": 5,
        "address": "10.10.12.78:20160",
        "labels": [
          {
            "key": "host",
            "value": "h1"
          },
          {
            "key": "zone",
            "value": "z0"
          }
        ],
        "version": "5.0.3",
        "status_address": "10.10.12.78:20180",
        "git_hash": "63b63edfbb9bbf8aeb875aad28c59f082eeb55d4",
        "start_timestamp": 1674902194,
        "deploy_path": "/data/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1674959822476930860,
        "state_name": "Up"
      },
      "status": {
        "capacity": "2.864TiB",
        "available": "2.597TiB",
        "used_size": "56.64GiB",
        "leader_count": 874,
        "leader_weight": 1,
        "leader_score": 874,
        "leader_size": 75180,
        "region_count": 1748,
        "region_weight": 1,
        "region_score": 162305.90022865508,
        "region_size": 147112,
        "start_ts": "2023-01-28T10:36:34Z",
        "last_heartbeat_ts": "2023-01-29T02:37:02.47693086Z",
        "uptime": "16h0m28.47693086s"
      }
    },
    {
      "store": {
        "id": 1,
        "address": "10.10.12.78:20161",
        "labels": [
          {
            "key": "host",
            "value": "h1"
          },
          {
            "key": "zone",
            "value": "z1"
          }
        ],
        "version": "5.0.3",
        "status_address": "10.10.12.78:20181",
        "git_hash": "63b63edfbb9bbf8aeb875aad28c59f082eeb55d4",
        "start_timestamp": 1674902194,
        "deploy_path": "/data/tidb-deploy/tikv-20161/bin",
        "last_heartbeat": 1674959822977079499,
        "state_name": "Up"
      },
      "status": {
        "capacity": "2.864TiB",
        "available": "2.597TiB",
        "used_size": "56.63GiB",
        "leader_count": 874,
        "leader_weight": 1,
        "leader_score": 874,
        "leader_size": 71932,
        "region_count": 1748,
        "region_weight": 1,
        "region_score": 162305.90022885762,
        "region_size": 147112,
        "start_ts": "2023-01-28T10:36:34Z",
        "last_heartbeat_ts": "2023-01-29T02:37:02.977079499Z",
        "uptime": "16h0m28.977079499s"
      }
    },
    {
      "store": {
        "id": 4,
        "address": "10.10.12.71:20160",
        "labels": [
          {
            "key": "host",
            "value": "h3"
          },
          {
            "key": "zone",
            "value": "z0"
          }
        ],
        "version": "5.0.3",
        "status_address": "10.10.12.71:20180",
        "git_hash": "63b63edfbb9bbf8aeb875aad28c59f082eeb55d4",
        "start_timestamp": 1674959822,
        "deploy_path": "/data/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1674917576942024622,
        "state_name": "Down"
      },
      "status": {
        "capacity": "2.864TiB",
        "available": "2.658TiB",
        "used_size": "61.8GiB",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 1748,
        "region_weight": 1,
        "region_score": 162076.62883248986,
        "region_size": 147112,
        "start_ts": "2023-01-29T02:37:02Z",
        "last_heartbeat_ts": "2023-01-28T14:52:56.942024622Z"
      }
    }
  ]
}

The TiKV stores have been labeled.

| username: songxuecheng | Original post link

After adjusting the parameters above, check again. If there are still issues, look at the logs for region 648 and check the status of this region.
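
For example, region 648's peers, leader, and epoch can be read from PD (the PD address is a placeholder):

# show the current state of region 648
pd-ctl -u http://<pd-host>:2379 region 648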

| username: Leox | Original post link

Okay, should I lower this parameter instead? It just occurred to me that increasing the grpc-concurrency parameter might also help. I'll try both and report back later.

| username: songxuecheng | Original post link

If it’s a region issue, you can first check using tikv-ctl --data-dir /path/to/tikv bad-regions.
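
As a rough sketch (the data directory is a placeholder for that store's actual path; running tikv-ctl in local mode requires the tikv-server process on that node to be stopped):

# run locally on the failed TiKV node against its data directory
tikv-ctl --data-dir /data/tidb-data/tikv-20160 bad-regions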

| username: Leox | Original post link

Adjusting these two parameters didn't work, and the TiKV node still crashed. It seems the bad-regions check also failed :dotted_line_face:

| username: songxuecheng | Original post link

This needs to be run on the failed TiKV node.

| username: Leox | Original post link

I did execute this command on the node that crashed :joy:. In the afternoon I completely deleted and rebuilt the cluster many times, and found that region 648 is very likely to be the one that crashes. I want to ask if there is anything special about this region.

| username: songxuecheng | Original post link

Check the region to see if there are any abnormal peers.
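
For example, pd-ctl can list regions with abnormal peers (the PD address is a placeholder):

# regions whose peer count is below the configured replica count
pd-ctl -u http://<pd-host>:2379 region check miss-peer
# regions that have down or pending peers
pd-ctl -u http://<pd-host>:2379 region check down-peer
pd-ctl -u http://<pd-host>:2379 region check pending-peer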