TiKV Node Error: "dispatch raft msg from gRPC to raftstore fail"

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiKV节点报错"dispatch raft msg from gRPC to raftstore fail"

| username: Leox

When I was using sysbench to insert data, the TiKV node reported the error shown in the picture (according to the log, the TiKV node crashed around 09:34). The command used to insert the data is:

sysbench --config-file=config1 oltp_common --tables=5 --table-size=100000000 prepare
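
For reference, a sysbench option file like config1 normally just holds default command-line options, one per line; the block below is a hypothetical sketch with placeholder connection values, not the poster's actual file:

mysql-host=10.10.12.78
mysql-port=4000
mysql-user=root
mysql-password=
mysql-db=sbtest
db-driver=mysql
threads=32
report-interval=10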

Here are some screenshots from Grafana:

The disk space is definitely sufficient. Is there any solution without changing the scale of data insertion?

| username: Leox | Original post link

Try increasing TiKV’s raftstore.store-pool-size to see the result (8 → 32).
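
As a rough sketch, one way to apply that on a tiup-managed cluster (the cluster name "test" is a placeholder):

# edit the tikv section and set raftstore.store-pool-size to the new value
tiup cluster edit-config test
# rolling-restart the TiKV nodes so the new value takes effect
tiup cluster reload test -R tikv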

| username: songxuecheng | Original post link

Check the memory.

| username: Leox | Original post link

The memory is sufficient. The node that crashed has 256 GB of memory, and I set storage.block-cache.capacity to 80 GB, but Grafana shows that it crashed when memory usage reached only 68.3 GB.

| username: Leox | Original post link

Increasing the raftstore.store-pool-size parameter did not solve the problem.

| username: Leox | Original post link

I'd like to ask whether any of my TiKV settings need to be modified.

| username: songxuecheng | Original post link

It’s a mixed deployment. Please also send the PD and TiDB logs and monitoring of the problematic node.

| username: Leox | Original post link

This machine has only two NUMA nodes, and I deployed one TiDB instance and one TiKV instance on it. The image above shows the TiDB log content, and it doesn't seem to show any issues.

What specific information is needed for monitoring?

| username: songxuecheng | Original post link

Memory section

| username: Leox | Original post link

The memory is normal; in the entire cluster, only the TiKV node on 10.10.12.71 is down.

And this downed node cannot be started.

| username: songxuecheng | Original post link

Use pd-ctl to check the store information and confirm whether the TiKV instances in the mixed deployment have been labeled.
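
For example (the PD address is a placeholder):

# print all stores, including their labels and state
pd-ctl -u http://<pd-host>:2379 store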

| username: songxuecheng | Original post link

Try adjusting the value of server.max-grpc-send-msg-len and then test again.
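
As a rough sketch (the cluster name and target value are placeholders; TiKV's default for this item is 10MB):

# in the tikv section, set e.g. server.max-grpc-send-msg-len: "32MB"
tiup cluster edit-config test
# rolling-restart TiKV so the change takes effect
tiup cluster reload test -R tikv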

| username: Leox | Original post link

The store information found by pd-ctl is as follows:

{
  "count": 3,
  "stores": [
    {
      "store": {
        "id": 5,
        "address": "10.10.12.78:20160",
        "labels": [
          {
            "key": "host",
            "value": "h1"
          },
          {
            "key": "zone",
            "value": "z0"
          }
        ],
        "version": "5.0.3",
        "status_address": "10.10.12.78:20180",
        "git_hash": "63b63edfbb9bbf8aeb875aad28c59f082eeb55d4",
        "start_timestamp": 1674902194,
        "deploy_path": "/data/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1674959822476930860,
        "state_name": "Up"
      },
      "status": {
        "capacity": "2.864TiB",
        "available": "2.597TiB",
        "used_size": "56.64GiB",
        "leader_count": 874,
        "leader_weight": 1,
        "leader_score": 874,
        "leader_size": 75180,
        "region_count": 1748,
        "region_weight": 1,
        "region_score": 162305.90022865508,
        "region_size": 147112,
        "start_ts": "2023-01-28T10:36:34Z",
        "last_heartbeat_ts": "2023-01-29T02:37:02.47693086Z",
        "uptime": "16h0m28.47693086s"
      }
    },
    {
      "store": {
        "id": 1,
        "address": "10.10.12.78:20161",
        "labels": [
          {
            "key": "host",
            "value": "h1"
          },
          {
            "key": "zone",
            "value": "z1"
          }
        ],
        "version": "5.0.3",
        "status_address": "10.10.12.78:20181",
        "git_hash": "63b63edfbb9bbf8aeb875aad28c59f082eeb55d4",
        "start_timestamp": 1674902194,
        "deploy_path": "/data/tidb-deploy/tikv-20161/bin",
        "last_heartbeat": 1674959822977079499,
        "state_name": "Up"
      },
      "status": {
        "capacity": "2.864TiB",
        "available": "2.597TiB",
        "used_size": "56.63GiB",
        "leader_count": 874,
        "leader_weight": 1,
        "leader_score": 874,
        "leader_size": 71932,
        "region_count": 1748,
        "region_weight": 1,
        "region_score": 162305.90022885762,
        "region_size": 147112,
        "start_ts": "2023-01-28T10:36:34Z",
        "last_heartbeat_ts": "2023-01-29T02:37:02.977079499Z",
        "uptime": "16h0m28.977079499s"
      }
    },
    {
      "store": {
        "id": 4,
        "address": "10.10.12.71:20160",
        "labels": [
          {
            "key": "host",
            "value": "h3"
          },
          {
            "key": "zone",
            "value": "z0"
          }
        ],
        "version": "5.0.3",
        "status_address": "10.10.12.71:20180",
        "git_hash": "63b63edfbb9bbf8aeb875aad28c59f082eeb55d4",
        "start_timestamp": 1674959822,
        "deploy_path": "/data/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1674917576942024622,
        "state_name": "Down"
      },
      "status": {
        "capacity": "2.864TiB",
        "available": "2.658TiB",
        "used_size": "61.8GiB",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 1748,
        "region_weight": 1,
        "region_score": 162076.62883248986,
        "region_size": 147112,
        "start_ts": "2023-01-29T02:37:02Z",
        "last_heartbeat_ts": "2023-01-28T14:52:56.942024622Z"
      }
    }
  ]
}

The TiKV stores have been labeled.

| username: songxuecheng | Original post link

After adjusting the parameters above, check again. If there are still issues, look at the logs for region 648 and check the status of this region.
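
For example, region 648's peers, leader, and epoch can be read from PD (the PD address is a placeholder):

# show the current state of region 648
pd-ctl -u http://<pd-host>:2379 region 648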

| username: Leox | Original post link

Okay, should I lower this parameter instead? It just occurred to me that increasing the grpc-concurrency parameter might also help. I'll try both and report back later.

| username: songxuecheng | Original post link

If it’s a region issue, you can first check using tikv-ctl --data-dir /path/to/tikv bad-regions.
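
As a rough sketch (the data directory is a placeholder for that store's actual path; running tikv-ctl in local mode requires the tikv-server process on that node to be stopped):

# run locally on the failed TiKV node against its data directory
tikv-ctl --data-dir /data/tidb-data/tikv-20160 bad-regions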

| username: Leox | Original post link

Adjusting these two parameters didn't work, and the TiKV node still crashed. It seems the bad-regions check also failed :dotted_line_face:

| username: songxuecheng | Original post link

This needs to be run on the failed TiKV node.

| username: Leox | Original post link

I did execute this command on the node that crashed :joy:. In the afternoon I completely deleted and rebuilt the cluster many times, and found that region 648 is very likely to be the one that crashes. I want to ask if there is anything special about this region.

| username: songxuecheng | Original post link

Check the region to see if there are any abnormal peers.
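
For example, pd-ctl can list regions with abnormal peers (the PD address is a placeholder):

# regions whose peer count is below the configured replica count
pd-ctl -u http://<pd-host>:2379 region check miss-peer
# regions that have down or pending peers
pd-ctl -u http://<pd-host>:2379 region check down-peer
pd-ctl -u http://<pd-host>:2379 region check pending-peer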