TiKV Error: KV:Raft:StepLocalMsg

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikb报错:KV:Raft:StepLocalMsg

| username: xiaogangfighting

【TiDB Usage Environment】Production Environment
【TiDB Version】v7.1.0
【Encountered Problem: Phenomenon and Impact】
One of the TiKV logs reports an error:
[2023/10/16 09:44:46.004 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=59226] [region_id=59223]
[2023/10/16 09:44:48.006 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=59226] [region_id=59223]
[2023/10/16 09:44:48.176 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=59226] [region_id=59223]
[2023/10/16 09:44:50.008 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=59226] [region_id=59223]
[2023/10/16 09:44:52.010 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=59226] [region_id=59223]
[2023/10/16 09:44:54.012 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=59226] [region_id=59223]


| username: Fly-bird | Original post link

Heartbeat timed out? Check if each node is functioning properly.

| username: tidb菜鸟一只 | Original post link

Check the status of region 59223 using pd-ctl.
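
For reference, the pd-ctl lookup can be scripted. This is only a sketch in Python: the helper names `pd_region_cmd` and `show_region` are made up for illustration, the PD address is the one that appears later in this thread, and the `http://` scheme in front of it is an assumption.

```python
import subprocess

def pd_region_cmd(region_id, pd_addr="http://10.194.132.113:2379", version="v7.1.0"):
    """Build the argv for: tiup ctl:<version> pd -u <pd_addr> region <region_id>."""
    return ["tiup", f"ctl:{version}", "pd", "-u", pd_addr, "region", str(region_id)]

def show_region(region_id, **kwargs):
    """Run pd-ctl and return its stdout (needs tiup and a reachable PD)."""
    result = subprocess.run(
        pd_region_cmd(region_id, **kwargs),
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Only prints the command line; call show_region(59223) on a host with tiup installed.
print(" ".join(pd_region_cmd(59223)))
```

The printed command can also be pasted directly into a shell on the control machine.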

| username: 大飞哥online | Original post link

The error code KV:Raft:StepLocalMsg in the TiKV log means the peer refused to step a Raft message of a local-only type (for example MsgHup or MsgBeat), which Raft does not accept from other peers.

Please also post the context of the error log for further examination.

| username: xiaogangfighting | Original post link

[2023/10/17 09:40:18.007 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=1033] [region_id=1030]
[2023/10/17 09:40:18.253 +08:00] [INFO] [apply.rs:1690] ["execute admin command"] [command="cmd_type: ChangePeerV2 change_peer_v2 { changes { change_type: AddLearnerNode peer { id: 791328 store_id: 3002 role: Learner } } }"] [index=364329] [term=7] [peer_id=59037] [region_id=59034]
[2023/10/17 09:40:18.253 +08:00] [INFO] [apply.rs:2283] ["exec ConfChangeV2"] [epoch="conf_ver: 278169 version: 224"] [kind=Simple] [peer_id=59037] [region_id=59034]
[2023/10/17 09:40:18.253 +08:00] [INFO] [apply.rs:2464] ["conf change successfully"] ["current region"="id: 59034 start_key: 7480000000000000FF6A5F720000000000FA end_key: 7480000000000000FF6B00000000000000F8 region_epoch { conf_ver: 278170 version: 224 } peers { id: 59035 store_id: 1 } peers { id: 59036 store_id: 5 } peers { id: 59037 store_id: 2 } peers { id: 791161 store_id: 231 role: Learner } peers { id: 791164 store_id: 3001 role: Learner } peers { id: 791328 store_id: 3002 role: Learner }"] ["original region"="id: 59034 start_key: 7480000000000000FF6A5F720000000000FA end_key: 7480000000000000FF6B00000000000000F8 region_epoch { conf_ver: 278169 version: 224 } peers { id: 59035 store_id: 1 } peers { id: 59036 store_id: 5 } peers { id: 59037 store_id: 2 } peers { id: 791161 store_id: 231 role: Learner } peers { id: 791164 store_id: 3001 role: Learner }"] [changes="[change_type: AddLearnerNode peer { id: 791328 store_id: 3002 role: Learner }]"] [peer_id=59037] [region_id=59034]
[2023/10/17 09:40:18.256 +08:00] [INFO] [raft.rs:2668] ["switched to configuration"] [config="Configuration { voters: Configuration { incoming: Configuration { voters: {59036, 59037, 59035} }, outgoing: Configuration { voters: {} } }, learners: {791164, 791161, 791328}, learners_next: {}, auto_leave: false }"] [raft_id=59037] [region_id=59034]
[2023/10/17 09:40:18.648 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=1033] [region_id=1030]
[2023/10/17 09:40:20.649 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=1033] [region_id=1030]
[2023/10/17 09:40:22.651 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=1033] [region_id=1030]
[2023/10/17 09:40:24.653 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=1033] [region_id=1030]
[2023/10/17 09:40:24.765 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=1033] [region_id=1030]
[2023/10/17 09:40:26.655 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=1033] [region_id=1030]
[2023/10/17 09:40:28.657 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=1033] [region_id=1030]
[2023/10/17 09:40:30.659 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=1033] [region_id=1030]

| username: 大飞哥online | Original post link

Check the machines where the errored regions are located.
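
To see which regions and peers the error is concentrated on, the log lines can be tallied. A rough Python sketch, assuming the bracketed `key=value` log format posted above; the function name `count_step_local_msg` is hypothetical and the sample lines are copied from this thread.

```python
import re
from collections import Counter

# Matches bracketed key=value fields of a TiKV log line, e.g. [region_id=59223].
FIELD_RE = re.compile(r"\[(\w+)=([^\]]*)\]")

def count_step_local_msg(lines):
    """Count KV:Raft:StepLocalMsg errors per (region_id, peer_id)."""
    counts = Counter()
    for line in lines:
        fields = dict(FIELD_RE.findall(line))
        if fields.get("err_code") != "KV:Raft:StepLocalMsg":
            continue
        if "region_id" not in fields or "peer_id" not in fields:
            continue  # skip lines that lack the identifiers
        counts[(int(fields["region_id"]), int(fields["peer_id"]))] += 1
    return counts

sample = [
    '[2023/10/17 09:40:18.007 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=1033] [region_id=1030]',
    '[2023/10/17 09:40:18.253 +08:00] [INFO] [apply.rs:2283] ["exec ConfChangeV2"] [epoch="conf_ver: 278169 version: 224"] [kind=Simple] [peer_id=59037] [region_id=59034]',
    '[2023/10/17 09:40:18.648 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=1033] [region_id=1030]',
]

for (region_id, peer_id), n in count_step_local_msg(sample).items():
    print(f"region {region_id} peer {peer_id}: {n} errors")
```

On a real node, the same function could be fed the lines of the TiKV log file instead of `sample`.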

| username: 大飞哥online | Original post link

I feel like there’s a problem with the disk.
Just a feeling :laughing:

| username: 像风一样的男子 | Original post link

Are there any other errors similar to “Region is unavailable”?

| username: xiaogangfighting | Original post link

There is no “Region is unavailable” error, and I see that the cluster status is normal when checked with tiup.

| username: xiaogangfighting | Original post link

Looking at the Grafana dashboard, the machine’s disk read/write latency is normal.

| username: 像风一样的男子 | Original post link

Observe the logs more closely; it could be a disk issue.

| username: 大飞哥online | Original post link

Use pd-ctl to check the status of the region that reported the error.

| username: xiaogangfighting | Original post link

[2023/10/17 10:35:34.611 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=59547] [region_id=59544]

Status Information:
» region 59544
{
  "id": 59544,
  "start_key": "7480000000000000FF705F720000000000FA",
  "end_key": "7480000000000000FF7100000000000000F8",
  "epoch": {
    "conf_ver": 339484,
    "version": 236
  },
  "peers": [
    {
      "id": 59545,
      "store_id": 1,
      "role_name": "Voter"
    },
    {
      "id": 59546,
      "store_id": 5,
      "role_name": "Voter"
    },
    {
      "id": 59547,
      "store_id": 2,
      "role_name": "Voter"
    },
    {
      "id": 791334,
      "store_id": 3001,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    },
    {
      "id": 791340,
      "store_id": 231,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    },
    {
      "id": 791349,
      "store_id": 3002,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    }
  ],
  "leader": {
    "id": 59547,
    "store_id": 2,
    "role_name": "Voter"
  },
  "pending_peers": [
    {
      "id": 791349,
      "store_id": 3002,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    }
  ],
  "cpu_usage": 0,
  "written_bytes": 0,
  "read_bytes": 0,
  "written_keys": 0,
  "read_keys": 0,
  "approximate_size": 111,
  "approximate_keys": 664934
}
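
The region output above can be condensed to one row per peer to make the leader and pending peers easier to spot. A small Python sketch, assuming the JSON shape shown above; `summarize_region` is a hypothetical helper and the sample is a trimmed copy of the output for region 59544.

```python
import json

def summarize_region(region_json):
    """Return one dict per peer with its store, role, leader flag and pending flag."""
    region = json.loads(region_json)
    pending = {p["id"] for p in region.get("pending_peers", [])}
    leader_id = region.get("leader", {}).get("id")
    return [
        {
            "peer_id": peer["id"],
            "store_id": peer["store_id"],
            "role": peer.get("role_name", "Voter"),
            "is_leader": peer["id"] == leader_id,
            "is_pending": peer["id"] in pending,
        }
        for peer in region["peers"]
    ]

# Trimmed copy of the pd-ctl output for region 59544 from this thread.
SAMPLE = """
{
  "id": 59544,
  "peers": [
    {"id": 59545, "store_id": 1, "role_name": "Voter"},
    {"id": 59546, "store_id": 5, "role_name": "Voter"},
    {"id": 59547, "store_id": 2, "role_name": "Voter"},
    {"id": 791334, "store_id": 3001, "role": 1, "role_name": "Learner", "is_learner": true},
    {"id": 791340, "store_id": 231, "role": 1, "role_name": "Learner", "is_learner": true},
    {"id": 791349, "store_id": 3002, "role": 1, "role_name": "Learner", "is_learner": true}
  ],
  "leader": {"id": 59547, "store_id": 2, "role_name": "Voter"},
  "pending_peers": [
    {"id": 791349, "store_id": 3002, "role": 1, "role_name": "Learner", "is_learner": true}
  ]
}
"""

for row in summarize_region(SAMPLE):
    print(row)
```

For this region, the peer that logs the error (59547, on store 2) is the leader, and the learner on store 3002 is the only pending peer.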

| username: 大飞哥online | Original post link

There are three TiFlash nodes; I see three learners :joy:

| username: 大飞哥online | Original post link

The error message indicates that there is an issue with this store ID. Check the IP of this machine and see if other region errors also point to this IP.
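
One way to check whether the errors cluster on a single store is to group the erroring peers by store ID. A Python sketch under stated assumptions: `errors_per_store` is a made-up helper, the error counts are simply the number of error lines per peer in the excerpts posted in this thread, and only peer 59547 has a store ID known from the thread; the others would need a pd-ctl region lookup first.

```python
from collections import Counter

def errors_per_store(error_peers, peer_to_store):
    """Sum error counts per store_id; peers with no known store are grouped under None."""
    totals = Counter()
    for peer_id, count in error_peers.items():
        totals[peer_to_store.get(peer_id)] += count
    return totals

# peer_id -> number of StepLocalMsg error lines in the excerpts posted in this thread
error_peers = {59226: 6, 1033: 9, 59547: 1}
# peer_id -> store_id; only peer 59547 (region 59544) was looked up in this thread
peer_to_store = {59547: 2}

totals = errors_per_store(error_peers, peer_to_store)
for store_id, n in totals.items():
    print(f"store {store_id}: {n} errors")
```

Once a store stands out, `store <id>` in pd-ctl shows its address, which gives the machine to inspect.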

| username: xiaogangfighting | Original post link

"store_id": 2 shows up here as well. How should this be handled?

| username: 像风一样的男子 | Original post link

Check whether this TiKV node has any corrupted SST files.

| username: 大飞哥online | Original post link

tikv-ctl prints the corrupted SST files; just follow its suggestions to repair them. If you have time, you can also scan the machine's disk to make sure it is healthy.

| username: xiaogangfighting | Original post link

[root@is-pcstore-pro-dc-tidb-01 ~]# tiup ctl:v7.1.0 tikv --data-dir /data/tidb-data/tikv-20160/ bad-ssts --pd 10.194.132.113:2379
Starting component ctl: /root/.tiup/components/ctl/v7.1.0/ctl tikv --data-dir /data/tidb-data/tikv-20160/ bad-ssts --pd 10.194.132.113:2379

start to print bad ssts; data_dir:/data/tidb-data/tikv-20160/; db:/data/tidb-data/tikv-20160/db

corruption analysis has completed

| username: 大飞哥online | Original post link

That’s it, nothing more? :joy: