Error in KV Node After Upgrading from 7.1.4 to 7.1.5

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 7.1.4升级到7.1.5之后kv节点出错

| username: 雨一直下

【TiDB Usage Environment】Testing
【TiDB Version】7.1.5
【Reproduction Path】Upgrade from 7.1.4 to 7.1.5
【Encountered Issue: Problem Phenomenon and Impact】
A TiKV node reports errors while the cluster status looks normal:

However, when checking the running threads, they are found to be constantly committing:

Then checking the log of one of the TiKV nodes:

> [2024/06/11 04:04:01.635 -04:00] [INFO] [<unknown>] ["Subchannel 0x7f08e136c580: Retry in 1000 milliseconds"]
> [2024/06/11 04:04:01.635 -04:00] [WARN] [raft_client.rs:296] ["RPC batch_raft fail"] [err="Some(RpcFailure(RpcStatus { status: 14-UNAVAILABLE, details: Some(\"failed to connect to all addresses\") }))"] [sink_err="Some(RpcFinished(Some(RpcStatus { status: 14-UNAVAILABLE, details: Some(\"failed to connect to all addresses\") })))"] [to_addr=192.168.0.241:20161]
> [2024/06/11 04:04:01.638 -04:00] [WARN] [raft_client.rs:199] ["send to 192.168.0.241:20161 fail, the gRPC connection could be broken"]
> [2024/06/11 04:04:01.638 -04:00] [ERROR] [transport.rs:163] ["send raft msg err"] [err="Other(\"[src/server/raft_client.rs:208]: RaftClient send fail\")"]
> [2024/06/11 04:04:01.638 -04:00] [INFO] [transport.rs:144] ["resolve store address ok"] [addr=192.168.0.241:20161] [store_id=1]
> [2024/06/11 04:04:01.638 -04:00] [INFO] [raft_client.rs:48] ["server: new connection with tikv endpoint"] [addr=192.168.0.241:20161]
> [2024/06/11 04:04:01.638 -04:00] [INFO] [<unknown>] ["Connect failed: {\"created\":\"@1718093041.638676386\",\"description\":\"Failed to connect to remote host: Connection refused\",\"errno\":111,\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.5.3/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":200,\"os_error\":\"Connection refused\",\"syscall\":\"connect\",\"target_address\":\"ipv4:192.168.0.241:20161\"}"]

The log is full of such errors and grows very quickly; hundreds of GB of logs have already been written.
Checking the log on 192.168.0.241:

[2024/06/11 04:04:59.416 -04:00] [INFO] [apply.rs:1699] ["execute admin command"] [command="cmd_type: ChangePeerV2 change_peer_v2 { changes { peer { id: 144727509 store_id: 1 } } changes { change_type: AddLearnerNode peer { id: 144053190 store_id: 874002 role: Learner } } }"] [index=147] [term=85] [peer_id=144558754] [region_id=144053189]
[2024/06/11 04:04:59.416 -04:00] [INFO] [apply.rs:2292] ["exec ConfChangeV2"] [epoch="conf_ver: 96 version: 445"] [kind=EnterJoint] [peer_id=144558754] [region_id=144053189]
[2024/06/11 04:04:59.416 -04:00] [INFO] [apply.rs:2473] ["conf change successfully"] ["current region"="id: 144053189 start_key: 7480000000000001FFED5F698000000000FF00000A0380000000FF00001D5303800000FF000000005E0419A5FFED17030000000380FF000000058B0CF600FE end_key: 7480000000000001FFED5F698000000000FF00000A0380000000FF00001D7503800000FF000000005E0419A7FF1B47030000000380FF00000003181D2300FE region_epoch { conf_ver: 98 version: 445 } peers { id: 144053190 store_id: 874002 role: DemotingVoter } peers { id: 144558754 store_id: 887001 } peers { id: 144718902 store_id: 4 } peers { id: 144727509 store_id: 1 role: IncomingVoter }"] ["original region"="id: 144053189 start_key: 7480000000000001FFED5F698000000000FF00000A0380000000FF00001D5303800000FF000000005E0419A5FFED17030000000380FF000000058B0CF600FE end_key: 7480000000000001FFED5F698000000000FF00000A0380000000FF00001D7503800000FF000000005E0419A7FF1B47030000000380FF00000003181D2300FE region_epoch { conf_ver: 96 version: 445 } peers { id: 144053190 store_id: 874002 } peers { id: 144558754 store_id: 887001 } peers { id: 144718902 store_id: 4 } peers { id: 144727509 store_id: 1 role: Learner }"] [changes="[peer { id: 144727509 store_id: 1 }, change_type: AddLearnerNode peer { id: 144053190 store_id: 874002 role: Learner }]"] [peer_id=144558754] [region_id=144053189]
[2024/06/11 04:04:59.417 -04:00] [INFO] [raft.rs:2660] ["switched to configuration"] [config="Configuration { voters: Configuration { incoming: Configuration { voters: {144727509, 144558754, 144718902} }, outgoing: Configuration { voters: {144558754, 144718902, 144053190} } }, learners: {}, learners_next: {144053190}, auto_leave: false }"] [raft_id=144558754] [region_id=144053189]
[2024/06/11 04:04:59.419 -04:00] [INFO] [apply.rs:1699] ["execute admin command"] [command="cmd_type: ChangePeerV2 change_peer_v2 {}"] [index=148] [term=85] [peer_id=144558754] [region_id=144053189]
[2024/06/11 04:04:59.419 -04:00] [INFO] [apply.rs:2292] ["exec ConfChangeV2"] [epoch="conf_ver: 98 version: 445"] [kind=LeaveJoint] [peer_id=144558754] [region_id=144053189]
[2024/06/11 04:04:59.419 -04:00] [INFO] [apply.rs:2503] ["leave joint state successfully"] [region="id: 144053189 start_key: 7480000000000001FFED5F698000000000FF00000A0380000000FF00001D5303800000FF000000005E0419A5FFED17030000000380FF000000058B0CF600FE end_key: 7480000000000001FFED5F698000000000FF00000A0380000000FF00001D7503800000FF000000005E0419A7FF1B47030000000380FF00000003181D2300FE region_epoch { conf_ver: 100 version: 445 } peers { id: 144053190 store_id: 874002 role: Learner } peers { id: 144558754 store_id: 887001 } peers { id: 144718902 store_id: 4 } peers { id: 144727509 store_id: 1 }"] [peer_id=144558754] [region_id=144053189]

It doesn’t look like a serious problem, but how do I fix it now? I have already restarted once, and it is still the same.
What is very strange is that my configuration only uses port 20160, so why is it trying to access port 20161?
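
A quick way to see whether anything is actually listening on that port on 192.168.0.241 (a sketch; run it on that host):

# show which process, if any, is bound to port 20161
ss -lntp | grep 20161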

| username: h5n1 | Original post link

Check the status and logs of this TiKV.
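
For reference, a minimal way to do that with tiup (a sketch; the cluster name tidb-test is a placeholder, and the log path is the one from this deployment):

# overall component status as seen by tiup
tiup cluster display tidb-test
# recent warnings/errors from this TiKV's log
grep -E 'WARN|ERROR' /home/tidb/Data/log/tikv-20160/tikv.log | tail -n 100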

| username: 雨一直下 | Original post link

There is no 192.168.0.241:20161, only 192.168.0.241:20160. The cluster also shows 192.168.0.241:20160. I don’t know where 192.168.0.241:20161 came from.

| username: h5n1 | Original post link

Have you scaled the cluster out or in before? Check with pd-ctl store.
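
For example, using the PD address from this deployment (a sketch; assumes a tiup-managed cluster on v7.1.5):

# list every store PD knows about, with its address, state, and store ID
tiup ctl:v7.1.5 pd -u http://192.168.0.246:2379 store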

| username: 雨一直下 | Original post link

No scaling was done; I just installed other clusters:

| username: h5n1 | Original post link

What is the complete output? Check the processes on this node with ps -ef | grep tikv.

| username: 雨一直下 | Original post link

The output also contains other nodes’ information, but this node has only this one store.

[root@node2 tikv-20160]#  ps -ef|grep tikv
tidb      1757     1 41 02:52 ?        00:41:56 bin/tikv-server --addr 0.0.0.0:20160 --advertise-addr 192.168.0.241:20160 --status-addr 0.0.0.0:20180 --advertise-status-addr 192.168.0.241:20180 --pd 192.168.0.246:2379 --data-dir /home/tidb/Data/data/tikv-20160 --config conf/tikv.toml --log-file /home/tidb/Data/log/tikv-20160/tikv.log
root     27214   496  0 04:32 pts/0    00:00:00 grep --color=auto tikv

| username: 雨一直下 | Original post link

I found the cause: it was another cluster. I have already stopped it, but the transaction hang is still there, so data still cannot be written.

| username: h5n1 | Original post link

The logs of this cluster contain information about a node from another cluster, so your cluster metadata is probably mixed up. The PD nodes of the two clusters use the same IP and port, right?
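
One way to compare the two deployments side by side (a sketch; the cluster names are placeholders):

# list all clusters managed by this tiup
tiup cluster list
# show only the PD nodes of each cluster to compare their IPs and ports
tiup cluster display <cluster-a> -R pd
tiup cluster display <cluster-b> -R pd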

| username: 雨一直下 | Original post link

No, they use different ports. The two clusters were already running side by side before the upgrade and have been for several years.

| username: 雨一直下 | Original post link

Also, queries and the like all still work, but writes to one table keep getting stuck, and right now I don’t know why.
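
To see where the writes are blocked, the lock-wait and processlist views can help (a sketch; the TiDB address 192.168.0.246:4000 and the root user are assumptions):

# transactions currently waiting on pessimistic locks
mysql -h 192.168.0.246 -P 4000 -u root -p -e "SELECT * FROM information_schema.data_lock_waits\G"
# long-running statements across all TiDB instances
mysql -h 192.168.0.246 -P 4000 -u root -p -e "SELECT instance, id, time, state, info FROM information_schema.cluster_processlist WHERE command != 'Sleep'\G"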

| username: zhaokede | Original post link

Is the log output directory not isolated?

| username: 雨一直下 | Original post link

Isolated

| username: WalterWj | Original post link

If this cluster’s TiKV has already recognized that TiKV, it means a connection was established between the two sides and the metadata is mixed up. When you were still able to use it, it was probably a single replica; after that node joined it became multiple replicas, so I expect that shutting it down again causes problems. You either need to rebuild the cluster, or try Online Unsafe Recovery to remove the corresponding node: Online Unsafe Recovery documentation | PingCAP Docs
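
For reference, the Online Unsafe Recovery flow with pd-ctl looks roughly like this (a sketch only; in the logs above store_id 1 resolves to 192.168.0.241:20161, but verify the store ID with the store command before removing anything, and the target store must already be stopped):

# confirm which store IDs PD still tracks and their addresses
tiup ctl:v7.1.5 pd -u http://192.168.0.246:2379 store
# ask PD to drop the failed/foreign store from all Region metadata (store ID 1 is only an example)
tiup ctl:v7.1.5 pd -u http://192.168.0.246:2379 unsafe remove-failed-stores 1
# watch the recovery progress
tiup ctl:v7.1.5 pd -u http://192.168.0.246:2379 unsafe remove-failed-stores show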