Cluster fails to start after upgrading from v5.3.0 to v6.5.0

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: v5.3.0升级到v6.5.0后集群启动不了

| username: TiDBer_XsS0xKSu

[TiDB Usage Environment] Production Environment
[TiDB Version] v6.5.0
[2023/03/14 09:03:27.514 +08:00] [WARN] [retry_interceptor.go:61] ["retrying of unary invoker failed"] [target=endpoint://client-24ca359a-0253-4d2d-a51f-79ec0016c722/10.0.124.75:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]

| username: Kongdom | Original post link

Are there any ERROR-level messages in the log?

| username: TiDBer_XsS0xKSu | Original post link

[2023/03/13 20:06:09.156 +08:00] [ERROR] [util.rs:419] ["request failed, retry"] [err_code=KV:PD:Unknown] [err="Other("[components/pd_client/src/tso.rs:85]: TimestampRequest channel is closed")"]
(the same line repeated a dozen more times within the same millisecond)
[2023/03/13 20:06:09.157 +08:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=http://10.0.124.75:2379]
[2023/03/13 20:06:09.157 +08:00] [ERROR] [util.rs:419] ["request failed, retry"] [err_code=KV:PD:Unknown] [err="Other(TrySendError { kind: Disconnected })"]
[2023/03/13 20:06:09.158 +08:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=http://10.0.124.75:2379]
[2023/03/13 20:06:09.158 +08:00] [WARN] [client.rs:138] ["failed to update PD client"] [error="Other("[components/pd_client/src/util.rs:598]: failed to connect to [name: \"pd-10.0.124.75-2379\" member_id: 6844741470744445767 peer_urls: \"http://10.0.124.75:2380\" client_urls: \"http://10.0.124.75:2379\"]")"]
(the connect/fail pair above repeated once more)
[2023/03/13 20:06:09.158 +08:00] [ERROR] [util.rs:419] ["request failed, retry"] [err_code=KV:PD:gRPC] [err="Grpc(RpcFailure(RpcStatus { code: 2-UNKNOWN, message: "rpc error: code = Unavailable desc = not leader", details: }))"]
[2023/03/13 20:06:09.158 +08:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=http://10.0.124.75:2379]
[2023/03/13 20:06:09.158 +08:00] [WARN] [client.rs:138] ["failed to update PD client"] [error="Other("[components/pd_client/src/util.rs:306]: cancel reconnection due to too small interval")"]
[2023/03/13 20:06:09.158 +08:00] [ERROR] [util.rs:419] ["request failed, retry"] [err_code=KV:PD:gRPC] [err="Grpc(RpcFailure(RpcStatus { code: 2-UNKNOWN, message: "rpc error: code = Unavailable desc = not leader", details: }))"]

| username: TiDBer_XsS0xKSu | Original post link

Then the TiKV log keeps printing this:
[2023/03/14 09:04:46.461 +08:00] [WARN] [pd.rs:1693] ["failed to update max timestamp for region 63021: Pd(Other("[components/pd_client/src/tso.rs:97]: TimestampRequest channel is closed"))"]
(the same line repeated many more times at the same timestamp)

| username: Kongdom | Original post link

Are the PD nodes functioning properly? Especially the PD Leader node.
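
One quick way to answer this is PD's built-in HTTP health endpoint. Below is a minimal probe sketch, assuming the PD client URL from the logs above (http://10.0.124.75:2379) — substitute your own PD address:

```shell
# Probe a PD node's health endpoint. The host/port below are taken from this
# thread's logs and are assumptions -- replace them with your PD client address.
pd_health() {
  local host=$1 port=$2
  # /pd/api/v1/health is PD's HTTP health endpoint
  if curl -sf --max-time 2 "http://${host}:${port}/pd/api/v1/health"; then
    echo "PD at ${host}:${port} is reachable"
  else
    echo "PD at ${host}:${port} is unreachable or unhealthy"
  fi
}

pd_health 10.0.124.75 2379
```

If this reports the PD leader as unreachable, focus on PD first before touching TiKV.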

| username: TiDBer_XsS0xKSu | Original post link

There is only one PD node, and it cannot start normally.

| username: TiDBer_XsS0xKSu | Original post link

It was working fine last Friday, and then it crashed on Monday.

| username: 裤衩儿飞上天 | Original post link

What error is PD reporting?

| username: TiDBer_XsS0xKSu | Original post link

[2023/03/14 09:03:27.514 +08:00] [WARN] [retry_interceptor.go:61] ["retrying of unary invoker failed"] [target=endpoint://client-24ca359a-0253-4d2d-a51f-79ec0016c722/10.0.124.75:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]

| username: Kongdom | Original post link

Have you checked that the network is reachable between the nodes?

| username: TiDBer_XsS0xKSu | Original post link

The network is connected, and SSH can be accessed without a password.

| username: TiDBer_XsS0xKSu | Original post link

When starting up, this error is reported:
Error: failed to start tikv: failed to start: 10.0.124.78 tikv-20160.service, please check the instance's log(/tidb-deploy/tikv-20160/log) for more detail.: timed out waiting for port 20160 to be started after 2m0s

| username: Kongdom | Original post link

Are the ports accessible? Check the firewall rules.

What error messages are in /tidb-deploy/tikv-20160/log?
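
A quick reachability test for those ports can be done with bash's /dev/tcp; a sketch, assuming the addresses from this thread (swap in your own):

```shell
# TCP reachability check. The IPs/ports below come from this thread and are
# examples -- replace them with your own PD/TiKV addresses.
check_port() {
  local host=$1 port=$2
  # /dev/tcp is a bash special file; the timeout bounds slow/filtered hosts
  if timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "${host}:${port} open"
  else
    echo "${host}:${port} closed or filtered"
  fi
}

check_port 10.0.124.75 2379   # PD client port
check_port 10.0.124.78 20160  # TiKV port from the startup error
```

If a port shows closed while its service is running, inspect the firewalld/iptables rules on that host.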

| username: tidb菜鸟一只 | Original post link

The cluster startup operation will start all components of the entire TiDB cluster in the order of PD → TiKV → Pump → TiDB → TiFlash → Drainer → TiCDC → Prometheus → Grafana → Alertmanager. You need to ensure that your PD nodes can start normally before your TiKV nodes can start.

| username: TiDBer_XsS0xKSu | Original post link

PD has already started.

| username: TiDBer_XsS0xKSu | Original post link

[2023/03/14 11:12:21.988 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: "Deadline Exceeded", details: }))"]
[2023/03/14 11:12:23.001 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: "Deadline Exceeded", details: }))"]
[2023/03/14 11:12:23.484 +08:00] [INFO] [sst_importer.rs:442] ["shrink cache by tick"] ["retain size"=0] ["shrink size"=0]
[2023/03/14 11:12:23.989 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: "Deadline Exceeded", details: }))"]
[2023/03/14 11:12:26.324 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: "Deadline Exceeded", details: }))"]
(the same line repeated four more times at the same timestamp)
[2023/03/14 11:12:26.324 +08:00] [INFO] [util.rs:598] ["connecting to PD endpoint"] [endpoints=http://10.0.124.79:2379]
[2023/03/14 11:12:26.325 +08:00] [INFO] [util.rs:763] ["connected to PD member"] [endpoints=http://10.0.124.79:2379]
[2023/03/14 11:12:26.325 +08:00] [INFO] [util.rs:220] ["heartbeat sender and receiver are stale, refreshing …"]
[2023/03/14 11:12:26.325 +08:00] [INFO] [util.rs:233] ["buckets sender and receiver are stale, refreshing …"]
[2023/03/14 11:12:26.325 +08:00] [INFO] [util.rs:260] ["update pd client"] [via=] [leader=http://10.0.124.79:2379] [prev_via=] [prev_leader=http://10.0.124.79:2379]
[2023/03/14 11:12:26.325 +08:00] [INFO] [util.rs:394] ["trying to update PD client done"] [spend=5.766267ms]
[2023/03/14 11:12:26.325 +08:00] [INFO] [tso.rs:162] ["TSO worker terminated"] [receiver_cause=None] [sender_cause=None]
[2023/03/14 11:12:26.325 +08:00] [INFO] [client.rs:147] ["TSO stream is closed, reconnect to PD"]
[2023/03/14 11:12:26.325 +08:00] [WARN] [client.rs:149] ["failed to update PD client"] [error="Other("[components/pd_client/src/util.rs:337]: cancel reconnection due to too small interval")"]
(the DEADLINE_EXCEEDED errors and the reconnect cycle above then repeat)

| username: jansu-dev | Original post link

  1. [2023/03/13 20:06:09.158 +08:00] [ERROR] [util.rs:419] ["request failed, retry"] [err_code=KV:PD:gRPC] [err="Grpc(RpcFailure(RpcStatus { code: 2-UNKNOWN, message: "rpc error: code = Unavailable desc = not leader", details: }))"] —> This shows TiKV sending an RPC to PD and getting an UNKNOWN error back. First, it confirms there is a problem with the TiKV-to-PD connection. Second, the UNKNOWN error suggests the PD or TiKV binaries may not actually have been upgraded. Confirm that all binaries are up to date (./pd-server --version or ./tikv-server --version).

  2. Having only one PD node, and one that cannot start properly, is quite dangerous for production! As tidb菜鸟一只 said, please provide the complete pd.log so we can confirm whether PD is actually running normally, and to understand why it was "working fine last Friday but failed on Monday".

  3. The current information is still not enough to solve this problem.

| username: liuis | Original post link

Since this happened during an upgrade, the network was presumably fine beforehand… This error alone doesn't reveal much; there is too little information.

| username: xingzhenxiang | Original post link

I need to upgrade from version 4.0x to 6.5x. Has the issue been resolved in the end?

| username: kkpeter | Original post link

  1. If the cause cannot be identified, stop troubleshooting this node: remove this PD from the cluster and bring up a brand-new PD to join the cluster. Make sure PD is functioning normally first.
  2. Then check whether there are any anomalies with TiKV.
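
With TiUP, the replace-the-PD approach above can be sketched as follows. This is a sketch, not a verified recovery procedure: the cluster name, node address, and new host are placeholders to fill in from your own `tiup cluster display` output, and it only applies if the cluster still has at least one healthy PD (removing the sole PD loses cluster metadata).

```shell
# Sketch of kkpeter's suggestion using TiUP; <cluster-name> and the addresses
# are placeholders -- take the real values from `tiup cluster display`.

# 1. Remove the misbehaving PD (add --force if the node is already down):
tiup cluster scale-in <cluster-name> --node 10.0.124.75:2379

# 2. Add a fresh PD described in a small topology file:
cat > scale-out-pd.yaml <<'EOF'
pd_servers:
  - host: 10.0.124.80   # hypothetical new PD host
EOF
tiup cluster scale-out <cluster-name> scale-out-pd.yaml

# 3. Confirm PD is healthy before restarting TiKV:
tiup cluster display <cluster-name>
```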