TiDB Cluster Fails to Start

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB集群无法启动

| username: xiaofagn

[TiDB Usage Environment] Production Environment / Test / PoC
[TiDB Version] 6.5
[Encountered Problem] The TiDB cluster crashed for unknown reasons
[Reproduction Path] No operations were performed
[Problem Phenomenon and Impact]
After the crash, the cluster cannot be restarted normally. PD starts fine on its own, and TiKV also starts fine on its own, but after a while the whole cluster goes down again.
PD log:

[2023/06/15 15:24:39.106 +08:00] [WARN] [v3_server.go:830] ["timed out waiting for read index response (local node might have slow network)"] [timeout=11s]
[2023/06/15 15:24:39.149 +08:00] [WARN] [etcd_kv.go:160] ["txn runs too slow"] [response=null] [cost=10.00018136s] [error="context deadline exceeded"]
[2023/06/15 15:24:39.149 +08:00] [WARN] [allocator_manager.go:629] ["failed to update allocator's timestamp"] [dc-location=global] [error="[PD:etcd:ErrEtcdKVPut]context deadline exceeded: context deadline exceeded"]
[2023/06/15 15:24:39.149 +08:00] [INFO] [tso.go:416] ["reset the timestamp in memory"]
[2023/06/15 15:24:39.149 +08:00] [INFO] [lease.go:159] ["stop lease keep alive worker"] [purpose="pd leader election"]
[2023/06/15 15:24:39.188 +08:00] [INFO] [server.go:1522] ["no longer a leader because lease has expired, pd leader will step down"]
[2023/06/15 15:24:39.606 +08:00] [WARN] [v3_server.go:814] ["waiting for ReadIndex response took too long, retrying"] [sent-request-id=13642960994652025282] [retry-timeout=500ms]
[2023/06/15 15:24:39.707 +08:00] [WARN] [util.go:163] ["apply request took too long"] [took=2.00025902s] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/topology/tidb/\" range_end:\"/topology/tidb0\" "] [response=] [error="context canceled"]
[2023/06/15 15:24:39.707 +08:00] [INFO] [trace.go:152] ["trace[553154447] range"] [detail="{range_begin:/topology/tidb/; range_end:/topology/tidb0; }"] [duration=2.000356001s] [start=2023/06/15 15:24:37.706 +08:00] [end=2023/06/15 15:24:39.707 +08:00] [steps="[\"trace[553154447] 'agreement among raft nodes before linearized reading' (duration: 2.000274096s)\"]"]
[2023/06/15 15:24:40.107 +08:00] [WARN] [v3_server.go:814] ["waiting for ReadIndex response took too long, retrying"] [sent-request-id=13642960994652025282] [retry-timeout=500ms]
[2023/06/15 15:24:40.607 +08:00] [WARN] [v3_server.go:814] ["waiting for ReadIndex response took too long, retrying"] [sent-request-id=13642960994652025282] [retry-timeout=500ms]
[2023/06/15 15:24:40.982 +08:00] [WARN] [util.go:163] ["apply request took too long"] [took=5.000155428s] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/tidb/store/gcworker/saved_safe_point\" "] [response=] [error="context canceled"]
[2023/06/15 15:24:40.982 +08:00] [INFO] [trace.go:152] ["trace[1517060739] range"] [detail="{range_begin:/tidb/store/gcworker/saved_safe_point; range_end:; }"] [duration=5.000265051s] [start=2023/06/15 15:24:35.981 +08:00] [end=2023/06/15 15:24:40.982 +08:00] [steps="[\"trace[1517060739] 'agreement among raft nodes before linearized reading' (duration: 5.000170095s)\"]"]
[2023/06/15 15:24:41.108 +08:00] [WARN] [v3_server.go:814] ["waiting for ReadIndex response took too long, retrying"] [sent-request-id=13642960994652025282] [retry-timeout=500ms]
[2023/06/15 15:24:41.609 +08:00] [WARN] [v3_server.go:814] ["waiting for ReadIndex response took too long, retrying"] [sent-request-id=13642960994652025282] [retry-timeout=500ms]
[2023/06/15 15:24:42.110 +08:00] [WARN] [v3_server.go:814] ["waiting for ReadIndex response took too long, retrying"] [sent-request-id=13642960994652025282] [retry-timeout=500ms]
[2023/06/15 15:24:42.611 +08:00] [WARN] [v3_server.go:814] ["waiting for ReadIndex response took too long, retrying"] [sent-request-id=13642960994652025282] [retry-timeout=500ms]
[2023/06/15 15:24:42.711 +08:00] [WARN] [util.go:163] ["apply request took too long"] [took=10.000255023s] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/pd/7204350633511442718/scheduler_config/\" range_end:\"/pd/7204350633511442718/scheduler_config0\" limit:1000 "] [response=] [error="context deadline exceeded"]
[2023/06/15 15:24:42.711 +08:00] [WARN] [etcdutil.go:121] ["kv gets too slow"] [request-key=/pd/7204350633511442718/scheduler_config/] [cost=10.000457391s] [error="context deadline exceeded"]
[2023/06/15 15:24:42.711 +08:00] [INFO] [trace.go:152] ["trace[1357921245] range"] [detail="{range_begin:/pd/7204350633511442718/scheduler_config/; range_end:/pd/7204350633511442718/scheduler_config0; }"] [duration=10.000356053s] [start=2023/06/15 15:24:32.710 +08:00] [end=2023/06/15 15:24:42.711 +08:00] [steps="[\"trace[1357921245] 'agreement among raft nodes before linearized reading' (duration: 10.000267673s)\"]"]
[2023/06/15 15:24:42.711 +08:00] [ERROR] [etcdutil.go:126] ["load from etcd meet error"] [key=/pd/7204350633511442718/scheduler_config/] [error="[PD:etcd:ErrEtcdKVGet]context deadline exceeded: context deadline exceeded"]
[2023/06/15 15:24:43.111 +08:00] [WARN] [v3_server.go:814] ["waiting for ReadIndex response took too long, retrying"] [sent-request-id=13642960994652025282] [retry-timeout=500ms]
[2023/06/15 15:24:43.612 +08:00] [WARN] [v3_server.go:814] ["waiting for ReadIndex response took too long, retrying"] [sent-request-id=13642960994652025282] [retry-timeout=500ms]
[2023/06/15 15:24:46.115 +08:00] [WARN] [v3_server.go:814] ["waiting for ReadIndex response took too long, retrying"] [sent-request-id=13642960994652025282] [retry-timeout=500ms]
[2023/06/15 15:24:46.148 +08:00] [WARN] [util.go:163] ["apply request took too long"] [took=10.000001343s] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/pd/7204350633511442718/gc/safe_point\" "] [response=] [error="context deadline exceeded"]
[2023/06/15 15:24:46.148 +08:00] [WARN] [util.go:163] ["apply request took too long"] [took=10.000079733s] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/pd/7204350633511442718/gc/safe_point\" "] [response=] [error="context deadline exceeded"]
[2023/06/15 15:24:46.148 +08:00] [WARN] [etcdutil.go:121] ["kv gets too slow"] [request-key=/pd/7204350633511442718/gc/safe_point] [cost=10.00022946s] [error="context deadline exceeded"]
[2023/06/15 15:24:46.148 +08:00] [INFO] [trace.go:152] ["trace[1654063221] range"] [detail="{range_begin:/pd/7204350633511442718/gc/safe_point; range_end:; }"] [duration=10.000089974s] [start=2023/06/15 15:24:36.148 +08:00] [end=2023/06/15 15:24:46.148 +08:00] [steps="[\"trace[1654063221] 'agreement among raft nodes before linearized reading' (duration: 10.000008688s)\"]"]
[2023/06/15 15:24:46.148 +08:00] [WARN] [etcdutil.go:121] ["kv gets too slow"] [request-key=/pd/7204350633511442718/gc/safe_point] [cost=10.000311315s] [error="context deadline exceeded"]
[2023/06/15 15:24:46.148 +08:00] [ERROR] [etcdutil.go:126] ["load from etcd meet error"] [key=/pd/7204350633511442718/gc/safe_point] [error="[PD:etcd:ErrEtcdKVGet]context deadline exceeded: context deadline exceeded"]
[2023/06/15 15:24:46.148 +08:00] [ERROR] [etcdutil.go:126] ["load from etcd meet error"] [key=/pd/7204350633511442718/gc/safe_point] [error="[PD:etcd:ErrEtcdKVGet]context deadline exceeded: context deadline exceeded"]
[2023/06/15 15:24:46.148 +08:00] [INFO] [trace.go:152] ["trace[120052108] range"] [detail="{range_begin:/pd/7204350633511442718/gc/safe_point; range_end:; }"] [duration=10.00015238s] [start=2023/06/15 15:24:36.148 +08:00] [end=2023/06/15 15:24:46.148 +08:00] [steps="[\"trace[120052108] 'agreement among raft nodes before linearized reading' (duration: 10.000093416s)\"]"]
[2023/06/15 15:24:46.148 +08:00] [WARN] [etcdutil.go:121] ["kv gets too slow"] [request-key=/pd/7204350633511442718/gc/safe_point] [cost=10.000032107s] [error="context deadline exceeded"]
[2023/06/15 15:24:46.148 +08:00] [ERROR] [etcdutil.go:126] ["load from etcd meet error"] [key=/pd/7204350633511442718/gc/safe_point] [error="[PD:etcd:ErrEtcdKVGet]context deadline exceeded: context deadline exceeded"]
[2023/06/15 15:24:46.148 +08:00] [WARN] [util.go:163] ["apply request took too long"] [took=9.999966894s] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/pd/7204350633511442718/gc/safe_point\" "] [response=] [error="context canceled"]
[2023/06/15 15:24:46.148 +08:00] [WARN] [etcdutil.go:121] ["kv gets too slow"] [request-key=/pd/7204350633511442718/gc/safe_point] [cost=10.000055744s] [error="context deadline exceeded"]
[2023/06/15 15:24:46.148 +08:00] [INFO] [trace.go:152] ["trace[1371381750] range"] [detail="{range_begin:/pd/7204350633511442718/gc/safe_point; range_end:; }"] [duration=9.999999936s] [start=2023/06/15 15:24:36.148 +08:00] [end=2023/06/15 15:24:46.148 +08:00] [steps="[\"trace[1371381750] 'agreement among raft nodes before linearized reading' (duration: 9.999972719s)\"]"]
[2023/06/15 15:24:46.148 +08:00] [ERROR] [etcdutil.go:126] ["load from etcd meet error"] [key=/pd/7204350633511442718/gc/safe_point] [error="[PD:etcd:ErrEtcdKVGet]context deadline exceeded: context deadline exceeded"]
[2023/06/15 15:24:46.148 +08:00] [WARN] [util.go:163] ["apply request took too long"] [took=9.999984831s] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/pd/7204350633511442718/gc/safe_point\" "] [response=] [error="context canceled"]
[2023/06/15 15:24:46.148 +08:00] [WARN] [util.go:163] ["apply request took too long"] [took=9.999982091s] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/pd/7204350633511442718/gc/safe_point\" "] [response=] [error="context canceled"]
[2023/06/15 15:24:46.148 +08:00] [INFO] [trace.go:152] ["trace[706522399] range"] [detail="{range_begin:/pd/7204350633511442718/gc/safe_point; range_end:; }"] [duration=10.000011961s] [start=2023/06/15 15:24:36.148 +08:00] [end=2023/06/15 15:24:46.148 +08:00] [steps="[\"trace[706522399] 'agreement among raft nodes before linearized reading' (duration: 9.999989085s)\"]"]
[2023/06/15 15:24:46.148 +08:00] [INFO] [trace.go:152] ["trace[567413646] range"] [detail="{range_begin:/pd/7204350633511442718/gc/safe_point; range_end:; }"] [duration=10.000082512s] [start=2023/06/15 15:24:36.148 +08:00] [end=2023/06/15 15:24:46.148 +08:00] [steps="[\"trace[567413646] 'agreement among raft nodes before linearized reading' (duration: 9.99999037s)\"]"]
[2023/06/15 15:24:46.149 +08:00] [WARN] [etcdutil.go:121] ["kv gets too slow"] [request-key=/pd/7204350633511442718/gc/safe_point] [cost=10.000056983s] [error="context deadline exceeded"]
[2023/06/15 15:24:46.149 +08:00] [ERROR] [etcdutil.go:126] ["load from etcd meet error"] [key=/pd/7204350633511442718/gc/safe_point] [error="[PD:etcd:ErrEtcdKVGet]context deadline exceeded: context deadline exceeded"]
[2023/06/15 15:24:46.149 +08:00] [WARN] [util.go:163] ["apply request took too long"] [took=9.999967982s] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/pd/7204350633511442718/gc/safe_point\" "] [response=] [error="context canceled"]
[2023/06/15 15:24:46.149 +08:00] [WARN] [etcdutil.go:121] ["kv gets too slow"] [request-key

| username: tidb菜鸟一只 | Original post link

Run tiup cluster display cluster_name to check whether any TiDB or TiKV instance is down, and then check the corresponding logs.
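
For reference, a minimal check along these lines (assuming a cluster named tidb-test and using the deploy path that appears later in this thread; adjust both to your environment):

tiup cluster display tidb-test   # lists every instance with its status (Up / Down / Disconnected)
tail -n 200 /u01/tidb/tidb-deploy/tikv-20161/log/tikv.log   # inspect the log of an instance that is not up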

| username: xiaofagn | Original post link

Executing tiup cluster start xx can bring up the cluster, but after a while, it goes down again.

tikv.log:

[2023/06/15 16:16:09.399 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"]
[2023/06/15 16:16:09.532 +08:00] [INFO] [util.rs:598] ["connecting to PD endpoint"] [endpoints=http://tidb02:12379]
[2023/06/15 16:16:09.536 +08:00] [INFO] [util.rs:598] ["connecting to PD endpoint"] [endpoints=http://tidb03:12379]
[2023/06/15 16:16:09.540 +08:00] [INFO] [util.rs:598] ["connecting to PD endpoint"] [endpoints=http://tidb01:12379]
[2023/06/15 16:16:09.543 +08:00] [WARN] [client.rs:149] ["failed to update PD client"] [error="Other(\"[components/pd_client/src/util.rs:682]: failed to connect to [name: \\\"pd-2\\\" member_id: 2607808339097435534 peer_urls: \\\"http://tidb02:12380\\\" client_urls: \\\"http://tidb02:12379\\\", name: \\\"pd-1\\\" member_id: 3712024390580092245 peer_urls: \\\"http://tidb01:12380\\\" client_urls: \\\"http://tidb01:12379\\\", name: \\\"pd-3\\\" member_id: 14098604882267672469 peer_urls: \\\"http://tidb03:12380\\\" client_urls: \\\"http://tidb03:12379\\\"]\")"]
[2023/06/15 16:16:09.669 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"]
[2023/06/15 16:16:09.669 +08:00] [INFO] [util.rs:598] ["connecting to PD endpoint"] [endpoints=http://tidb02:12379]
[2023/06/15 16:16:09.670 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"]
[2023/06/15 16:16:09.670 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"]
[2023/06/15 16:16:09.670 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"]
[2023/06/15 16:16:09.670 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"]
[2023/06/15 16:16:09.670 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"]
[2023/06/15 16:16:09.670 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"]
[2023/06/15 16:16:09.670 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"]
[2023/06/15 16:16:09.670 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"]
[2023/06/15 16:16:09.670 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"]
[2023/06/15 16:16:09.670 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"]
[2023/06/15 16:16:09.670 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"not leader\", details: [] }))"]
[2023/06/15 16:16:09.678 +08:00] [INFO] [util.rs:598] ["connecting to PD endpoint"] [endpoints=http://tidb03:12379]
[2023/06/15 16:16:09.681 +08:00] [INFO] [util.rs:598] ["connecting to PD endpoint"] [endpoints=http://tidb01:12379]
[2023/06/15 16:16:09.844 +08:00] [INFO] [util.rs:598] ["connecting to PD endpoint"] [endpoints=http://tidb02:12379]
[2023/06/15 16:16:09.848 +08:00] [INFO] [util.rs:598] ["connecting to PD endpoint"] [endpoints=http://tidb03:12379]
[2023/06/15 16:16:09.851 +08:00] [INFO] [util.rs:598] ["connecting to PD endpoint"] [endpoints=http://tidb01:12379]
[2023/06/15 16:16:09.853 +08:00] [WARN] [client.rs:149] ["failed to update PD client"] [error="Other(\"[components/pd_client/src/util.rs:682]: failed to connect to [name: \\\"pd-2\\\" member_id: 2607808339097435534 peer_urls: \\\"http://tidb02:12380\\\" client_urls: \\\"http://tidb02:12379\\\", name: \\\"pd-1\\\" member_id: 3712024390580092245 peer_urls: \\\"http://tidb01:12380\\\" client_urls: \\\"http://tidb01:12379\\\", name: \\\"pd-3\\\" member_id: 14098604882267672469 peer_urls: \\\"http://tidb03:12380\\\" client_urls: \\\"http://tidb03:12379\\\"]\")"]

| username: Anna | Original post link

Check the resource usage and network.

| username: xiaofagn | Original post link

Monitoring shows that disk I/O utilization on two of the machines is very high, at 100%.
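
For reference, a generic way to confirm this kind of disk saturation on a node (not from the original post) is iostat from the sysstat package:

iostat -x 1 5   # extended per-device stats every second; %util near 100% means the disk is saturated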

| username: redgame | Original post link

Could there be an issue with the time?

| username: zhanggame1 | Original post link

High I/O is probably not the cause. First check whether the network is working properly, whether there is any packet loss, and whether time synchronization is normal. Restart the servers if necessary.
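
For example, generic Linux checks along these lines (the hostname is taken from this thread; adjust as needed):

ping -c 20 tidb02    # look at the 'packet loss' figure in the summary
timedatectl status   # shows whether the system clock is reported as synchronized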

| username: xiaofagn | Original post link

The clocks on the nodes are not consistent; they differ on the order of seconds.

| username: 我是咖啡哥 | Original post link

local node might have slow network!!

| username: 有猫万事足 | Original post link

The ntpd service might not have started, right?
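
A quick way to check, assuming the nodes run ntpd (use chronyd instead if that is what is deployed):

systemctl status ntpd   # is the NTP daemon running?
ntpstat                 # is the clock actually synchronized to an NTP server?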

| username: Hacker_xUwtuKxa | Original post link

It looks like a PD issue. From the logs above, both loading from etcd and saving to etcd are failing on the PD side.
[2023/06/15 15:24:46.149 +08:00] [ERROR] [etcdutil.go:126] ["load from etcd meet error"] [key=/pd/7204350633511442718/gc/safe_point] [error="[PD:etcd:ErrEtcdKVGet]context deadline exceeded: context deadline exceeded"]
[2023/06/15 15:24:47.575 +08:00] [ERROR] [etcd_kv.go:97] ["save to etcd meet error"] [key=/pd/7204350633511442718/raft/s/00000000000000000002] [value="\u0008\u0002\u0012\u000ctidb01:20161\"\n\n\u0004zone\u0012\u0002z1\"\n\n\u0004rack\u0012\u0002r1\"\u000e\n\u0004host\u0012\u0006tidb01\"\u0008\n\u0002dc\u0012\u0002d1*\u00056.5.02\u000ctidb01:20161:\u000ctidb01:20181B(47b81680f75adc4b7200480cea5dbe46ae07c4b5H䭪\ufffd\u0006R$/u01/tidb/tidb-deploy/tikv-20161/binX\ufffd\ufffd\ufffd\ufffd戱\ufffd\u0017h\u0001"] [error="[PD:etcd:ErrEtcdKVPut]context deadline exceeded: context deadline exceeded"]
Use tiup to check the status of the PD nodes. How many PD nodes have you configured? Can you log in with pd-ctl and check whether a PD leader transfer can be done?
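
For reference, pd-ctl can be run through tiup roughly like this (a sketch; the endpoint, version tag, and member name are taken from the logs in this thread and may need adjusting):

tiup ctl:v6.5.0 pd -u http://tidb01:12379 member                       # list PD members
tiup ctl:v6.5.0 pd -u http://tidb01:12379 member leader show           # show the current PD leader
tiup ctl:v6.5.0 pd -u http://tidb01:12379 member leader transfer pd-2  # try moving the leader to pd-2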

| username: xiaofagn | Original post link

Heavy load on the virtualization layer underneath the VMs prevented the cluster heartbeats from getting through, so the cluster was stuck in constant leader elections. After migrating to other underlying hardware, the cluster started and returned to normal.
Thank you for your response.

| username: 中国电信TIKV | Original post link

We are seeing the same error on high-performance machines, and resource usage is not high. It's the same version. Could it be a version issue?

| username: 中国电信TIKV | Original post link

I don’t know if this fix in version 7.1 is specifically for this issue.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.