TiKV Suddenly Unable to Read and Write Normally

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv突然无法正常读写

| username: 月明星稀

TiKV suddenly cannot read or write, and the PD error log is as follows. Experts, please take a look and advise on how to resolve this.

[2023/11/16 17:17:32.172 +08:00] [WARN] [util.go:163] ["apply request took too long"] [took=313.580658ms] [expected-duration=100ms] [prefix=] [request="header:<ID:1224288430602737517 username:\"pd.server.cn\" auth_revision:1 > txn:<compare:<> success:<request_put:<key:\"/pd/7283055360825016244/raft/s/00000000000000000005\" value_size:167 >> failure:<>>"] [response=size:20]
[2023/11/16 17:17:32.172 +08:00] [INFO] [trace.go:152] ["trace[783287637] linearizableReadLoop"] [detail="{readStateIndex:26719986; appliedIndex:26719985; }"] [duration=283.806938ms] [start=2023/11/16 17:17:31.888 +08:00] [end=2023/11/16 17:17:32.172 +08:00] [steps="[\"trace[783287637] 'read index received' (duration: 463.039µs)\",\"trace[783287637] 'applied index is now lower than readState.Index' (duration: 283.341135ms)\"]"]
[2023/11/16 17:17:32.172 +08:00] [WARN] [util.go:163] ["apply request took too long"] [took=283.950121ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/pd/7283055360825016244/config\" "] [response="range_response_count:1 size:3843"]
[2023/11/16 17:17:32.172 +08:00] [INFO] [trace.go:152] ["trace[102465159] range"] [detail="{range_begin:/pd/7283055360825016244/config; range_end:; response_count:1; response_revision:26718721; }"] [duration=284.030995ms] [start=2023/11/16 17:17:31.888 +08:00] [end=2023/11/16 17:17:32.172 +08:00] [steps="[\"trace[102465159] 'agreement among raft nodes before linearized reading' (duration: 283.923158ms)\"]"]
[2023/11/16 17:17:32.173 +08:00] [WARN] [util.go:163] ["apply request took too long"] [took=203.547403ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/tidb/store/gcworker/saved_safe_point\" "] [response="range_response_count:0 size:7"]
[2023/11/16 17:17:32.173 +08:00] [INFO] [trace.go:152] ["trace[1270457873] range"] [detail="{range_begin:/tidb/store/gcworker/saved_safe_point; range_end:; response_count:0; response_revision:26718721; }"] [duration=203.659537ms] [start=2023/11/16 17:17:31.969 +08:00] [end=2023/11/16 17:17:32.173 +08:00] [steps="[\"trace[1270457873] 'agreement among raft nodes before linearized reading' (duration: 203.538012ms)\"]"]
[2023/11/16 17:17:37.832 +08:00] [WARN] [util.go:163] ["apply request took too long"] [took=326.211425ms] [expected-duration=100ms] [prefix=] [request="header:<ID:1224288430602737588 username:\"pd.server.cn\" auth_revision:1 > txn:<compare:<> success:<request_put:<key:\"/pd/7283055360825016244/raft/s/00000000000000000001\" value_size:169 >> failure:<>>"] [response=size:20]
[2023/11/16 17:18:10.124 +08:00] [INFO] [trace.go:152] ["trace[826103753] linearizableReadLoop"] [detail="{readStateIndex:26720194; appliedIndex:26720194; }"] [duration=235.440112ms] [start=2023/11/16 17:18:09.889 +08:00] [end=2023/11/16 17:18:10.124 +08:00] [steps="[\"trace[826103753] 'read index received' (duration: 235.436211ms)\",\"trace[826103753] 'applied index is now lower than readState.Index' (duration: 3.217µs)\"]"]
[2023/11/16 17:18:10.125 +08:00] [WARN] [util.go:163] ["apply request took too long"] [took=235.622061ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/pd/7283055360825016244/config\" "] [response="range_response_count:1 size:3843"]
[2023/11/16 17:18:10.125 +08:00] [INFO] [trace.go:152] ["trace[2097090256] range"] [detail="{range_begin:/pd/7283055360825016244/config; range_end:; response_count:1; response_revision:26718928; }"] [duration=235.715696ms] [start=2023/11/16 17:18:09.889 +08:00] [end=2023/11/16 17:18:10.125 +08:00] [steps="[\"trace[2097090256] 'agreement among raft nodes before linearized reading' (duration: 235.584921ms)\"]"]
[2023/11/16 17:18:10.552 +08:00] [WARN] [util.go:163] ["apply request took too long"] [took=115.088823ms] [expected-duration=100ms] [prefix=] [request="header:<ID:1224288430602737898 username:\"pd.server.cn\" auth_revision:1 > txn:<compare:<> success:<request_put:<key:\"/pd/7283055360825016244/raft/s/00000000000000000001\" value_size:169 >> failure:<>>"] [response=size:20]

| username: tidb菜鸟一只 | Original post link

First, check if there are any issues with the network.

| username: 月明星稀 | Original post link

No issues with the network.

| username: dba远航 | Original post link

It feels like the storage location has become read-only, so writes are being rejected.

| username: WalterWj | Original post link

Same feeling, +1.

| username: Fly-bird | Original post link

How about trying to restart the KV node?

| username: 大飞哥online | Original post link

The logs show that apply requests took longer than the expected duration, so request processing is being delayed.

  1. Check the network connection (a quick connectivity probe is sketched below);
  2. Check the load situation.
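
For point 1, here is a minimal sketch of a connectivity/latency probe in Python. The host names and ports are assumptions; substitute your actual PD and TiKV addresses.

```python
import socket
import time

# Hypothetical endpoints -- substitute your real PD / TiKV host:port pairs.
ENDPOINTS = [("pd-host", 2379), ("tikv-host", 20160)]

for host, port in ENDPOINTS:
    start = time.monotonic()
    try:
        # A plain TCP connect; its round-trip time is a crude latency indicator.
        with socket.create_connection((host, port), timeout=2):
            rtt_ms = (time.monotonic() - start) * 1000
        print(f"{host}:{port} reachable, TCP connect took {rtt_ms:.1f} ms")
    except OSError as e:
        print(f"{host}:{port} UNREACHABLE: {e}")
```

If the connect times are stable and low, focus on disk and load instead of the network.
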
| username: 大飞哥online | Original post link

Check the TiDB logs, slow query logs, and error logs as well.

| username: 月明星稀 | Original post link

Restarting TiKV does not recover it either.

| username: 月明星稀 | Original post link

After clearing the data it recovers, but I don't know what the cause is???

| username: 月明星稀 | Original post link

When the problem occurs, the TiKV and PD processes still show as "Up" when checked with TiUP, but data cannot be read or written.

| username: 像风一样的男子 | Original post link

Is there no error log? It feels like there is a problem with the disk.

| username: Jolyne | Original post link

It seems the "read-only range" error is due to a disk hardware issue. Try creating a file or directory in the corresponding data directory to check (a small write probe is sketched below).
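
A minimal write probe for that check, assuming the TiKV data directory is /data/tikv/data (adjust to your deployment):

```python
import os
import time

# Assumed TiKV data directory -- replace with your actual deploy path.
DATA_DIR = "/data/tikv/data"

def check_writable(path: str) -> None:
    """Create, fsync, and remove a small file to confirm the disk accepts writes."""
    probe = os.path.join(path, ".write_probe")
    start = time.monotonic()
    try:
        with open(probe, "wb") as f:
            f.write(b"x" * 4096)
            f.flush()
            os.fsync(f.fileno())  # force the data down to the device
    except OSError as e:
        print(f"write to {path} FAILED: {e}")  # e.g. read-only filesystem or I/O error
        return
    finally:
        if os.path.exists(probe):
            os.remove(probe)
    print(f"write + fsync to {path} OK in {(time.monotonic() - start) * 1000:.1f} ms")

check_writable(DATA_DIR)
```

If the open or fsync fails with a read-only filesystem or I/O error, that points at the disk or filesystem rather than TiKV itself.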

| username: TiDBer_小阿飞 | Original post link

What does this sentence mean? Is there a consistency issue between the Leader and Follower?

| username: TiDBer_小阿飞 | Original post link

This is obviously high latency.
You can check the following points:

  1. Network: Use common tools such as ping/traceroute/mtr, ethtool, ifconfig/ip, netstat, and tcpdump to test connectivity and latency, and check for errors, especially packet loss.
  2. Disk I/O: Monitor WAL-related latency metrics such as disk_wal_fsync and check the P99 latency (a rough fsync latency probe is sketched after this list).
  3. Expensive requests: Check whether there are requests with large payloads or requests that traverse a large number of keys.
  4. Capacity bottleneck: Too many write requests can degrade the performance of linearizable read requests, etc.
  5. Node configuration: A busy CPU delaying request processing, insufficient memory, etc.
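
For point 2, a rough fsync latency probe could look like the sketch below. The path is an assumption; run it against the disk that actually holds the PD data directory.

```python
import os
import statistics
import tempfile
import time

WAL_DIR = "/data/pd/data"  # assumed PD data directory -- adjust to your deployment
SAMPLES = 200

def fsync_latencies(path: str, n: int) -> list[float]:
    """Append 4 KiB and fsync n times, returning each fsync latency in milliseconds."""
    lat_ms = []
    fd, tmp = tempfile.mkstemp(dir=path)
    try:
        for _ in range(n):
            os.write(fd, b"x" * 4096)
            start = time.monotonic()
            os.fsync(fd)  # the same syscall etcd issues for every WAL entry
            lat_ms.append((time.monotonic() - start) * 1000)
    finally:
        os.close(fd)
        os.remove(tmp)
    return lat_ms

lat = sorted(fsync_latencies(WAL_DIR, SAMPLES))
p99 = lat[int(len(lat) * 0.99) - 1]
print(f"fsync p50={statistics.median(lat):.2f} ms  p99={p99:.2f} ms  max={lat[-1]:.2f} ms")
```

etcd's general guidance is that WAL fsync P99 should stay well under 10 ms; numbers in the hundreds of milliseconds would line up with the "apply request took too long" warnings above.
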
| username: swino | Original post link

Check network latency and disk read/write speed.

| username: db_user | Original post link

Check the disk usage. Once it gets high enough, write stalls can occur.
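
A quick way to check usage on the TiKV node (the mount path below is an assumption):

```python
import shutil

# Assumed TiKV data mount point -- adjust to your deployment.
path = "/data/tikv/data"

usage = shutil.disk_usage(path)
pct_used = usage.used / usage.total * 100
print(f"{path}: {pct_used:.1f}% used, {usage.free / 2**30:.1f} GiB free")
```

Also note that TiKV keeps a disk reservation (storage.reserve-space, 5 GiB by default) and stops accepting normal writes once free space falls into that reserve, which would match a "processes are Up but nothing can be written" symptom.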