TiKV failed to update max timestamp for region xx due to get timestamp timeout issue

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv failed to update max timestamp for region xx get timestamp timeout 问题

| username: Inkjade

[TiDB Usage Environment] Test Environment
[TiDB Version] 6.5.2
[Reproduction Path] TiDB cluster has no disk space, restart TiDB cluster
[Encountered Problem: Problem Phenomenon and Impact]
tikv.log continuously outputs the following logs: Cluster unavailable
[2024/03/31 23:10:28.240 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 261033: Pd(Other("[components/pd_client/src/client.rs:981]: get timestamp timeout"))”]
[2024/03/31 23:10:28.240 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 261035: Pd(Other("[components/pd_client/src/client.rs:981]: get timestamp timeout"))”]
[2024/03/31 23:10:28.240 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 261031: Pd(Other("[components/pd_client/src/client.rs:981]: get timestamp timeout"))”]
[2024/03/31 23:10:28.240 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 261029: Pd(Other("[components/pd_client/src/client.rs:981]: get timestamp timeout"))”]
[2024/03/31 23:10:28.240 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 261025: Pd(Other("[components/pd_client/src/client.rs:981]: get timestamp timeout"))”]
[2024/03/31 23:10:28.240 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 261027: Pd(Other("[components/pd_client/src/client.rs:981]: get timestamp timeout"))”]
[2024/03/31 23:10:28.240 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 261023: Pd(Other("[components/pd_client/src/client.rs:981]: get timestamp timeout"))”]
[2024/03/31 23:10:28.240 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 261021: Pd(Other("[components/pd_client/src/client.rs:981]: get timestamp timeout"))”]
[2024/03/31 23:10:28.240 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 261019: Pd(Other("[components/pd_client/src/client.rs:981]: get timestamp timeout"))”]
[2024/03/31 23:10:28.240 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 261017: Pd(Other("[components/pd_client/src/client.rs:981]: get timestamp timeout"))”]
[2024/03/31 23:10:28.240 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 261015: Pd(Other("[components/pd_client/src/client.rs:981]: get timestamp timeout"))”]
[2024/03/31 23:10:28.240 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 261013: Pd(Other("[components/pd_client/src/client.rs:981]: get timestamp timeout"))”]
[2024/03/31 23:10:28.240 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 261011: Pd(Other("[components/pd_client/src/client.rs:981]: get timestamp timeout"))”]
[2024/03/31 23:10:28.240 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 261009: Pd(Other("[components/pd_client/src/client.rs:981]: get timestamp timeout"))”]
[2024/03/31 23:10:28.240 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 261007: Pd(Other("[components/pd_client/src/client.rs:981]: get timestamp timeout"))”]

First: Clear tikv logs to free up disk space.

Attempt Method 1:
– Get machine id
cat /data/tidb-deploy/pd-2379/log/pd.log |grep ‘init cluster id’

[2024/03/31 13:25:37.607 +08:00] [INFO] [server.go:384] [“init cluster id”] [cluster-id=7166168149192488053]

– Get allocated ID
[webapp@lg-test-shuabao log]$ cat /data/tidb-deploy/pd-2379/log/pd.log| grep “idAllocator allocates a new id” | awk -F’=’ ‘{print $2}’ | awk -F’]’ ‘{print $1}’ | sort -r -n | head -n 1
1417000

Rebuild pd
./pd-recover -endpoints http://127.0.0.1:2379 -cluster-id 7166168149192488053 -alloc-id 1417000

Error still exists
[2024/03/31 23:10:28.240 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 261033: Pd(Other("[components/pd_client/src/client.rs:981]: get timestamp timeout"))”]

No exception found when getting tso

Attempt Method 2
Attempt to upgrade minor version from 6.5.2 to 6.5.8, restart cluster, problem still exists.
Start cluster: tikv node logs data crazily
[2024/03/31 23:10:28.240 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 261007: Pd(Other("[components/pd_client/src/client.rs:981]: get timestamp timeout"))”]

[Resource Configuration] Enter TiDB Dashboard - Cluster Info - Hosts, screenshot this page
[Attachment: Screenshot/Log/Monitoring]
Mixed deployment of tidb, pd, and tikv on the same virtual machine

| username: Billmay表妹 | Original post link

First, scale out 2 KV nodes to see if the cluster is functioning normally.

| username: Billmay表妹 | Original post link

How are the virtual machine resources?

| username: Inkjade | Original post link

Peak value: Peak CPU usage of 7 cores

| username: 像风一样的男子 | Original post link

What are the symptoms of cluster unavailability?

| username: Inkjade | Original post link

Because it is a mixed deployment, TiKV is frantically outputting logs, with a peak CPU usage of 700%. The current test machine has 12 cores, and the IOPS is very high, outputting about 4GB of log files every minute. The query latency is high; a simple SELECT statement takes nearly 2 seconds.

Log file screenshot:

Log content:
[2024/04/01 00:48:04.289 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 732777: Pd(Other("[components/pd_client/src/client.rs:981]: get timestamp timeout"))”]
[2024/04/01 00:48:04.289 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 747465: Pd(Other("[components/pd_client/src/client.rs:981]: get timestamp timeout"))”]
[2024/04/01 00:48:04.289 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 733273: Pd(Other("[components/pd_client/src/client.rs:981]: get timestamp timeout"))”]
[2024/04/01 00:48:04.289 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 741235: Pd(Other("[components/pd_client/src/client.rs:981]: get timestamp timeout"))”]
[2024/04/01 00:48:04.289 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 738057: Pd(Other("[components/pd_client/src/client.rs:981]: get timestamp timeout"))”]
[2024/04/01 00:48:04.289 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 741129: Pd(Other("[components/pd_client/src/client.rs:981]: get timestamp timeout"))”]
[2024/04/01 00:48:04.289 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 735223: Pd(Other("[components/pd_client/src/client.rs:981]: get timestamp timeout"))”]
[2024/04/01 00:48:04.289 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 719255: Pd(Other("[components/pd_client/src/client.rs:981]: get timestamp timeout"))”]
[2024/04/01 00:48:04.289 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 734127: Pd(Other("[components/pd_client/src/client.rs:981]: get timestamp timeout"))”]
[2024/04/01 00:48:04.289 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 734201: Pd(Other("[components/pd_client/src/client.rs:981]: get timestamp timeout"))”]

There is not much valuable log information.

| username: tidb菜鸟一只 | Original post link

For testing, just use tiup playground.

| username: TIDB-Learner | Original post link

Adjust the log level to error. Try not to output such large logs for now and see if the disk I/O has become a bottleneck.

| username: changpeng75 | Original post link

Are all the nodes synchronized?

| username: xiaoqiao | Original post link

Check if the various metrics in Grafana are normal and identify any anomalies.

| username: xiao | Original post link

Has the issue been resolved? I encountered the same problem.

| username: Inkjade | Original post link

Currently, the test database has been rebuilt!

| username: xiao | Original post link

Sure, thank you~

| username: mono | Original post link

It should not be a time synchronization issue. I am running the service on the same machine. I encountered a similar problem. Due to the testing environment, there is no monitoring. I suspect it was caused by a full disk. After cleaning up the logs, the cluster service started working.

[2024/06/12 17:51:10.428 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 15101: Pd(Other("[components/pd_client/src/tso.rs:97]: TimestampRequest channel is closed"))”]
[2024/06/12 17:51:10.428 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 15101: Pd(Other("[components/pd_client/src/tso.rs:97]: TimestampRequest channel is closed"))”]
[2024/06/12 17:51:10.428 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 15101: Pd(Other("[components/pd_client/src/tso.rs:97]: TimestampRequest channel is closed"))”]
[2024/06/12 17:51:10.428 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 15101: Pd(Other("[components/pd_client/src/tso.rs:97]: TimestampRequest channel is closed"))”]
[2024/06/12 17:51:10.428 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 15101: Pd(Other("[components/pd_client/src/tso.rs:97]: TimestampRequest channel is closed"))”]
[2024/06/12 17:51:10.428 +08:00] [WARN] [pd.rs:1707] [“failed to update max timestamp for region 15101: Pd(Other("[components/pd_client/src/tso.rs:97]: TimestampRequest channel is closed"))”]