Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.Original topic: 5.0.1 tikv-server hung住,节点先状态从 disconnect → down

[TiDB Usage Environment] Production Environment
[TiDB Version] 5.0.1
[Encountered Issues: Problem Phenomenon and Impact]
- tikv-server. CPU usage increased, business volume increased. However, overall QPS is not high.
Alert Details: tikv-02 CPU usage greater than 90% (current usage: 92.60833334177732%)
Alert Time: 2023-03-10 14:07:52
-
Checking traffic situation:
Found that QPS is not very high
-
Cluster tikv load started to rise
Alert Details: tikv-04 CPU usage greater than 80% (current usage: 86.6583333350718%)
Alert Time: 2023-03-10 14:12:52
Checking tikv information, found tikv hung.
- Logging into the faulty tikv machine, server status is normal (running)
[root@tikv-02 log]# systemctl -l|grep tikv
tikv-20160.service loaded active running tikv service
systemctl status tikv-20160.service
- Checking cluster status
5.1 Checking cluster status: disconnected (screenshot not retained)
5.2 Trying telnettelnet tikv-02 20160
stuck
5.3 Trying to start tikv, checking cluster status still down
tiup cluster start cluster_1 -N tikv-02
5.4 tiup restart tikv node, restart timeout stuck
tiup cluster restart cluster_1 -N tikv-02
-
Restored after
systemctl restart tikv-20160.service
.
-
Collected logs found:
- tikv.log
[2023/03/10 14:10:13.187 +08:00] [INFO] [tracker.rs:248] [slow-query] [perf_stats.internal_delete_skipped_count=7479] [perf_stats.internal_key_skipped_count=43869] [perf_stats.block_read_byte=0] [perf_stats.block_read_count=0] [perf_stats.block_cache_hit_count=126282] [scan.range.first="Some(start: 7480000000000001135F72800000006D92D512 end: 7480000000000001135F72800000006D92D513)"] [scan.ranges=6442] [scan.total=27874] [scan.processed=13921] [scan.is_desc=false] [tag=select] [table_id=275] [txn_start_ts=439989950792859665] [total_suspend_time=3.247s] [total_process_time=483ms] [handler_build_time=1ms] [wait_time.snapshot=0ns] [wait_time.schedule=2ms] [wait_time=2ms] [total_lifetime=3.733s] [remote_host=ipv4: 192.168.8.183:5668] [region_id=43066374]
[2023/03/10 14:10:13.211 +08:00] [INFO] [tracker.rs:248] [slow-query] [perf_stats.internal_delete_skipped_count=0] [perf_stats.internal_key_skipped_count=3730] [perf_stats.block_read_byte=0] [perf_stats.block_read_count=0] [perf_stats.block_cache_hit_count=46639] [scan.range.first="Some(start: 7480000000000001135F72800000006C12E805 end: 7480000000000001135F72800000006C12E806)"] [scan.ranges=2546] [scan.total=4822] [scan.processed=4000] [scan.is_desc=false] [tag=select] [table_id=275] [txn_start_ts=439989960098971676] [total_suspend_time=3.218s] [total_process_time=164ms] [handler_build_time=1ms] [wait_time.snapshot=0ns] [wait_time.schedule=2ms] [wait_time=2ms] [total_lifetime=3.385s] [remote_host=ipv4: 192.168.8.120:61611] [region_id=42864517]
[2023/03/10 14:10:13.247 +08:00] [INFO] [tracker.rs:248] [slow-query] [perf_stats.internal_delete_skipped_count=0] [perf_stats.internal_key_skipped_count=9141] [perf_stats.block_read_byte=0] [perf_stats.block_read_count=0] [perf_stats.block_cache_hit_count=62716] [scan.range.first="Some(start: 7480000000000001135F72800000006F32B37C end: 7480000000000001135F72800000006F32B381)"] [scan.ranges=3170] [scan.total=8565] [scan.processed=6916] [scan.is_desc=false] [tag=select] [table_id=275] [txn_start_ts=439989935169077294] [total_suspend_time=3.223s] [total_process_time=292ms] [handler_build_time=1ms] [wait_time.snapshot=0ns] [wait_time.schedule=1ms] [wait_time=1ms] [total_lifetime=3.517s] [remote_host=ipv4: 192.168.8.120:61607] [region_id=43326515]
[2023/03/10 14:10:13.247 +08:00] [INFO] [tracker.rs:248] [slow-query] [perf_stats.internal_delete_skipped_count=0] [perf_stats.internal_key_skipped_count=4046] [perf_stats.block_read_byte=0] [perf_stats.block_read_count=0] [perf_stats.block_cache_hit_count=26988] [scan.range.first="Some(start: 7480000000000001135F7280000000721C3FA0 end: 7480000000000001135F7280000000721C3FA3)"] [scan.ranges=1565] [scan.total=3756] [scan.processed=3420] [scan.is_desc=false] [tag=select] [table_id=275] [txn_start_ts=439989943190683668] [total_suspend_time=3.25s] [total_process_time=98ms] [handler_build_time=1ms] [wait_time.snapshot=0ns] [wait_time.schedule=3ms] [wait_time=3ms] [total_lifetime=3.352s] [remote_host=ipv4: 192.168.8.120:61609] [region_id=43579051]
[2023/03/10 14:10:13.256 +08:00] [INFO] [tracker.rs:248] [slow-query] [perf_stats.internal_delete_skipped_count=0] [perf_stats.internal_key_skipped_count=18283] [perf_stats.block_read_byte=0] [perf_stats.block_read_count=0] [perf_stats.block_cache_hit_count=163988] [scan.range.first="Some(start: 7480000000000001135F728000000071FCB7D9 end: 7480000000000001135F728000000071FCB7DA)"] [scan.ranges=9455] [scan.total=19658] [scan.processed=17535] [scan.is_desc=false] [tag=select] [table_id=275] [txn_start_ts=439989981581672464] [total_suspend_time=3.257s] [total_process_time=611ms] [handler_build_time=1ms] [wait_time.snapshot=0ns] [wait_time.schedule=12ms] [wait_time=12ms] [total_lifetime=3.881s] [remote_host=ipv4: 192.168.8.120:61613] [region_id=43573861]
[2023/03/10 14:10:13.256 +08:00] [INFO] [tracker.rs:248] [slow-query] [perf_stats.internal_delete_skipped_count=0] [perf_stats.internal_key_skipped_count=10332] [perf_stats.block_read_byte=0] [perf_stats.block_read_count=0] [perf_stats.block_cache_hit_count=98] [scan.range.first="Some(start: 7480000000000001135F72800000006C3D5883 end: 7480000000000001135F72800000006C3D6147)"] [scan.ranges=2] [scan.total=5169] [scan.processed=5167] [scan.is_desc=false] [tag=select] [table_id=275] [txn_start_ts=439989960098971676] [total_suspend_time=3.171s] [total_process_time=9ms] [handler_build_time=0ns] [wait_time.snapshot=0ns] [wait_time.schedule=56ms] [wait_time=56ms] [total_lifetime=3.236s] [remote_host=ipv4: 192.168.8.120:61611] [region_id=42880285]
[2023/03/10 14:10:13.256 +08:00] [INFO] [tracker.rs:248] [slow-query] [perf_stats.internal_delete_skipped_count=40] [perf_stats.internal_key_skipped_count=864] [perf_stats.block_read_byte=0] [perf_stats.block_read_count=0] [perf_stats.block_cache_hit_count=13914] [scan.range.first="Some(start: 7480000000000001135F72800000006E94BDEF end: 7480000000000001135F72800000006E94BDF0)"] [scan.ranges=804] [scan.total=1306] [scan.processed=1126] [scan.is_desc=false] [tag=select] [table_id=275] [txn_start_ts=439989940870709252] [total_suspend_time=3.168s] [total_process_time=50ms] [handler_build_time=1ms] [wait_time.snapshot=0ns] [wait_time.schedule=34ms] [wait_time=34ms] [total_lifetime=3.253s] [remote_host=ipv4: 192.168.8.120:61609] [region_id=43206949]
....
...
[2023/03/10 14:11:54.630 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:20160: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:11:54.797 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:20160: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:11:55.091 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:20160: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:03.627 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:20160: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.402 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:5666: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.402 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:5667: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.404 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:5668: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.407 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:61609: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.408 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:61611: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.409 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:61607: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.410 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:61613: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.426 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:21566: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.428 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:52567: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.428 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:1897: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.430 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:30194: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.430 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:33917: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.431 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:5665: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.433 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:22966: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.434 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:49120: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:08.165 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:20160: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:08.165 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:20160: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:13.165 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:20160: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:23.165 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:2379: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:13:16.490 +08:00] [INFO] [util.rs:536] ["connecting to PD endpoint"] [endpoints=http://172.16.25.179:2379]
- tidb_err.log
<jemalloc>: Malformed conf string
<jemalloc>: Malformed conf string
<jemalloc>: Malformed conf string
<jemalloc>: Malformed conf string
Considering that jemalloc is related to memory, it was found that tikv had a slight increase in memory usage around 14:10, but no OOM occurred.
Question:
- Hope experts can help explain why tikv hung and how to troubleshoot.
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]