5.0.1 tikv-server hangs, node status changes from disconnect -> down

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 5.0.1 tikv-server hung住,节点先状态从 disconnect → down

| username: TiDBer_yyy

[TiDB Usage Environment] Production Environment
[TiDB Version] 5.0.1
[Encountered Issues: Problem Phenomenon and Impact]

  1. tikv-server. CPU usage increased, business volume increased. However, overall QPS is not high.
Alert Details: tikv-02 CPU usage greater than 90% (current usage: 92.60833334177732%)
Alert Time: 2023-03-10 14:07:52
  1. Checking traffic situation:
    Found that QPS is not very high

  2. Cluster tikv load started to rise

Alert Details: tikv-04 CPU usage greater than 80% (current usage: 86.6583333350718%)
Alert Time: 2023-03-10 14:12:52

Checking tikv information, found tikv hung.

  1. Logging into the faulty tikv machine, server status is normal (running)
[root@tikv-02 log]# systemctl -l|grep tikv
tikv-20160.service                                                             loaded active     running      tikv service

systemctl status tikv-20160.service

  1. Checking cluster status
    5.1 Checking cluster status: disconnected (screenshot not retained)

    5.2 Trying telnet telnet tikv-02 20160 stuck

5.3 Trying to start tikv, checking cluster status still down

tiup cluster start cluster_1 -N tikv-02

5.4 tiup restart tikv node, restart timeout stuck

tiup cluster restart  cluster_1 -N tikv-02

  1. Restored after systemctl restart tikv-20160.service.

  2. Collected logs found:

  • tikv.log
[2023/03/10 14:10:13.187 +08:00] [INFO] [tracker.rs:248] [slow-query] [perf_stats.internal_delete_skipped_count=7479] [perf_stats.internal_key_skipped_count=43869] [perf_stats.block_read_byte=0] [perf_stats.block_read_count=0] [perf_stats.block_cache_hit_count=126282] [scan.range.first="Some(start: 7480000000000001135F72800000006D92D512 end: 7480000000000001135F72800000006D92D513)"] [scan.ranges=6442] [scan.total=27874] [scan.processed=13921] [scan.is_desc=false] [tag=select] [table_id=275] [txn_start_ts=439989950792859665] [total_suspend_time=3.247s] [total_process_time=483ms] [handler_build_time=1ms] [wait_time.snapshot=0ns] [wait_time.schedule=2ms] [wait_time=2ms] [total_lifetime=3.733s] [remote_host=ipv4: 192.168.8.183:5668] [region_id=43066374]
[2023/03/10 14:10:13.211 +08:00] [INFO] [tracker.rs:248] [slow-query] [perf_stats.internal_delete_skipped_count=0] [perf_stats.internal_key_skipped_count=3730] [perf_stats.block_read_byte=0] [perf_stats.block_read_count=0] [perf_stats.block_cache_hit_count=46639] [scan.range.first="Some(start: 7480000000000001135F72800000006C12E805 end: 7480000000000001135F72800000006C12E806)"] [scan.ranges=2546] [scan.total=4822] [scan.processed=4000] [scan.is_desc=false] [tag=select] [table_id=275] [txn_start_ts=439989960098971676] [total_suspend_time=3.218s] [total_process_time=164ms] [handler_build_time=1ms] [wait_time.snapshot=0ns] [wait_time.schedule=2ms] [wait_time=2ms] [total_lifetime=3.385s] [remote_host=ipv4: 192.168.8.120:61611] [region_id=42864517]
[2023/03/10 14:10:13.247 +08:00] [INFO] [tracker.rs:248] [slow-query] [perf_stats.internal_delete_skipped_count=0] [perf_stats.internal_key_skipped_count=9141] [perf_stats.block_read_byte=0] [perf_stats.block_read_count=0] [perf_stats.block_cache_hit_count=62716] [scan.range.first="Some(start: 7480000000000001135F72800000006F32B37C end: 7480000000000001135F72800000006F32B381)"] [scan.ranges=3170] [scan.total=8565] [scan.processed=6916] [scan.is_desc=false] [tag=select] [table_id=275] [txn_start_ts=439989935169077294] [total_suspend_time=3.223s] [total_process_time=292ms] [handler_build_time=1ms] [wait_time.snapshot=0ns] [wait_time.schedule=1ms] [wait_time=1ms] [total_lifetime=3.517s] [remote_host=ipv4: 192.168.8.120:61607] [region_id=43326515]
[2023/03/10 14:10:13.247 +08:00] [INFO] [tracker.rs:248] [slow-query] [perf_stats.internal_delete_skipped_count=0] [perf_stats.internal_key_skipped_count=4046] [perf_stats.block_read_byte=0] [perf_stats.block_read_count=0] [perf_stats.block_cache_hit_count=26988] [scan.range.first="Some(start: 7480000000000001135F7280000000721C3FA0 end: 7480000000000001135F7280000000721C3FA3)"] [scan.ranges=1565] [scan.total=3756] [scan.processed=3420] [scan.is_desc=false] [tag=select] [table_id=275] [txn_start_ts=439989943190683668] [total_suspend_time=3.25s] [total_process_time=98ms] [handler_build_time=1ms] [wait_time.snapshot=0ns] [wait_time.schedule=3ms] [wait_time=3ms] [total_lifetime=3.352s] [remote_host=ipv4: 192.168.8.120:61609] [region_id=43579051]
[2023/03/10 14:10:13.256 +08:00] [INFO] [tracker.rs:248] [slow-query] [perf_stats.internal_delete_skipped_count=0] [perf_stats.internal_key_skipped_count=18283] [perf_stats.block_read_byte=0] [perf_stats.block_read_count=0] [perf_stats.block_cache_hit_count=163988] [scan.range.first="Some(start: 7480000000000001135F728000000071FCB7D9 end: 7480000000000001135F728000000071FCB7DA)"] [scan.ranges=9455] [scan.total=19658] [scan.processed=17535] [scan.is_desc=false] [tag=select] [table_id=275] [txn_start_ts=439989981581672464] [total_suspend_time=3.257s] [total_process_time=611ms] [handler_build_time=1ms] [wait_time.snapshot=0ns] [wait_time.schedule=12ms] [wait_time=12ms] [total_lifetime=3.881s] [remote_host=ipv4: 192.168.8.120:61613] [region_id=43573861]
[2023/03/10 14:10:13.256 +08:00] [INFO] [tracker.rs:248] [slow-query] [perf_stats.internal_delete_skipped_count=0] [perf_stats.internal_key_skipped_count=10332] [perf_stats.block_read_byte=0] [perf_stats.block_read_count=0] [perf_stats.block_cache_hit_count=98] [scan.range.first="Some(start: 7480000000000001135F72800000006C3D5883 end: 7480000000000001135F72800000006C3D6147)"] [scan.ranges=2] [scan.total=5169] [scan.processed=5167] [scan.is_desc=false] [tag=select] [table_id=275] [txn_start_ts=439989960098971676] [total_suspend_time=3.171s] [total_process_time=9ms] [handler_build_time=0ns] [wait_time.snapshot=0ns] [wait_time.schedule=56ms] [wait_time=56ms] [total_lifetime=3.236s] [remote_host=ipv4: 192.168.8.120:61611] [region_id=42880285]
[2023/03/10 14:10:13.256 +08:00] [INFO] [tracker.rs:248] [slow-query] [perf_stats.internal_delete_skipped_count=40] [perf_stats.internal_key_skipped_count=864] [perf_stats.block_read_byte=0] [perf_stats.block_read_count=0] [perf_stats.block_cache_hit_count=13914] [scan.range.first="Some(start: 7480000000000001135F72800000006E94BDEF end: 7480000000000001135F72800000006E94BDF0)"] [scan.ranges=804] [scan.total=1306] [scan.processed=1126] [scan.is_desc=false] [tag=select] [table_id=275] [txn_start_ts=439989940870709252] [total_suspend_time=3.168s] [total_process_time=50ms] [handler_build_time=1ms] [wait_time.snapshot=0ns] [wait_time.schedule=34ms] [wait_time=34ms] [total_lifetime=3.253s] [remote_host=ipv4: 192.168.8.120:61609] [region_id=43206949]
....
...
[2023/03/10 14:11:54.630 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:20160: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:11:54.797 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:20160: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:11:55.091 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:20160: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:03.627 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:20160: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.402 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:5666: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.402 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:5667: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.404 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:5668: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.407 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:61609: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.408 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:61611: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.409 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:61607: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.410 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:61613: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.426 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:21566: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.428 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:52567: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.428 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:1897: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.430 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:30194: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.430 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:33917: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.431 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:5665: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.433 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:22966: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.434 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:49120: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:08.165 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:20160: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:08.165 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:20160: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:13.165 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:20160: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:23.165 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:2379: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:13:16.490 +08:00] [INFO] [util.rs:536] ["connecting to PD endpoint"] [endpoints=http://172.16.25.179:2379]
  • tidb_err.log
<jemalloc>: Malformed conf string
<jemalloc>: Malformed conf string
<jemalloc>: Malformed conf string
<jemalloc>: Malformed conf string

Considering that jemalloc is related to memory, it was found that tikv had a slight increase in memory usage around 14:10, but no OOM occurred.

Question:

  1. Hope experts can help explain why tikv hung and how to troubleshoot.

[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]

| username: xfworld | Original post link

  1. Did the overall load of the cluster increase together, but each node is balanced, or does one node have a higher load?
  2. Are there any changes in business operations that could lead to a decline in processing performance?

After the cluster recovery work, did this fault reoccur?

Looking at the updated release notes, both 5.0.5 and 5.0.6 have fixed a lot of bugs. I wonder if this is a known issue.

| username: TiDBer_yyy | Original post link

Information correction: The cluster version is 5.0.1. Apologies for the incorrect version.

  1. The node load is relatively balanced. The overall cluster load first increased, then the TiKV nodes hung.

  1. There was an increase in business volume, but the overall cluster QPS was not high. During the failure, there was no significant increase in traffic.
    Communication with the business: There was an increase in business volume, but the traffic growth was not significant. There was a backlog of messages, initially judged to be caused by a decrease in cluster performance due to the TiKV failure.

  2. No similar failures were found after recovery.

  3. No similar bugs were found in comparison. I will check further.

| username: xfworld | Original post link

Observe more and see if there is an IO bottleneck in TiKV.

| username: TiDBer_yyy | Original post link

There is indeed a situation where disk IO util increases without reason, and it seems that IO resources are not fully utilized (not reaching the bottleneck).

The CPU has a significant bottleneck, the machine is 8C:


| username: Lucien-卢西恩 | Original post link

I suggest referring to @xfworld’s advice and upgrading the version.

| username: TiDBer_yyy | Original post link

Hi teacher, could you help me with this OOM issue?
It has occurred multiple times recently, and I’m starting to doubt my life.

@xfworld teacher @Lucien-卢西恩 teacher, please take a look when you have time.

| username: TiDBer_yyy | Original post link

Teacher, how can we test SQL compatibility when upgrading versions?
Besides using the GitHub - innerr/workload-sim.tidb.ticat: Workload simulation automatic test on ticat tool to collect business traffic, are there other ways to verify business compatibility?

| username: 人如其名 | Original post link

Why not post the TiKV logs if TiKV is hanging?

| username: TiDBer_yyy | Original post link

Added to the first post.

| username: xfworld | Original post link

The improvement of OOM can only be optimized; there is no other good way.

Currently, only version 6.5.X (LTS) has the most effective optimization for OOM.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.