TiKV Node Offline Causes Severe Cluster Jitter, Leading to OOM in Other TiKV Nodes

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv节点下线,集群抖动的厉害, 导致其他tikv节点oom

| username: lxzkenney

TiDB 5.0.3
Command executed:
tiup cluster scale-in tidb-fin-app --node xxx.xxx.xxx.xxx:20160

First, I scaled out two TiKV nodes (two TiKV instances on one machine). After the scale-out, I prepared to remove two TiKV nodes (again two TiKV instances on one machine). While the first node was being removed, the cluster jittered severely, and 4-5 TiKV nodes hit OOM and restarted. This cluster also has CDC nodes with replication tasks to Kafka, and the CDC CPU and memory pressure is very high. TiDB peak QPS is around 4k to 16k. Previously, when nodes were taken offline in other TiDB clusters, it was very smooth and barely noticeable, with no significant jitter. I don’t know why the jitter is so severe this time.

| username: Billmay表妹 | Original post link

This is likely caused by the data migration and scheduling that happen when a TiKV node is taken offline. When a TiKV node goes offline, PD automatically migrates the Region replicas on that node to other nodes. This process can add significant load to the cluster, leading to jitter and OOM (Out of Memory) on the remaining TiKV nodes.
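
If the scheduling itself is what overloads the cluster, the migration speed can be throttled from PD while the node drains. A minimal sketch using pd-ctl through tiup (the PD address and the limit values are placeholders to tune for your cluster, not recommendations):

tiup ctl:v5.0.3 pd -u http://<pd-host>:2379 config show # inspect the current scheduling limits
tiup ctl:v5.0.3 pd -u http://<pd-host>:2379 config set replica-schedule-limit 32 # fewer concurrent replica-migration operators
tiup ctl:v5.0.3 pd -u http://<pd-host>:2379 store limit all 5 # cap add-peer/remove-peer operations per store per minute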

To avoid this, it is recommended to move the data, and especially the Region leaders, off the node before taking it offline. You can use the TiUP tool to perform the offline and data migration operations for the TiKV node; refer to the TiUP documentation for the specific steps.

Additionally, if the cluster's peak QPS (Queries Per Second) reaches around 16,000, it is recommended to scale horizontally: add TiDB nodes to improve query performance and stability, and consider adding TiKV nodes as well to spread the storage load.
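
For reference, a tiup scale-out goes through a topology file; a minimal sketch (the hosts are placeholders):

# scale-out.yaml
tidb_servers:
  - host: xxx.xxx.xxx.xxx
tikv_servers:
  - host: yyy.yyy.yyy.yyy

tiup cluster scale-out tidb-fin-app scale-out.yaml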

| username: lxzkenney | Original post link

I used the normal tiup tool to scale in.

tiup cluster scale-in tidb-fin-app --node xxx.xxx.xxx.xxx:20160

I know where the problem is now: I didn’t specify a timeout, so the node was forcibly taken offline after the default 5 minutes (300 seconds).
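
If this is the leader-transfer wait, tiup cluster scale-in exposes it as a flag, so the transfer can be given more time; a sketch (the 7200-second value is just an example, not a recommendation):

tiup cluster scale-in tidb-fin-app --node xxx.xxx.xxx.xxx:20160 --transfer-timeout 7200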

| username: lxzkenney | Original post link

I have another question. When tiup takes a node offline, it forces the node offline after the 300-second timeout. In theory, for any Region whose leader had not yet migrated, that leader becomes inaccessible, which should trigger a role switch promoting a follower on another TiKV to leader. That action should be quick and the disruption brief. However, my cluster experienced disruptions for 2 days until the node was completely offline.

This suggests two possible causes (a pd-ctl check for each is sketched below):

  1. The Region leader failed, and a follower was promoted to leader.
  2. A Region replica went missing, and the replica was automatically replenished on the surviving TiKVs.
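
To tell the two apart, PD can be queried directly. A sketch using pd-ctl's region check subcommands (the PD address is a placeholder):

tiup ctl:v5.0.3 pd -u http://<pd-host>:2379 region check down-peer # Regions with a peer on a down store (cause 1)
tiup ctl:v5.0.3 pd -u http://<pd-host>:2379 region check miss-peer # Regions short of a replica (cause 2)
tiup ctl:v5.0.3 pd -u http://<pd-host>:2379 region check pending-peer # Regions with a lagging peer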

| username: xfworld | Original post link

For a node that is going offline, run the leader eviction first, and only then take it offline…

Refer to this pd-ctl command:

>> scheduler add evict-leader-scheduler 1
// Evict all Region leaders from store 1


After the leaders are evicted and the TiKV node goes offline, its replicas will be replenished by the other nodes (provided there are still enough TiKV nodes to satisfy the replica count…).
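
Putting this together, one possible drain sequence (a sketch; the store ID 1, PD address, and node address are placeholders):

tiup ctl:v5.0.3 pd -u http://<pd-host>:2379 scheduler add evict-leader-scheduler 1 # evict all leaders from store 1
tiup ctl:v5.0.3 pd -u http://<pd-host>:2379 store 1 # repeat until leader_count reaches 0
tiup cluster scale-in tidb-fin-app --node xxx.xxx.xxx.xxx:20160 # drain the node once its leaders are gone
tiup ctl:v5.0.3 pd -u http://<pd-host>:2379 scheduler remove evict-leader-scheduler-1 # clean up after the store is Tombstone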

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.