Two Nodes Keep Restarting During TiKV Scale-Out

| username: jingyesi3401

[TiDB Usage Environment] Production Environment, Testing

[TiDB Version] v5.1.0

[Encountered Problem: Problem Phenomenon and Impact] The production environment originally had 6 TiKV servers. On the afternoon of the 26th we scaled out by adding 13 more TiKV servers, and everything was normal at that time. Later, to speed up the scale-out, we adjusted some scheduling parameters. Since yesterday, two nodes have been restarting repeatedly and are still doing so.

  1. The parameters adjusted to speed up the process are as follows:
config set leader-schedule-limit 16 
config set region-schedule-limit 8192
config set max-pending-peer-count 64
config set max-snapshot-count 64
config set replica-schedule-limit 96
  2. The parameters after recovery are as follows:

[Attachments: Screenshots/Logs/Monitoring] The TiKV logs of the two abnormal nodes are as follows:
Status as of December 30, 2022:
The service on node 141 is currently running normally, but its tikv.log reports the following error (Figure 1), and it is unclear whether this has any impact. The service on node 139 is down, yet its tikv.log keeps being updated. I am going to start it manually and see what happens.

  1. 141 tikv.log

  2. 139 tikv.log

| username: 会飞的土拨鼠

Are the CPU core count and memory size of the newly added TiKV nodes the same?

| username: jingyesi3401

Yes, they are the same: 16-core CPU and 64 GB of memory.

| username: 裤衩儿飞上天

How is the disk performance on the restarting nodes? Check the disk I/O.

| username: 会飞的土拨鼠

At this point, you need to analyze the resource usage of the restarting nodes: memory usage, CPU usage, disk I/O, and network bandwidth. Then check the load on the TiKV nodes using the TiDB Grafana dashboard.
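
As a rough starting point for the host-level part of this check, here is a minimal Python sketch that takes a one-off snapshot of CPU load, memory, and disk space on a node. It uses only the standard library and assumes a Linux host; the `data_dir` argument is a hypothetical stand-in for the TiKV data directory. It does not measure disk I/O or network bandwidth, which still need a tool such as iostat or the Grafana dashboards.

```python
import os
import shutil


def read_meminfo(path="/proc/meminfo"):
    """Return MemTotal and MemAvailable in GiB, or {} if the file is unreadable."""
    wanted = {"MemTotal", "MemAvailable"}
    values = {}
    try:
        with open(path) as fh:
            for line in fh:
                key, _, rest = line.partition(":")
                if key in wanted:
                    kib = int(rest.split()[0])  # /proc/meminfo reports kB
                    values[key] = kib / (1024 * 1024)
    except OSError:
        pass  # not Linux, or no permission: report nothing rather than guess
    return values


def snapshot(data_dir="/"):
    """Collect a one-off view of CPU load, memory, and disk space usage."""
    load_1m, load_5m, load_15m = os.getloadavg()
    disk = shutil.disk_usage(data_dir)
    used_pct = 100.0 * disk.used / disk.total if disk.total else 0.0
    return {
        "cpu_count": os.cpu_count(),
        "load_1m": load_1m,
        "load_15m": load_15m,
        "mem_gib": read_meminfo(),
        "disk_used_pct": used_pct,
    }


if __name__ == "__main__":
    # data_dir="/" is a placeholder; point it at the TiKV data directory.
    for name, value in snapshot("/").items():
        print(f"{name}: {value}")
```

A 1-minute load average well above `cpu_count`, or a small `MemAvailable`, would be a reason to look more closely at that node in Grafana.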

| username: jingyesi3401

The 6 original TiKV nodes have the same disk configuration: all-flash storage.

| username: Minorli-PingCAP

Use tiup cluster check to verify whether the operating system configuration of the newly added nodes passes.
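
A minimal sketch of how that check could be scripted from the control machine, assuming `tiup cluster check <cluster-name> --cluster` is the intended command. The cluster name `prod-cluster` is hypothetical. The optional `--apply` flag asks tiup to try fixing failed items, so it is left off by default here.

```python
import shutil
import subprocess
from typing import List, Optional


def build_check_command(cluster_name: str, apply_fixes: bool = False) -> List[str]:
    """Build the tiup invocation that checks the hosts of an existing cluster."""
    cmd = ["tiup", "cluster", "check", cluster_name, "--cluster"]
    if apply_fixes:
        cmd.append("--apply")  # let tiup attempt to repair failed check items
    return cmd


def run_check(cluster_name: str) -> Optional[int]:
    """Run the check and return its exit code, or None if tiup is not installed."""
    if shutil.which("tiup") is None:
        return None
    result = subprocess.run(build_check_command(cluster_name), check=False)
    return result.returncode


if __name__ == "__main__":
    # "prod-cluster" is a placeholder; only the command line is printed here.
    print(" ".join(build_check_command("prod-cluster")))
```

Items reported as Fail or Warn on the newly added hosts are the ones to compare against the original 6 nodes.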