Scale Down Pump

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 缩容Pump

| username: xiexin

【TiDB Usage Environment】Production Environment
【TiDB Version】v4.0.10
【Reproduction Path】tiup cluster scale-in xxx -N xxx.xx.xx.xx
【Encountered Problem: Problem Phenomenon and Impact】
Problem Description:
I used tiup cluster scale-in to scale in the Pump nodes, and the command reported success. However, tiup cluster display shows the two scaled-in Pump nodes in the Pending Offline state. Checking the Pump node status with binlogctl shows the scaled-in nodes still in the online state, with MaxCommitTS and UpdateTime unchanged since the scale-in. The pump log keeps printing [WARN] entries such as: [WARN] [server.go:868] ["Waiting for drainer to consume binlog"] ["Minimum Drainer MaxCommitTS"=437208609048232751] ["Need to reach maxCommitTS"=441964881535041564].

   Here is a summary of the operation timeline and results:

Scale in 2 Pump nodes of the xxx cluster, hereafter referred to as pump1 and pump2

Management Machine 2023-06-05 18:50 tiup cluster scale-in xxx -N xxx.xx.xx.xx:8250,xxx.xx.xx.xx:8250
Command Result: Scaled cluster xxx in successfully

Check the xxx cluster topology

Management Machine 2023-06-05 18:51 tiup cluster display xxx

Command Result: pump1 and pump2 status is Pending offline

Check pump1 running log

Pump1 Server 2023-06-05 18:51 tail -100 /tidb/deploy/pump-8250/log/pump.log


Pump1 subsequently reports the following WARN
[2023/06/05 18:51:57.440 +08:00] [WARN] [server.go:868] ["Waiting for drainer to consume binlog"] ["Minimum Drainer MaxCommitTS"=437208609048232751] ["Need to reach maxCommitTS"=441964881535041564]
The timestamp 441964881535041564 corresponds to the time 2023-06-05 18:50:25.094 +0800 CST
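For reference, this conversion can be reproduced by hand: a TiDB TSO stores a physical timestamp (milliseconds since the Unix epoch) in its high bits and an 18-bit logical counter in its low bits. The snippet below is only an illustration of that layout, using GNU date on the management machine and the TSO value from this thread:

    # Decode a TiDB TSO: high bits = milliseconds since the Unix epoch, low 18 bits = logical counter
    ts=441964881535041564
    physical_ms=$(( ts >> 18 ))                          # 1685962225094
    date -d "@$(( physical_ms / 1000 ))" '+%F %T %z'     # 2023-06-05 18:50:25 +0800 (second precision, CST timezone)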

Use binlogctl to check the status of the scaled-in pump nodes

Management Machine 2023-06-05 18:51 /home/tidb/.tiup/components/ctl/v4.0.10/binlogctl -pd-urls=http://xxx.xx.xx.xx:2379 -cmd pumps | grep 'xxx.xx.xx.xx|xxx.xx.xx.xx'
Command Result:
[2023/06/05 18:51:00.155 +08:00] [INFO] [nodes.go:53] ["query node"] [type=pump] [node="{NodeID: xxx.xx.xx.xx:8250, Addr: xxx.xx.xx.xx:8250, State: online, MaxCommitTS: 441964878140277275, UpdateTime: 2023-06-05 18:50:12 +0800 CST}"]
[2023/06/05 18:51:00.155 +08:00] [INFO] [nodes.go:53] ["query node"] [type=pump] [node="{NodeID: xxx.xx.xx.xx:8250, Addr: xxx.xx.xx.xx:8250, State: online, MaxCommitTS: 441964878140277272, UpdateTime: 2023-06-05 18:50:11 +0800 CST}"]

binlogctl shows pump1's MaxCommitTS: 441964878140277275

Pump1 instance timestamp 441964878140277275 corresponds to the time 2023-06-05 18:50:12.144 +0800 CST

Phenomenon Description:
The tiup operation scales in 2 Pumps (around 2023-06-05 18:50:00); the scale-in command returns success, but the cluster topology shows the Pump nodes in the Pending Offline state, meaning they are still in the process of going offline.
The log of the scaled-in pump1 node keeps printing ["Need to reach maxCommitTS"=441964881535041564], meaning pump1's binlog must be consumed up to 441964881535041564 (2023-06-05 18:50:25.094) before the node can go offline.
binlogctl reports pump1's MaxCommitTS as 441964878140277275 (2023-06-05 18:50:12.144), which suggests pump1 has already been kicked out of the Pump cluster and no new binlog will be written to it, so its MaxCommitTS will not advance. Because this MaxCommitTS is smaller than the required consumption point 441964881535041564 (2023-06-05 18:50:25.094), the node stays stuck in the offline process. After waiting 2 hours, pump1's log still shows "Waiting for drainer to consume binlog", and binlogctl still reports the same MaxCommitTS and UpdateTime for the offlining Pump instance.
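The "Minimum Drainer MaxCommitTS" in the WARN log is the smallest MaxCommitTS among the Drainer nodes registered in PD, so a Drainer that is stale or far behind keeps every offlining Pump waiting. To see which Drainer is reporting the stuck value 437208609048232751, the Drainer status can be queried the same way the Pump status was queried above (same placeholder PD address):

    # List all drainer nodes registered in PD; compare each State/MaxCommitTS/UpdateTime
    # with the "Minimum Drainer MaxCommitTS" reported in the pump WARN log
    /home/tidb/.tiup/components/ctl/v4.0.10/binlogctl \
        -pd-urls=http://xxx.xx.xx.xx:2379 -cmd drainers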

Personal Understanding:

  1. A Pump holds the only copy of its binlog, so during scale-in all binlog on the node must be consumed by the drainers before the service can be stopped and taken offline; otherwise binlog would be lost.
  2. During scale-in, the Pump node to be removed is first taken out of the write path so that no new binlog is written to it, then the binlog already on it is consumed, and only then is the Pump taken offline.

Questions about scaling in Pump:

  1. Is it not possible to scale in a Pump node directly with the tiup tool?
  2. Is the correct way to scale in a Pump node to run binlogctl offline-pump first, and only then tiup cluster scale-in?
| username: xiexin | Original post link

Supplement:
Running binlogctl offline-pump first and then tiup cluster scale-in does not work either.

Commands executed and results:

  • Scale in pump node xxx.xx.xx.xx:8250
    binlogctl -pd-urls=http://xxx.xx.xx.xx:2379 -cmd offline-pump -node-id xxx.xx.xx.xx:8250
  • Execution result
    [2023/06/06 17:40:56.477 +08:00] [INFO] [nodes.go:123] [“Apply action on node success”] [action=close] [NodeID=xxx.xx.xx.xx:8250]
  • Check pump status
    binlogctl -pd-urls=http://xxx.xx.xx.xx:2379 -cmd pumps |grep xxx.xx.xx.xx
  • Execution result
    [2023/06/06 17:45:32.783 +08:00] [INFO] [nodes.go:53] [“query node”] [type=pump] [node=“{NodeID: xxx.xx.xx.xx:8250, Addr: xxx.xx.xx.xx:8250, State: online, MaxCommitTS: 441986437858984243, UpdateTime: 2023-06-06 17:40:55 +0800 CST}”]

Pump running logs (irrelevant logs removed)
[2023/06/06 17:40:26.532 +08:00] [INFO] [server.go:567] [“server info tick”] [writeBinlogCount=5254631854] [alivePullerCount=2] [MaxCommitTS=441986430138843378]
[2023/06/06 17:40:26.532 +08:00] [INFO] [server.go:548] [“use gc ts to detect drainer checkpoint”] [“gc ts”=441826942830247936]
[2023/06/06 17:40:26.533 +08:00] [ERROR] [server.go:639] [“drainer’s checkpoint is older than pump gc ts, some binlogs are purged”] [drainer=xxx.xx.xx.xx:8249] [“gc ts”=441845817198247936] [“drainer checkpoint”=441511349769469954]
[2023/06/06 17:40:26.533 +08:00] [ERROR] [server.go:639] [“drainer’s checkpoint is older than pump gc ts, some binlogs are purged”] [drainer=xxx.xxx.xx.xx:8249] [“gc ts”=441845817198247936] [“drainer checkpoint”=437208609048232751]
[2023/06/06 17:40:30.695 +08:00] [INFO] [server.go:596] [“send gc request to storage”] [“request gc ts”=441827886548910080]
[2023/06/06 17:40:30.695 +08:00] [INFO] [storage.go:695] [“Starting GC”] [ts=441827880781742080]
[2023/06/06 17:40:30.695 +08:00] [INFO] [storage.go:752] [“New LevelDB iterator created for GC”] [ts=441827880781742080] [start=0] [limit=441827880781742081]
[2023/06/06 17:40:30.695 +08:00] [INFO] [vlog.go:482] [“GC vlog”] [ts=441827880781742080]
[2023/06/06 17:40:30.695 +08:00] [INFO] [storage.go:701] [“Finish VLog GC”] [ts=441827880781742080]
[2023/06/06 17:40:31.444 +08:00] [INFO] [storage.go:791] [“has delete”] [“delete num”=102400]
[2023/06/06 17:40:31.605 +08:00] [INFO] [storage.go:791] [“has delete”] [“delete num”=204800]
[2023/06/06 17:40:31.738 +08:00] [INFO] [storage.go:791] [“has delete”] [“delete num”=307200]
[2023/06/06 17:40:31.909 +08:00] [INFO] [storage.go:791] [“has delete”] [“delete num”=409600]
[2023/06/06 17:40:32.033 +08:00] [INFO] [storage.go:791] [“has delete”] [“delete num”=512000]
[2023/06/06 17:40:32.155 +08:00] [INFO] [storage.go:791] [“has delete”] [“delete num”=614400]
[2023/06/06 17:40:32.359 +08:00] [INFO] [storage.go:791] [“has delete”] [“delete num”=716800]
[2023/06/06 17:40:32.483 +08:00] [INFO] [storage.go:791] [“has delete”] [“delete num”=819200]
[2023/06/06 17:40:32.607 +08:00] [INFO] [storage.go:791] [“has delete”] [“delete num”=921600]
[2023/06/06 17:40:32.799 +08:00] [INFO] [storage.go:791] [“has delete”] [“delete num”=1024000]
[2023/06/06 17:40:32.892 +08:00] [INFO] [storage.go:783] [“Finish KV GC”] [ts=441827880781742080] [“delete num”=1115135]
[2023/06/06 17:40:36.532 +08:00] [INFO] [server.go:567] [“server info tick”] [writeBinlogCount=5254636083] [alivePullerCount=2] [MaxCommitTS=441986432760283250]
[2023/06/06 17:40:46.532 +08:00] [INFO] [server.go:567] [“server info tick”] [writeBinlogCount=5254640366] [alivePullerCount=2] [MaxCommitTS=441986435381723699]
[Executing offline-pump command]
[2023/06/06 17:40:56.434 +08:00] [INFO] [server.go:786] [“receive action”] [nodeID=xxx.xx.xx.xx:8250] [action=close]
[2023/06/06 17:40:56.434 +08:00] [INFO] [server.go:809] [“pump’s state change to closing”] [nodeID=xxx.xx.xx.xx:8250]
[2023/06/06 17:40:56.434 +08:00] [INFO] [server.go:908] [“begin to close pump server”]
[2023/06/06 17:40:56.434 +08:00] [INFO] [server.go:541] [“detect drainer checkpoint routine exit”]
[2023/06/06 17:40:56.434 +08:00] [INFO] [server.go:564] [“printServerInfo exit”]
[2023/06/06 17:40:56.434 +08:00] [INFO] [server.go:517] [“genFakeBinlog exit”]
[2023/06/06 17:40:56.434 +08:00] [INFO] [node.go:236] [“Heartbeat goroutine exited”]
[2023/06/06 17:40:56.434 +08:00] [INFO] [server.go:582] [“gcBinlogFile exit”]
[2023/06/06 17:40:56.434 +08:00] [INFO] [server.go:917] [“background goroutines are stopped”]
[2023/06/06 17:40:56.434 +08:00] [INFO] [server.go:955] [“The max commit ts saved is less than expected commit ts”] [“max commit ts”=441986437976948822] [“expected commit ts”=441986437976948997]
[2023/06/06 17:40:56.443 +08:00] [WARN] [server.go:256] [“reject write binlog for not online state”] [state=closing]
[2023/06/06 17:40:56.445 +08:00] [WARN] [server.go:256] [“reject write binlog for not online state”] [state=closing]
[2023/06/06 17:40:56.446 +08:00] [WARN] [server.go:256] [“reject write binlog for not online state”] [state=closing]
[2023/06/06 17:40:58.437 +08:00] [WARN] [server.go:868] [“Waiting for drainer to consume binlog”] [“Minimum Drainer MaxCommitTS”=437208609048232751] [“Need to reach maxCommitTS”=441986437976948997]
[2023/06/06 17:40:59.439 +08:00] [WARN] [server.go:868] [“Waiting for drainer to consume binlog”] [“Minimum Drainer MaxCommitTS”=437208609048232751] [“Need to reach maxCommitTS”=441986437976948997]
[2023/06/06 17:41:00.440 +08:00] [WARN] [server.go:868] [“Waiting for drainer to consume binlog”] [“Minimum Drainer MaxCommitTS”=437208609048232751] [“Need to reach maxCommitTS”=441986437976948997]
[2023/06/06 17:41:06.530 +08:00] [INFO] [storage.go:384] [DBStats] [DBStats=“{"WriteDelayCount":0,"WriteDelayDuration":0,"WritePaused":false,"AliveSnapshots":0,"AliveIterators":0,"IOWrite":2462234279219,"IORead":2340245933251,"BlockCacheSize":8387624,"OpenedTablesCount":6,"LevelSizes":[87744451,2117234218],"LevelTablesCounts":[4,32],"LevelRead":[0,1979484705349],"LevelWrite":[157887247225,1821787228222],"LevelDurations":[3515638558788,80076360188282]}”]
[2023/06/06 17:41:11.347 +08:00] [ERROR] [server.go:746] [“get pumps’ status failed”] [error=“context canceled”] [errorVerbose=“context canceled\ngithub.com/pingcap/errors.AddStack\n\t/home/tomwei7/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.Trace\n\t/home/tomwei7/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/juju_adaptor.go:15\ngithub.com/pingcap/tidb-binlog/pkg/etcd.(*Client).List\n\t/home/tomwei7/bilibili/repos/db-rd/tidb-binlog/pkg/etcd/etcd.go:164\ngithub.com/pingcap/tidb-binlog/pkg/node.(*EtcdRegistry).Nodes\n\t/home/tomwei7/bilibili/repos/db-rd/tidb-binlog/pkg/node/registry.go:75\ngithub.com/pingcap/tidb-binlog/pump.(*pumpNode).NodesStatus\n\t/home/tomwei7/bilibili/repos/db-rd/tidb-binlog/pump/node.go:228\ngithub.com/pingcap/tidb-binlog/pump.(*Server).PumpStatus\n\t/home/tomwei7/bilibili/repos/db-rd/tidb-binlog/pump/server.go:744\ngithub.com/pingcap/tidb-binlog/pump.(*Server).Status\n\t/home/tomwei7/bilibili/repos/db-rd/tidb-binlog/pump/server.go:683\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/lib/go/src/net/http/server.go:2109\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\t/home/tomwei7/go/pkg/mod/github.com/gorilla/mux@v1.7.4/mux.go:210\nnet/http.(*ServeMux).ServeHTTP\n\t/usr/lib/go/src/net/http/server.go:2487\nnet/http.serverHandler.ServeHTTP\n\t/usr/lib/go/src/net/http/server.go:2947\nnet/http.(*conn).serve\n\t/usr/lib/go/src/net/http/server.go:1991\nruntime.goexit\n\t/usr/lib/go/src/runtime/asm_amd64.s:1594”]

| username: Billmay表妹 | Original post link

The tiup cluster scale-in command can be used to scale in nodes in a TiDB cluster, including TiDB, TiKV, PD, TiFlash, Pump, etc. Therefore, it is possible to scale in Pump nodes using the tiup cluster scale-in command.
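For reference, the generic form looks like the following; the cluster name and node address are placeholders, matching the commands already used in this thread:

    # Scale in a Pump node directly with TiUP, then watch its status in the topology
    tiup cluster scale-in <cluster-name> -N <pump-ip>:8250
    tiup cluster display <cluster-name>    # the node shows Pending Offline until the offline process finishes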

| username: Billmay表妹 | Original post link

According to the official documentation, the correct steps to scale in a Pump node are:

  1. Use the binlogctl offline-pump command to take the Pump node offline. Wait for all Binlog Drainers to finish consuming the Binlog on that node before proceeding to the next step.
  2. Use the tiup cluster scale-in command to scale in the Pump node.

Therefore, you can try using the binlogctl offline-pump command to take the Pump node offline, wait for all Binlog Drainers to finish consuming the Binlog on that node, and then use the tiup cluster scale-in command to scale in the Pump node.
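Put together, the sequence would look roughly like this, using the same placeholder PD address and node ID as earlier in this thread:

    # Step 1: ask the Pump to go offline; it waits until every drainer has consumed its binlog
    binlogctl -pd-urls=http://xxx.xx.xx.xx:2379 -cmd offline-pump -node-id xxx.xx.xx.xx:8250

    # Check progress: the node's State should eventually change from online/closing to offline
    binlogctl -pd-urls=http://xxx.xx.xx.xx:2379 -cmd pumps

    # Step 2: once the node is offline, remove it from the topology
    tiup cluster scale-in <cluster-name> -N xxx.xx.xx.xx:8250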

| username: Billmay表妹 | Original post link

Refer to these questions:

https://asktug.com/t/topic/274437

| username: srstack | Original post link

You can scale it in directly with TiUP~

| username: songxuecheng | Original post link

tiup cluster scale-in xx --force

| username: Anna | Original post link

tiup cluster scale-in xx --force

| username: xiexin | Original post link

Whether I use tiup cluster scale-in or binlogctl -cmd offline-pump, after the command is executed the Pump node keeps waiting for the drainers' commitTS to reach a timestamp taken after the scale-in operation. This leaves the Pump stuck in the closing state, and the scale-in cannot complete. binlogctl update-pump can change the state so that the scale-in eventually finishes, but my concern is whether binlog would be lost, since this is a production environment and the binlog is consumed downstream.
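For completeness, the update-pump workaround mentioned above looks roughly like the line below, following the binlogctl usage already shown in this thread. It only rewrites the node's registered state in PD and does not wait for drainers, which is exactly why it can lose any binlog on that Pump that downstream has not yet consumed:

    # Force the stuck Pump's registered state to offline in PD.
    # WARNING: this skips the "wait for drainers" step, so only run it after confirming
    # the drainers have already replicated past this Pump's MaxCommitTS.
    binlogctl -pd-urls=http://xxx.xx.xx.xx:2379 -cmd update-pump -node-id xxx.xx.xx.xx:8250 -state offline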

| username: xiexin | Original post link

There was no error in scale-in, so there is no need to add the --force parameter.

| username: songxuecheng | Original post link

What is being consumed downstream?

| username: xiexin | Original post link

Canal