Note:
This topic has been translated from a Chinese forum by GPT and might contain errors. Original topic: single-server deployment of multiple components; memory suddenly fills up, causing constant restarts; the memory fills up again and it restarts again.

What are the causes of these anomalies? How can this problem be resolved?
Use tiup cluster display tidb-xxx to check the current status of the cluster components.
My guess is that the OOM killer keeps killing the components over and over.
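To confirm that, you could check the kernel log for OOM killer records; a minimal sketch on a typical Linux host (adjust the grep pattern and time window as needed):

# search the kernel ring buffer for OOM killer activity
dmesg -T | grep -iE "out of memory|oom-killer|killed process"
# or query the systemd journal for kernel messages in the last hour
journalctl -k --since "1 hour ago" | grep -i "killed process"

If TiKV or TiDB process names show up there, the restarts are being driven by the OOM killer.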
The parameters for hybrid deployment need to be adjusted; the values recommended in the hybrid-deployment documentation should be recalculated for this machine and applied.
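As a rough sketch of what usually gets recalculated for hybrid deployment (the values below are illustrative assumptions for a 32 GB host running three TiKV instances, not figures from the documentation):

server_configs:
  tikv:
    # per-instance shared block cache; keep the sum across instances well below physical memory,
    # since a TiKV instance typically uses noticeably more than its block cache
    storage.block-cache.capacity: "3G"
    # cap the unified read pool threads so three instances do not oversubscribe the CPUs
    readpool.unified.max-thread-count: 4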
Use the top command to check memory usage, press Shift+M to sort by memory, and see which components are using the most.
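If you prefer a one-off snapshot instead of the interactive view, something like this works with the standard procps tools:

# list the top memory consumers, sorted by resident set size
ps aux --sort=-rss | head -n 15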
@CatLover @tidbNewbie
global:
  user: tidb
  ssh_port: 22
  ssh_type: builtin
  deploy_dir: /opt/soft/tidb/tidb-deploy
  data_dir: /opt/soft/tidb/tidb-data
  os: linux
monitored:
  node_exporter_port: 9100
  blackbox_exporter_port: 9115
  deploy_dir: /opt/soft/tidb/tidb-deploy/monitor-9100
  data_dir: /opt/soft/tidb/tidb-data/monitor-9100
  log_dir: /opt/soft/tidb/tidb-deploy/monitor-9100/log
server_configs:
  tidb:
    log.file.max-backups: 7
    log.level: error
    log.slow-threshold: 300
  tikv:
    log.file.max-backups: 7
    log.file.max-days: 7
    log.level: error
    raftstore.capacity: 10G
    readpool.coprocessor.use-unified-pool: true
    readpool.storage.use-unified-pool: true
    storage.block-cache.capacity: 5G
    storage.block-cache.shared: true
  pd:
    log.file.max-backups: 7
    log.file.max-days: 1
    log.level: error
    replication.enable-placement-rules: true
    replication.location-labels:
    - host
  tidb_dashboard: {}
  tiflash:
    logger.level: info
  tiflash-learner: {}
  pump: {}
  drainer: {}
  cdc: {}
  kvcdc: {}
  grafana: {}
tidb_servers:
- host: 192.168.103.30
  ssh_port: 22
  port: 4000
  status_port: 10080
  deploy_dir: /opt/soft/tidb/tidb-deploy/tidb-4000
  log_dir: /opt/soft/tidb/tidb-deploy/tidb-4000/log
  arch: amd64
  os: linux
tikv_servers:
- host: 192.168.103.30
  ssh_port: 22
  port: 20160
  status_port: 20180
  deploy_dir: /opt/soft/tidb/tidb-deploy/tikv-20160
  data_dir: /opt/soft/tidb/tidb-data/tikv-20160
  log_dir: /opt/soft/tidb/tidb-deploy/tikv-20160/log
  config:
    server.labels:
      host: logic-host-1
  arch: amd64
  os: linux
- host: 192.168.103.30
  ssh_port: 22
  port: 20161
  status_port: 20181
  deploy_dir: /opt/soft/tidb/tidb-deploy/tikv-20161
  data_dir: /opt/soft/tidb/tidb-data/tikv-20161
  log_dir: /opt/soft/tidb/tidb-deploy/tikv-20161/log
  config:
    server.labels:
      host: logic-host-1
  arch: amd64
  os: linux
- host: 192.168.103.30
  ssh_port: 22
  port: 20162
  status_port: 20182
  deploy_dir: /opt/soft/tidb/tidb-deploy/tikv-20162
  data_dir: /opt/soft/tidb/tidb-data/tikv-20162
  log_dir: /opt/soft/tidb/tidb-deploy/tikv-20162/log
  config:
    server.labels:
      host: logic-host-1
  arch: amd64
  os: linux
tiflash_servers: []
pd_servers:
- host: 192.168.103.30
  ssh_port: 22
  name: pd-192.168.103.30-2379
  client_port: 2379
  peer_port: 2380
  deploy_dir: /opt/soft/tidb/tidb-deploy/pd-2379
  data_dir: /opt/soft/tidb/tidb-data/pd-2379
  log_dir: /opt/soft/tidb/tidb-deploy/pd-2379/log
  arch: amd64
  os: linux
monitoring_servers:
- host: 192.168.103.30
  ssh_port: 22
  port: 9090
  ng_port: 12020
  deploy_dir: /opt/soft/tidb/tidb-deploy/prometheus-9090
  data_dir: /opt/soft/tidb/tidb-data/prometheus-9090
  log_dir: /opt/soft/tidb/tidb-deploy/prometheus-9090/log
  external_alertmanagers: []
  arch: amd64
  os: linux
grafana_servers:
- host: 192.168.103.30
  ssh_port: 22
  port: 3000
  deploy_dir: /opt/soft/tidb/tidb-deploy/grafana-3000
  arch: amd64
  os: linux
  username: admin
  password: admin
  anonymous_enable: false
  root_url: ""
  domain: ""
Is the machine’s memory 32GB? The storage.block-cache.capacity of 5GB might be a bit too large. Try reducing it a bit.
It seems that because the TiKV nodes keep restarting, the modified configuration cannot be reloaded.
First, set the TiKV config storage.block-cache.capacity=3G to stop it from OOMing. With the cache at 5G, each TiKV instance can use up to about 12G of memory in total, so three instances on a 32G machine will not fit and the OOM killer will keep killing TiKV.
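A sketch of both ways to apply that change (tidb-xxx is a placeholder cluster name; the SQL statement only affects the currently running TiKV instances, so the topology should be edited as well if you want the change to survive a reload):

-- online change on the running TiKV instances
SET CONFIG tikv `storage.block-cache.capacity` = '3GB';

# persist it in the topology and roll it out to the TiKV role
tiup cluster edit-config tidb-xxx     # set storage.block-cache.capacity: "3G" under server_configs.tikv
tiup cluster reload tidb-xxx -R tikv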
For single-machine multi-instance hybrid deployment, you can refer to this document to set resource limits.
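If a hard per-instance cap is what you are after, the tiup topology also supports a resource_control block (mapped to systemd cgroup limits); a minimal sketch for one TiKV instance, with an 8G limit picked purely for illustration:

tikv_servers:
- host: 192.168.103.30
  port: 20160
  status_port: 20180
  resource_control:
    memory_limit: "8G"   # systemd memory limit for this instance; the value is an assumption

After editing the topology, a tiup cluster reload is needed for the limit to take effect.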
Huh? Even after setting it according to the documentation, memory usage still can't be limited? Can only the new version achieve proper resource isolation?
The new version has soft limits and introduces the concept of resource groups: total usage will not exceed the upper limit of the group, which works better. However, the fineness of the control and the ease of use are still not quite there, so for now we can only wait.
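For reference, in newer versions (7.1 and later) this is the resource control feature, driven by SQL like the following; the group name, user, and RU quota here are made up for illustration, and the quota governs CPU/IO request units rather than memory directly:

-- resource control is on by default in recent versions
SET GLOBAL tidb_enable_resource_control = ON;
-- create a resource group capped at 2000 request units per second (illustrative value)
CREATE RESOURCE GROUP IF NOT EXISTS app_rg RU_PER_SEC = 2000;
-- bind a user to the group so its sessions are bounded by the group's quota
ALTER USER 'app_user'@'%' RESOURCE GROUP app_rg;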