Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: 通过 TidbNGMonitoring CRD 部署的 ngMonitoring 存储爆满 (ngMonitoring deployed via the TidbNGMonitoring CRD fills up its storage)
[TiDB Usage Environment] Production Environment
[TiDB Version] 6.1.0
[Reproduction Path]
Deploy the TiDB cluster through the official TiDB Operator and create a TidbNGMonitoring object with the following configuration:
apiVersion: pingcap.com/v1alpha1
kind: TidbNGMonitoring
metadata:
  name: basicai
spec:
  clusters:
    - name: basicai
      namespace: tidb-cluster
  nodeSelector:
    dedicated: infra
  ngMonitoring:
    requests:
      storage: 50Gi
    version: v6.1.0
    storageClassName: alicloud-disk-tidb-monitor
    baseImage: harbor.ba.....m/bf/ng-monitoring
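For reference, a minimal sketch of applying this manifest (the file name is hypothetical; the namespace matches the cluster reference above):

  # Hypothetical file name.
  kubectl -n tidb-cluster apply -f tidb-ng-monitoring.yaml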
Describing the ng-monitoring pod shows the command it runs with:
Command:
  /bin/sh
  -c
  /ng-monitoring-server \
    --pd.endpoints basicai-pd.tidb-cluster:2379 \
    --advertise-address ${POD_NAME}.basicai-ng-monitoring.tidb-cluster:12020 \
    --config /etc/ng-monitoring/ng-monitoring.toml \
    --storage.path /var/lib/ng-monitoring
[Encountered Problem: Phenomenon and Impact]
- Problem: the ng-monitoring pod quickly fills up its storage volume and keeps restarting because no space is left.
- Impact: ng-monitoring cannot be used normally.
How can I control ng-monitoring's data retention? I noticed that the official example does not allocate a particularly large volume either.
I deleted the TidbNGMonitoring object along with its PVC and redeployed. Within just a few minutes, the /var/lib/ng-monitoring/docdb directory had already grown to 112.4M and kept growing at that rate.
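A minimal sketch for watching that growth from outside the pod; the pod name basicai-ng-monitoring-0 is an assumption based on the StatefulSet naming the Operator typically uses:

  # Hypothetical pod name; docdb and tsdb are the data directories ng-monitoring
  # keeps under --storage.path (docdb for profiles, tsdb for Top SQL time series).
  kubectl -n tidb-cluster exec basicai-ng-monitoring-0 -- \
    du -sh /var/lib/ng-monitoring/docdb /var/lib/ng-monitoring/tsdb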
50GB might be a bit too little; try allocating more.
The official documentation only mentions 10G; it doesn't say how large the volume should be or how the data can be managed. If the data can't be managed, I think even 1TB might not be enough.
Enable Continuous Profiling:
- In TiDB Dashboard, go to Advanced Debugging > Profiling Instances > Continuous Profiling.
- Click Open Settings. On the Settings page on the right, turn on the switch under Enable Feature, then set the Retention Period or keep the default value.
I don't know whether the Retention Period option actually caps the data size, but its minimum value is 3 days.
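One way to check which settings ng-monitoring is actually running with, assuming it serves its dynamic configuration at a /config HTTP endpoint on port 12020 (the port visible in --advertise-address above); both the endpoint and the pod name are assumptions to verify against your version:

  # Hypothetical pod name and endpoint; port 12020 taken from the pod's arguments above.
  kubectl -n tidb-cluster port-forward pod/basicai-ng-monitoring-0 12020:12020 &
  curl -s http://127.0.0.1:12020/config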
Normally it shouldn't store that much data. Does your cluster have many nodes?
I think you can try upgrading; 6.1.0 is quite old, and it's best to move to the latest version.
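As a minimal sketch (not from the thread), bumping only the monitoring component could look like the patch below; whether a newer ng-monitoring works against a 6.1.0 cluster, and whether the private baseImage mirror carries the target tag, are assumptions you would need to verify:

  # Hypothetical one-liner; substitute whatever target version you have validated.
  kubectl -n tidb-cluster patch tidbngmonitoring basicai --type merge \
    -p '{"spec":{"ngMonitoring":{"version":"v7.5.1"}}}'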
We are currently evaluating the upgrade to 7.5.1. The development and testing environments have already been upgraded, but the production environment upgrade will not happen so quickly.
Thank you for the guidance. Based on this table, I calculated that, given my cluster size and node count, the ng-monitoring disk usage should not exceed 40GB. I have configured 50GB and will watch whether it fills up again.
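A back-of-the-envelope version of that kind of calculation; the per-instance daily figure and the counts below are placeholders, not numbers from the documentation:

  # All figures hypothetical: usage scales roughly as
  # per-instance daily data x instance count x retention days.
  per_instance_mb_per_day=450
  instances=30
  retention_days=3
  echo "$(( per_instance_mb_per_day * instances * retention_days / 1024 )) GiB"   # ~39 GiB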
Older versions of ng-monitoring have known issues where historical data is not cleaned up, so upgrading is recommended. You can search GitHub for the specific issues; one I know of is "conprof consume too much disk space and gc doesn't release disk space" (pingcap/ng-monitoring#120, https://github.com/pingcap/ng-monitoring/issues/120).
We can only upgrade the production environment once the evaluation in our development and testing environments has passed.