Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: 通过 TidbNGMonitoring CRD 部署的 ngMonitoring 存储爆满 (ngMonitoring deployed via the TidbNGMonitoring CRD fills up its storage)
[TiDB Usage Environment] Production Environment
[TiDB Version] 6.1.0
[Reproduction Path]
Deploy the TiDB cluster through the official TiDB Operator and create a TidbNGMonitoring object with the following configuration:
apiVersion: pingcap.com/v1alpha1
kind: TidbNGMonitoring
metadata:
  name: basicai
spec:
  clusters:
    - name: basicai
      namespace: tidb-cluster
  nodeSelector:
    dedicated: infra
  ngMonitoring:
    requests:
      storage: 50Gi
    version: v6.1.0
    storageClassName: alicloud-disk-tidb-monitor
    baseImage: harbor.ba.....m/bf/ng-monitoring
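For reference, a minimal sketch of applying this manifest (the file name is hypothetical; the namespace matches the cluster reference above):

  # Hypothetical file name.
  kubectl -n tidb-cluster apply -f tidb-ng-monitoring.yaml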
Describing the ng-monitoring pod shows the command it runs with:
Command:
  /bin/sh
  -c
  /ng-monitoring-server \
    --pd.endpoints basicai-pd.tidb-cluster:2379 \
    --advertise-address ${POD_NAME}.basicai-ng-monitoring.tidb-cluster:12020 \
    --config /etc/ng-monitoring/ng-monitoring.toml \
    --storage.path /var/lib/ng-monitoring
[Encountered Problem: Phenomenon and Impact]
- Problem: the ng-monitoring pod quickly fills up its storage volume and keeps restarting because no space is left.
- Impact: ng-monitoring cannot be used normally.
How can I control ng-monitoring's data retention? I noticed that the official example does not allocate a particularly large volume either.
I deleted the TidbNGMonitoring object along with its PVC and redeployed. Within just a few minutes, the /var/lib/ng-monitoring/docdb directory had already grown to 112.4M and kept growing at that rate.
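A minimal sketch for watching that growth from outside the pod; the pod name basicai-ng-monitoring-0 is an assumption based on the StatefulSet naming the Operator typically uses:

  # Hypothetical pod name; docdb and tsdb are the data directories ng-monitoring
  # keeps under --storage.path (docdb for profiles, tsdb for Top SQL time series).
  kubectl -n tidb-cluster exec basicai-ng-monitoring-0 -- \
    du -sh /var/lib/ng-monitoring/docdb /var/lib/ng-monitoring/tsdb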
50GB might be a bit too little; try allocating more.
The official documentation only mentions 10G; it doesn't say how large the volume should be or how the data can be managed. If the data can't be managed, I think even 1TB might not be enough.
Enable Continuous Profiling:
- In TiDB Dashboard, go to Advanced Debugging > Profiling Instances > Continuous Profiling.
- Click Open Settings. On the Settings page on the right, turn on the switch under Enable Feature, then set the Retention Period or keep the default value.
I don't know whether the Retention Period option actually caps the data size, but its minimum value is 3 days.
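One way to check which settings ng-monitoring is actually running with, assuming it serves its dynamic configuration at a /config HTTP endpoint on port 12020 (the port visible in --advertise-address above); both the endpoint and the pod name are assumptions to verify against your version:

  # Hypothetical pod name and endpoint; port 12020 taken from the pod's arguments above.
  kubectl -n tidb-cluster port-forward pod/basicai-ng-monitoring-0 12020:12020 &
  curl -s http://127.0.0.1:12020/config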
Normally it shouldn't store that much data. Does your cluster have many nodes?
I think you can try upgrading; 6.1.0 is quite old, and it's best to move to the latest version.
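As a minimal sketch (not from the thread), bumping only the monitoring component could look like the patch below; whether a newer ng-monitoring works against a 6.1.0 cluster, and whether the private baseImage mirror carries the target tag, are assumptions you would need to verify:

  # Hypothetical one-liner; substitute whatever target version you have validated.
  kubectl -n tidb-cluster patch tidbngmonitoring basicai --type merge \
    -p '{"spec":{"ngMonitoring":{"version":"v7.5.1"}}}'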
We are currently evaluating the upgrade to 7.5.1. The development and testing environments have already been upgraded, but the production environment upgrade will not happen so quickly.
Thank you for the guidance. Based on this table, I calculated that, given my cluster size and node count, the ng-monitoring disk usage should not exceed 40GB. I have configured 50GB and will watch whether it fills up again.
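A back-of-the-envelope version of that kind of calculation; the per-instance daily figure and the counts below are placeholders, not numbers from the documentation:

  # All figures hypothetical: usage scales roughly as
  # per-instance daily data x instance count x retention days.
  per_instance_mb_per_day=450
  instances=30
  retention_days=3
  echo "$(( per_instance_mb_per_day * instances * retention_days / 1024 )) GiB"   # ~39 GiB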
Older versions of ng-monitoring have known issues where historical data is not cleaned up, so upgrading is recommended. You can search GitHub for the specific issues; one I know of is "conprof consume too much disk space and gc doesn't release disk space" (pingcap/ng-monitoring#120, https://github.com/pingcap/ng-monitoring/issues/120).
We can only upgrade the production environment once the evaluation in our development and testing environments has passed.