Prometheus Node Memory Surge Leading to OOM

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: prometheus 节点内存出现暴涨,导致发生omm

| username: 孤独的狼

【TiDB Usage Environment】Production Environment
【TiDB Version】v4.0.9

【Reproduction Path】What operations were performed to cause the issue

【Encountered Issue: Problem Phenomenon and Impact】
Prometheus shows a normal status but cannot collect data.

【Resource Configuration】
【Attachments: Screenshots/Logs/Monitoring】

drwxr-xr-x 3 tidb tidb 68 Jan 17 10:02 01G00ZE4JVR1DEFJVMTRDBV9P5
drwxr-xr-x 3 tidb tidb 68 Jan 17 10:02 01G06RTS5NFYSG50ZJ1KHZX7Y0
drwxr-xr-x 3 tidb tidb 68 Jan 17 10:02 01G0CJ7CM5VBB4C8J09ZWN84H2
drwxr-xr-x 3 tidb tidb 68 Jan 17 10:02 01G0JBKYDSVDXY7YH0BBG5942H
drwxr-xr-x 3 tidb tidb 88 Jan 17 10:02 01G0R50J9CTEBYNFG5A3M79ZWG
drwxr-xr-x 3 tidb tidb 68 Jan 17 10:02 01G0XYD67Y2SC71WM5MB69GN2N
drwxr-xr-x 3 tidb tidb 68 Jan 17 10:02 01G13QSS84BYSXDVRCJ9CEWQQW
drwxr-xr-x 3 tidb tidb 68 Jan 17 10:02 01G19H6D5FYNWN4F6P6VAMMK3K
drwxr-xr-x 3 tidb tidb 68 Jan 17 10:02 01G1FAK1KEN08QEYN0CHQK06FX
drwxr-xr-x 3 tidb tidb 68 Jan 17 10:02 01G1N3ZMT2279WZHM7NEGJSWFW
drwxr-xr-x 3 tidb tidb 68 Jan 17 10:02 01G1TXC87AHEZ4YY60V06CBTPB
drwxr-xr-x 3 tidb tidb 88 Jan 17 10:02 01G20PRVX46MK3ZQAEQMT96MHG
drwxr-xr-x 3 tidb tidb 68 Jan 17 10:02 01G26G5G4KE40K16Q0CSZM15VY
drwxr-xr-x 3 tidb tidb 88 Jan 17 10:02 01G28DYXFHG6VRDAA4X63P932X
drwxr-xr-x 3 tidb tidb 88 Jan 17 10:02 01G292HXZQF785Y8A0RWH9R7PM
drwxr-xr-x 3 tidb tidb 88 Jan 17 10:02 01G292J1Z8NH0P2EJXFA6797EQ
drwxr-xr-x 3 tidb tidb 68 Jan 17 10:02 01G29F0D2J3VMH0799WC8D5VMS
drwxr-xr-x 3 tidb tidb 68 Jan 17 10:02 01G29G9CCBNPWM6KQ1N44FXS3K
drwxr-xr-x 3 tidb tidb 33 Dec 14 00:00 01GM63EDAWGHA85R4ST7KM454V.tmp
drwxr-xr-x 3 tidb tidb 68 Jan 17 10:02 01GMDTV05ZPY8TKJZ2DGMYK1D3
drwxr-xr-x 3 tidb tidb 33 Dec 18 21:31 01GMJPY4ZQMH1QR5DGJ7QY9N4A.tmp
drwxr-xr-x 3 tidb tidb 68 Jan 17 10:02 01GMQ11XZF9GFHADYCXTDQD5W4
drwxr-xr-x 3 tidb tidb 68 Jan 17 10:02 01GN6KPH5WDV7DF4XWPYXFGE58
drwxr-xr-x 3 tidb tidb 68 Jan 17 10:02 01GNBZCD75QHVG9A95KVYNQX1M
drwxr-xr-x 3 tidb tidb 33 Jan 9 10:05 01GPA4DGP40PSJFXQVCKKWGE6F.tmp
drwxr-xr-x 3 tidb tidb 33 Jan 15 08:01 01GPSBNQRHK5W35F02KTXCZT4S.tmp
-rw-r--r-- 1 tidb tidb 0 Jul 7 2020 lock
drwxr-xr-x 3 tidb tidb 24576 Jan 15 07:44 wal

[Screenshot attached in the original post]

| username: 孤独的狼 | Original post link

Prometheus Logs
level=warn ts=2023-01-17T02:02:38.369248526Z caller=main.go:274 deprecation_notice=“‘storage.tsdb.retention’ flag is deprecated use ‘storage.tsdb.retention.time’ instead.”
level=info ts=2023-01-17T02:02:38.369346127Z caller=main.go:321 msg=“Starting Prometheus” version=“(version=2.8.1, branch=HEAD, revision=4d60eb36dcbed725fcac5b27018574118f12fffb)”
level=info ts=2023-01-17T02:02:38.369371412Z caller=main.go:322 build_context=“(go=go1.11.6, user=root@bfdd6a22a683, date=20190328-18:04:08)”
level=info ts=2023-01-17T02:02:38.369393031Z caller=main.go:323 host_details=“(Linux 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016 x86_64 GZ006069Monitor (none))”
level=info ts=2023-01-17T02:02:38.369417405Z caller=main.go:324 fd_limits=“(soft=1000000, hard=1000000)”
level=info ts=2023-01-17T02:02:38.369435024Z caller=main.go:325 vm_limits=“(soft=unlimited, hard=unlimited)”
level=info ts=2023-01-17T02:02:38.370336222Z caller=main.go:640 msg=“Starting TSDB …”
level=info ts=2023-01-17T02:02:38.370387979Z caller=web.go:418 component=web msg=“Start listening for connections” address=:9090
level=info ts=2023-01-17T02:02:38.370877652Z caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1649095200000 maxt=1649289600000 ulid=01G00ZE4JVR1DEFJVMTRDBV9P5
level=info ts=2023-01-17T02:02:38.370975598Z caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1649289600000 maxt=1649484000000 ulid=01G06RTS5NFYSG50ZJ1KHZX7Y0
level=info ts=2023-01-17T02:02:38.371039848Z caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1649484000000 maxt=1649678400000 ulid=01G0CJ7CM5VBB4C8J09ZWN84H2
level=info ts=2023-01-17T02:02:38.371101029Z caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1649678400000 maxt=1649872800000 ulid=01G0JBKYDSVDXY7YH0BBG5942H
level=info ts=2023-01-17T02:02:38.371177277Z caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1649872800000 maxt=1650067200000 ulid=01G0R50J9CTEBYNFG5A3M79ZWG
level=info ts=2023-01-17T02:02:38.371236679Z caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1650067200000 maxt=1650261600000 ulid=01G0XYD67Y2SC71WM5MB69GN2N
level=info ts=2023-01-17T02:02:38.371302421Z caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1650261600000 maxt=1650456000000 ulid=01G13QSS84BYSXDVRCJ9CEWQQW
level=info ts=2023-01-17T02:02:38.371368633Z caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1650456000000 maxt=1650650400000 ulid=01G19H6D5FYNWN4F6P6VAMMK3K
level=info ts=2023-01-17T02:02:38.371430882Z caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1650650400000 maxt=1650844800000 ulid=01G1FAK1KEN08QEYN0CHQK06FX
level=info ts=2023-01-17T02:02:38.371493539Z caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1650844800000 maxt=1651039200000 ulid=01G1N3ZMT2279WZHM7NEGJSWFW
level=info ts=2023-01-17T02:02:38.371548941Z caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1651039200000 maxt=1651233600000 ulid=01G1TXC87AHEZ4YY60V06CBTPB
level=info ts=2023-01-17T02:02:38.37160945Z caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1651233600000 maxt=1651428000000 ulid=01G20PRVX46MK3ZQAEQMT96MHG
level=info ts=2023-01-17T02:02:38.371661472Z caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1651428000000 maxt=1651622400000 ulid=01G26G5G4KE40K16Q0CSZM15VY
level=info ts=2023-01-17T02:02:38.371709781Z caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1651622400000 maxt=1651687200000 ulid=01G28DYXFHG6VRDAA4X63P932X
level=info ts=2023-01-17T02:02:38.371737851Z caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1651708800000 maxt=1651716000000 ulid=01G292HXZQF785Y8A0RWH9R7PM
level=info ts=2023-01-17T02:02:38.371780131Z caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1651687200000 maxt=1651708800000 ulid=01G292J1Z8NH0P2EJXFA6797EQ
level=info ts=2023-01-17T02:02:38.371812556Z caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1651716000000 maxt=1651723200000 ulid=01G29F0D2J3VMH0799WC8D5VMS
level=info ts=2023-01-17T02:02:38.371845153Z caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1651723200000 maxt=1651730400000 ulid=01G29G9CCBNPWM6KQ1N44FXS3K
level=info ts=2023-01-17T02:02:38.371877454Z caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1670832000000 maxt=1670839200000 ulid=01GMDTV05ZPY8TKJZ2DGMYK1D3
level=info ts=2023-01-17T02:02:38.371910664Z caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1670839200000 maxt=1670846400000 ulid=01GMQ11XZF9GFHADYCXTDQD5W4
level=info ts=2023-01-17T02:02:38.371942495Z caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1670846400000 maxt=1670853600000 ulid=01GN6KPH5WDV7DF4XWPYXFGE58
level=info ts=2023-01-17T02:02:38.371974147Z caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1670853600000 maxt=1670860800000 ulid=01GNBZCD75QHVG9A95KVYNQX1M

^C
[root@GZ006069Monitor log]# tail -200f prometheus.log

| username: 孤独的狼 | Original post link

Logs collected by Grafana

t=2023-01-17T10:05:12+0800 lvl=eror msg=“Alert Rule Result Error” logger=alerting.evalContext ruleId=72 name=“server report failures alert” error=“Could not find datasource Data source not found” changing state to=alerting
t=2023-01-17T10:05:12+0800 lvl=eror msg=“Alert Rule Result Error” logger=alerting.evalContext ruleId=31 name=“TiKV channel full alert” error=“Could not find datasource Data source not found” changing state to=alerting
t=2023-01-17T10:05:12+0800 lvl=eror msg=“Alert Rule Result Error” logger=alerting.evalContext ruleId=35 name=“TiKV scheduler worker CPU alert” error=“Could not find datasource Data source not found” changing state to=alerting
t=2023-01-17T10:05:14+0800 lvl=eror msg=“Alert Rule Result Error” logger=alerting.evalContext ruleId=61 name=“TiKV Storage ReadPool CPU alert” error=“Could not find datasource Data source not found” changing state to=alerting
t=2023-01-17T10:05:14+0800 lvl=eror msg=“Alert Rule Result Error” logger=alerting.evalContext ruleId=60 name=“Async apply CPU alert” error=“Could not find datasource Data source not found” changing state to=alerting
t=2023-01-17T10:05:15+0800 lvl=eror msg=“Alert Rule Result Error” logger=alerting.evalContext ruleId=63 name=“TiKV gRPC poll CPU alert” error=“Could not find datasource Data source not found” changing state to=alerting
t=2023-01-17T10:05:15+0800 lvl=eror msg=“Alert Rule Result Error” logger=alerting.evalContext ruleId=22 name=“Lock Resolve OPS alert” error=“Could not find datasource Data source not found” changing state to=alerting
t=2023-01-17T10:05:16+0800 lvl=eror msg=“Alert Rule Result Error” logger=alerting.evalContext ruleId=48 name=“Transaction Retry Num alert” error=“Could not find datasource Data source not found” changing state to=alerting
t=2023-01-17T10:05:16+0800 lvl=eror msg=“Alert Rule Result Error” logger=alerting.evalContext ruleId=25 name=“gRPC poll CPU alert” error=“Could not find datasource Data source not found” changing state to=alerting
t=2023-01-17T10:05:16+0800 lvl=eror msg=“Alert Rule Result Error” logger=alerting.evalContext ruleId=40 name=“etcd disk fsync” error=“Could not find datasource Data source not found” changing state to=alerting
t=2023-01-17T10:05:17+0800 lvl=eror msg=“Alert Rule Result Error” logger=alerting.evalContext ruleId=33 name=“TiKV raft store CPU alert” error=“Could not find datasource Data source not found” changing state to=alerting
t=2023-01-17T10:05:20+0800 lvl=eror msg=“Alert Rule Result Error” logger=alerting.evalContext ruleId=57 name=“Append log duration alert” error=“Could not find datasource Data source not found” changing state to=alerting
t=2023-01-17T10:05:20+0800 lvl=eror msg=“Alert Rule Result Error” logger=alerting.evalContext ruleId=45 name=“Parse Duration alert” error=“Could not find datasource Data source not found” changing state to=alerting
t=2023-01-17T10:05:20+0800 lvl=eror msg=“Alert Rule Result Error” logger=alerting.evalContext ruleId=56 name=“TiKV raft store CPU alert” error=“Could not find datasource Data source not found” changing state to=alerting
t=2023-01-17T10:05:20+0800 lvl=eror msg=“Alert Rule Result Error” logger=alerting.evalContext ruleId=46 name=“Compile Duration alert” error=“Could not find datasource Data source not found” changing state to=alerting
t=2023-01-17T10:05:20+0800 lvl=eror msg=“Alert Rule Result Error” logger=alerting.evalContext ruleId=29 name=“Critical error alert” error=“Could not find datasource Data source not found” changing state to=alerting
t=2023-01-17T10:05:21+0800 lvl=info msg=“Shutdown started” logger=server reason=“System signal: terminated”
t=2023-01-17T10:05:21+0800 lvl=info msg=“Stopped NotificationService” logger=server reason=“context canceled”
t=2023-01-17T10:05:21+0800 lvl=info msg=“Stopped CleanUpService” logger=server reason=“context canceled”
t=2023-01-17T10:05:21+0800 lvl=info msg=“Stopped TracingService” logger=server reason=nil
t=2023-01-17T10:05:21+0800 lvl=info msg=“Stopped ProvisioningService” logger=server reason=“context canceled”
t=2023-01-17T10:05:21+0800 lvl=info msg=“Stopped AlertingService” logger=server reason=“context canceled”
t=2023-01-17T10:05:21+0800 lvl=info msg=“Stopped RemoteCache” logger=server reason=“context canceled”
t=2023-01-17T10:05:21+0800 lvl=info msg=“Stopped UsageStatsService” logger=server reason=“context canceled”
t=2023-01-17T10:05:21+0800 lvl=info msg=“Stopped Stream Manager”
t=2023-01-17T10:05:21+0800 lvl=info msg=“Stopped RenderingService” logger=server reason=nil
t=2023-01-17T10:05:21+0800 lvl=info msg=“Stopped UserAuthTokenService” logger=server reason=“context canceled”
t=2023-01-17T10:05:21+0800 lvl=info msg=“Stopped PluginManager” logger=server reason=“context canceled”
t=2023-01-17T10:05:21+0800 lvl=info msg=“Stopped InternalMetricsService” logger=server reason=“context canceled”
t=2023-01-17T10:05:21+0800 lvl=info msg=“Stopped HTTPServer” logger=server reason=nil
t=2023-01-17T10:05:21+0800 lvl=eror msg=“Server shutdown” logger=server reason=“System signal: terminated”

^C
[root@GZ006069Monitor logs]# free -g
total used free shared buff/cache available
Mem: 62 2 21 0 38 59
Swap: 7 0 7
[root@GZ006069Monitor logs]#
[root@GZ006069Monitor logs]# free -g
total used free shared buff/cache available
Mem: 62 2 21 0 38 59
Swap: 7 0 7
[root@GZ006069Monitor logs]#
[root@GZ006069Monitor logs]#
[root@GZ006069Monitor logs]# tail -f grafana.log

| username: 孤独的狼 | Original post link

Due to resource constraints, monitoring was turned off for a period of time. The historical monitoring data can be deleted; I just want to collect the latest monitoring data.

| username: TiDBer_jYQINSnf | Original post link

Is Prometheus restarting due to a failure? Restarting after a crash requires replaying the WAL, which needs a considerable amount of memory. I’m not familiar with TiUP. If you don’t need the data, you might try deleting and recreating the monitoring. In Kubernetes, you usually delete the PV and the pod, and it will automatically restart.
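
As a quick check of how heavy that WAL replay might be, one can look at the size and segment count of the wal directory under the Prometheus data path (the path below is illustrative; use the actual data directory of your deployment, the one whose listing is shown above):

du -sh /path/to/prometheus/data/wal          # total WAL size that must be replayed on startup
ls /path/to/prometheus/data/wal | wc -l      # number of WAL segment files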

| username: 裤衩儿飞上天 | Original post link

If the historical data can be deleted, it is much simpler to just scale in and then scale out the Prometheus node directly.

| username: DBRE | Original post link

Filter the Prometheus logs for the "Starting TSDB" keyword to see whether Prometheus is continuously restarting. Additionally, you can add the following configuration to prometheus.yml to reduce the amount of data collected, then restart Prometheus with tiup cluster restart xxx -R prometheus. Note, however, that with this method prometheus.yml will be overwritten whenever the topology changes. For more details, refer to 专栏 - TiDB监控Prometheus磁盘内存问题 | TiDB 社区.
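
For example, a quick way to check for repeated restarts (log file name as in the tail command shown above):

grep -c "Starting TSDB" prometheus.log          # count of Prometheus start-ups recorded in this log
grep "Starting TSDB" prometheus.log | tail -n 5 # timestamps of the most recent start-ups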

vim prometheus.yml

Find the section with job_name: "tikv" and add:

metric_relabel_configs:
  - source_labels: [__name__]
    separator: ;
    regex: tikv_thread_nonvoluntary_context_switches|tikv_thread_voluntary_context_switches|tikv_threads_io_bytes_total
    action: drop
  - source_labels: [__name__,name]
    separator: ;
    regex: tikv_thread_cpu_seconds_total;(tokio|rocksdb).+
    action: drop

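Before restarting, it may be worth validating the edited file; a small sketch, assuming promtool is available next to the Prometheus binary (it ships with upstream Prometheus releases):

./promtool check config prometheus.yml             # should report SUCCESS if the YAML and relabel rules parse
tiup cluster restart <cluster-name> -R prometheus
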
| username: 孤独的狼 | Original post link

I think the problem is that the tidb-server process is not running. You can check if the process is running with the following command:

ps -ef | grep tidb-server

If it is not running, you can start it with the following command:

tidb-server --config=conf/tidb.toml
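
In a tiup-managed cluster, an alternative way to check component status (assuming tiup is installed on the control machine) is:

tiup cluster display <cluster-name>    # lists every component and its status, including Prometheus
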
| username: 孤独的狼 | Original post link

Deleting it is a bit troublesome; I’d prefer not to rebuild if possible.

| username: 孤独的狼 | Original post link

How exactly do I use scale-in and scale-out?

| username: 孤独的狼 | Original post link

I added this, but it still doesn’t work. I suspect it’s an issue with the historical data:
level=warn ts=2023-01-17T03:06:08.314093161Z caller=head.go:450 component=tsdb msg="unknown series references" count=78760

^C
[root@GZ006069Monitor log]# tail -200f prometheus.log

Error: unknown series references
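
Since the historical data is considered disposable here, one rough way past corrupt blocks/WAL is to stop Prometheus, move the old TSDB data aside, and start it again on an empty store. This is only a sketch; the data path is illustrative, so confirm the actual data directory of the monitoring server (e.g. from its run_prometheus.sh start script) before touching anything:

tiup cluster stop <cluster-name> -R prometheus
mv /path/to/prometheus/data /path/to/prometheus/data.bak   # keep a backup instead of deleting outright
tiup cluster start <cluster-name> -R prometheus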

| username: 孤独的狼 | Original post link

The default value of tidb_dml_batch_size is 2000. You can adjust it according to your needs.

| username: DBRE | Original post link

This needs to be observed for a while; it won’t return to normal immediately, including web/API access. At this stage, Prometheus should still be doing some work in the background.

| username: 孤独的狼 | Original post link

I can’t wait too long; the memory consumption is too high. There are also forwarding services running on the same machine. If the monitoring fails it won’t affect the online services, but if the forwarding fails it will.

| username: 孤独的狼 | Original post link

[The original post contained only an image, which could not be translated.]

| username: 裤衩儿飞上天 | Original post link

tiup cluster scale-in <cluster-name> [flags]

tiup cluster scale-out <cluster-name> <topology.yaml> [flags]
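
For example, a rough sketch; <cluster-name> and <monitor-host> are placeholders, and it assumes a tiup version that supports scaling the monitoring component in and out:

# remove the existing Prometheus node (use the host:port shown by tiup cluster display)
tiup cluster scale-in <cluster-name> --node <monitor-host>:9090

# minimal topology file for adding Prometheus back
cat > scale-out-monitor.yaml <<EOF
monitoring_servers:
  - host: <monitor-host>
EOF

tiup cluster scale-out <cluster-name> scale-out-monitor.yaml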

| username: Jellybean | Original post link

Here are some ways to handle expired historical data in Prometheus. It is recommended to verify them in a test environment before executing them in production.

  1. Temporary solution: You can refer to this post and give it a try: prometheus的监控数据可以删除么? - TiDB 的问答社区

     Delete all data within a specific time range:

     curl -X POST -g 'http://127.0.0.1:9090/api/v1/admin/tsdb/delete_series?match[]={__name__=~".+"}&start=2022-03-29T00:00:00Z&end=2022-03-30T00:00:00Z'

  2. Long-term solution:
     Recommended: Use tiup cluster edit-config to change the storage_retention configuration of Prometheus to set the data retention duration, then reload Prometheus with tiup.

     (Another way, not recommended but effective: modify the --storage.tsdb.retention parameter value in the Prometheus startup script, then reload Prometheus with tiup.)
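
Two hedged notes on the above (the cluster name and retention value below are placeholders): the delete_series endpoint is part of the Prometheus TSDB admin API, so it only works if Prometheus was started with --web.enable-admin-api, and deleted series are only tombstoned until a compaction or an explicit clean_tombstones call reclaims the disk space. A rough sketch:

# check whether the admin API flag is present on the running process
ps -ef | grep prometheus | grep -o 'web.enable-admin-api'

# after delete_series, ask the TSDB to actually drop the tombstoned data from disk
curl -X POST 'http://127.0.0.1:9090/api/v1/admin/tsdb/clean_tombstones'

# long-term: set storage_retention (e.g. "15d") under monitoring_servers in the topology, then reload
tiup cluster edit-config <cluster-name>
tiup cluster reload <cluster-name> -R prometheus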

| username: DBRE | Original post link

You can first scale in and then scale out the Prometheus node, and then modify prometheus.yml.

| username: TiDBer_jYQINSnf | Original post link

Rebuilding the monitoring has no impact. Your monitoring has been off for so long that the data is already outdated anyway. The user above (裤衩儿飞上天) has already given you a method. If you still have questions, check the tiup documentation: TiUP 文档地图 | PingCAP 文档中心

| username: WalterWj | Original post link

Well, on one hand, it depends on the amount of data in Prometheus. Generally, data is stored for 30 days, but if there are many instances, the stored data will be larger.

On the other hand, it could be that when you were viewing Grafana, you pulled 30 days of data and looked at some complex, memory-intensive monitoring panels, which caused the issue.

The usual solutions are to reduce the retention period or add more memory.