TiDB node_exporter Collection Slow Timeout

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB node_exporter采集缓慢超时

| username: asd80703406

[TiDB Usage Environment] Production Environment
[TiDB Version] 5.3.0
[Reproduction Path] Mixed Environment
[Encountered Problem: Phenomenon and Impact]
Phenomenon: curl node1:9100/metrics takes 10-20s to collect, sometimes even longer.
Impact: Grafana monitoring lines are intermittent.
Cause Investigation:
By default, the startup parameters are: bin/node_exporter/node_exporter --web.listen-address=:9100 --collector.tcpstat --collector.systemd --collector.mountstats --collector.meminfo_numa --collector.interrupts --collector.buddyinfo --collector.vmstat.fields=^.* --log.level=info
From the phenomenon, this looks like slow collection, so I tried excluding collectors from the list one by one. Testing showed that enabling --collector.systemd slows the collection down significantly, so the culprit is identified: the systemd collector is slow.
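A minimal sketch of that comparison (hostname and port taken from the post; start node_exporter once with and once without --collector.systemd, keeping the other flags unchanged):

# With the systemd collector enabled -- this reproduces the 10-20s scrape
time curl -s -o /dev/null http://node1:9100/metrics

# Restart node_exporter with the same flags minus --collector.systemd, then time it again
time curl -s -o /dev/null http://node1:9100/metrics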
What I want to know is:

  1. What data does collector.systemd collect exactly? If disabled, what is the impact on monitoring?
  2. How can I further pinpoint why collecting the systemd data is slow? I tried starting node_exporter under strace -f -T -tt, but it didn't clearly show where the time was going; each call seemed quite fast… A rough way to narrow this down is sketched right after this list.
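A hedged sketch for narrowing this down, assuming node_exporter's built-in per-collector timing metric (node_scrape_collector_duration_seconds) and that the systemd collector's cost is dominated by enumerating units over D-Bus:

# How long did each collector take on the last scrape?
curl -s http://node1:9100/metrics | grep '^node_scrape_collector_duration_seconds'

# The systemd collector walks the unit list over D-Bus; timing systemctl's own
# listing gives a rough idea of whether systemd/D-Bus itself is the slow part
time systemctl list-units --all --no-pager > /dev/null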

[Resource Configuration] 32C 256G *3 machines
[Attachments: Screenshots/Logs/Monitoring] There are corporate watermarks, possibly confidential, so I won’t post screenshots… The attachment is my strace log
strace.log (41.3 MB)

time="2024-01-24T19:07:43+08:00" level=error msg="error encoding and sending metric family: write tcp IP1:9100->IP2:53648: write: broken pipe\n" source="log.go:172"

Additionally, the machine’s dbus-daemon CPU utilization is 50%, while other nodes are at 1-2%.
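A quick, hedged way to keep an eye on that dbus-daemon CPU usage (pidstat requires the sysstat package):

# Snapshot of dbus-daemon CPU usage
ps -eo pid,pcpu,etime,comm --sort=-pcpu | grep '[d]bus-daemon'

# Sample it over a few seconds
pidstat -u -C dbus-daemon 1 5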

| username: 有猫万事足 | Original post link

I saw that someone suggested adding a timeout for systemd collection in the node_exporter project. However, it hasn’t been changed yet. This issue is still open.

| username: asd80703406 | Original post link

The expert is right. I came across this link during troubleshooting, but there was no response there. Could it be that others simply haven't encountered this issue?

| username: asd80703406 | Original post link

@h5n1 @Billmay's cousin, I sincerely invite both of you experts to help analyze this.

| username: dba远航 | Original post link

Is it related to the log collection level?

| username: asd80703406 | Original post link

That doesn't seem to be it; other servers with the same configuration don't have this issue.
For the metrics interface with the ~20s response time, the breakdown is as follows (these numbers appear to be counts of lines in the output; a sketch of how to reproduce such counts follows below):
mount information: 4286
ssh information: 76514
Others:
node_systemd_unit_state: 1810
node_interrupts_total: 800
node_systemd_unit_start_time_seconds: 362
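A rough sketch of how such counts could be produced (the exact metric prefixes for the "mount" and "ssh" lines are an assumption on my part):

curl -s http://node1:9100/metrics > /tmp/metrics.txt
grep -c '^node_systemd_unit_state'              /tmp/metrics.txt
grep -c '^node_interrupts_total'                /tmp/metrics.txt
grep -c '^node_systemd_unit_start_time_seconds' /tmp/metrics.txt
grep -c 'sshd@'                                 /tmp/metrics.txt   # ssh-related lines
grep -c 'mount'                                 /tmp/metrics.txt   # mount-related lines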

| username: 有猫万事足 | Original post link

The code contains logic that logs how many seconds each collector took to execute, but that message is only emitted at the debug log level.

If nothing else helps, the only option is to drop the log level to debug and see exactly where the time is being spent.
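Concretely, that would mean restarting node_exporter with the startup line quoted earlier, changing only --log.level (the vmstat regex is quoted here to be shell-safe):

bin/node_exporter/node_exporter --web.listen-address=:9100 --collector.tcpstat \
  --collector.systemd --collector.mountstats --collector.meminfo_numa \
  --collector.interrupts --collector.buddyinfo \
  --collector.vmstat.fields='^.*' --log.level=debug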

| username: zhang_2023 | Original post link

It seems to be related to the log level.

| username: asd80703406 | Original post link

I’ll take some time today to adjust the log level and will reply later!

| username: asd80703406 | Original post link

Hello everyone, happy New Year. Now that the holiday is over, I'm following up with the error details.
After raising the node_exporter log level to debug, the log shows the following.
A large number of lines like these:
time="2024-02-19T15:27:49+08:00" level=debug msg="couldn't get unit 'sshd@30434801-IP address 1:22-IP address 2:49440.service' NRestarts: Unknown property or interface." source="systemd_linux.go:256"
time="2024-02-19T15:28:04+08:00" level=debug msg="Adding unit: sshd@30448417-IP address 1:22-IP address 2:49440.service" source="systemd_linux.go:325"

And the final timing output:
DEBU[0099] OK: systemd collector succeeded after 79.567751s. source="collector.go:135"

| username: asd80703406 | Original post link

The core issues now are (a quick sketch for checking the first two points follows the list):

  1. Why are there so many sshd@.service instances being reported?
  2. Where are these connections coming from?
  3. Even if there are connections, they shouldn't block the collection.
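A hedged sketch for the first two questions, using standard systemd and iproute2 tooling:

# How many sshd@ unit instances does systemd currently track?
systemctl list-units --all --no-legend 'sshd@*' | wc -l

# Which remote addresses hold (or recently held) SSH connections into this node?
ss -tn 'sport = :22'
last -a | head -50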

| username: asd80703406 | Original post link

[root@node1 ~]# systemctl status
● node1
State: degraded
Jobs: 0 queued
Failed: 12742 units
Since: Fri 2023-04-28 10:19:12 CST; 9 months 23 days ago
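The 12742 failed units behind the "degraded" state can be enumerated directly; a minimal check:

# List and count the failed units
systemctl list-units --state=failed --no-pager --no-legend | head -20
systemctl list-units --state=failed --no-legend | wc -l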

| username: songxuecheng | Original post link

You can refer to this to see if it is a kernel issue.

The 5.4.0-135 kernel seems to have fixed this issue for me. More than 24 hours now without issue.

| username: asd80703406 | Original post link

Thank you!
I’ll take a look at this issue.

| username: asd80703406 | Original post link

We are currently still on:
Linux node1 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

| username: songxuecheng | Original post link

This is due to network issues. You can check the network and the load status of this machine. What services are deployed on this machine?

| username: songxuecheng | Original post link

Please also take a screenshot of dbus-monitor --system.

| username: asd80703406 | Original post link

It is now basically confirmed: there is an excessive number of sshd@ID-IP1-IP2:port.service units. When the systemd collector gathers this information, it runs into more than 10,000 failed units, which is what slows the collection down.

| username: asd80703406 | Original post link

It has been resolved. The final command executed was:
systemctl reset-failed
This cleans up the failed units, so systemd no longer keeps all of that state and the collection is fast again.
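For completeness, a hedged sketch of the cleanup plus verification (the pattern-scoped variant is optional; systemctl reset-failed accepts unit name patterns):

# Clear the accumulated failed units (optionally only the sshd@ instances)
systemctl reset-failed
# systemctl reset-failed 'sshd@*'    # narrower variant

# Verify, then re-time the scrape
systemctl list-units --state=failed --no-legend | wc -l
time curl -s -o /dev/null http://node1:9100/metrics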

| username: asd80703406 | Original post link

The root cause is that another machine frequently SSHs into this node to run shell commands that collect certain monitoring information, which left a large number of sshd-related failed units in systemd and slowed down the data collection. That monitoring node was shut down before the end of the year, and we have been busy with inspections since, so we didn't follow up on this issue until now. As of now, the issue is considered resolved. Thanks to everyone for your support!