Logs showing info logs as errors for Grafana Agent
emilianoberonich
HOBBYOP

3 months ago

In the logs of my Grafana Agent, I'm seeing a lot of entries flagged as errors (a red line) and a message saying the service failed, even though the entries themselves look like ordinary info logs. These are the logs:

ts=2025-11-23T23:03:59.658472811Z caller=reporter.go:38 level=info msg="running usage stats reporter"

ts=2025-11-23T23:03:59.662703315Z caller=wal.go:211 level=info agent=prometheus instance=91f035bae841b2d55355378392d4d29a msg="replaying WAL, this may take a while" dir=/tmp/agent/wal/91f035bae841b2d55355378392d4d29a/wal

ts=2025-11-23T23:03:59.663330445Z caller=wal.go:260 level=info agent=prometheus instance=91f035bae841b2d55355378392d4d29a msg="WAL segment loaded" segment=0 maxSegment=0

ts=2025-11-23T23:03:59.664494977Z caller=dedupe.go:112 agent=prometheus instance=91f035bae841b2d55355378392d4d29a component=remote level=info remote_name=91f035-dd0669 url=https://prometheus-prod-56-prod-us-east-2.grafana.net/api/prom/push msg="Starting WAL watcher" queue=91f035-dd0669

ts=2025-11-23T23:03:59.664519314Z caller=dedupe.go:112 agent=prometheus instance=91f035bae841b2d55355378392d4d29a component=remote level=info remote_name=91f035-dd0669 url=https://prometheus-prod-56-prod-us-east-2.grafana.net/api/prom/push msg="Starting scraped metadata watcher"

ts=2025-11-23T23:03:59.647168083Z caller=main.go:75 level=info boringcryptoenabled=false

ts=2025-11-23T23:03:59.65120906Z caller=node.go:85 level=info agent=prometheus component=cluster msg="applying config"

ts=2025-11-23T23:03:59.650650169Z caller=server.go:190 level=info msg="server listening on addresses" http=127.0.0.1:12345 grpc=127.0.0.1:12346 http_tls_enabled=false grpc_tls_enabled=false

ts=2025-11-23T23:03:59.651397633Z caller=remote.go:180 level=info agent=prometheus component=cluster msg="not watching the KV, none set"

ts=2025-11-23T23:03:59.652512657Z caller=zapadapter.go:78 level=info component=traces msg="Traces Logger Initialized"

ts=2025-11-23T23:03:59.664647134Z caller=dedupe.go:112 agent=prometheus instance=91f035bae841b2d55355378392d4d29a component=remote level=info remote_name=91f035-dd0669 url=https://prometheus-prod-56-prod-us-east-2.grafana.net/api/prom/push msg="Replaying WAL" queue=91f035-dd0669

ts=2025-11-23T23:04:59.667467551Z caller=dedupe.go:112 agent=prometheus instance=91f035bae841b2d55355378392d4d29a component=remote level=info remote_name=91f035-dd0669 url=https://prometheus-prod-56-prod-us-east-2.grafana.net/api/prom/push msg="Done replaying WAL" duration=1m0.002915157s

ts=2025-11-23T23:11:00.183902242Z caller=dedupe.go:112 agent=prometheus instance=91f035bae841b2d55355378392d4d29a component=remote level=info remote_name=91f035-dd0669 url=https://prometheus-prod-56-prod-us-east-2.grafana.net/api/prom/push msg="Retrying after duration specified by Retry-After header" duration=6s

ts=2025-11-23T23:11:06.184915732Z caller=dedupe.go:112 agent=prometheus instance=91f035bae841b2d55355378392d4d29a component=remote level=warn remote_name=91f035-dd0669 url=https://prometheus-prod-56-prod-us-east-2.grafana.net/api/prom/push msg="Failed to send batch, retrying" err="server returned HTTP status 429 Too Many Requests: the request has been rejected because the tenant exceeded the request rate limit, set to 75 requests/s across all distributors with a maximum allowed burst of 750 (err-mimir-tenant-max-request-rate). To adjust the related per-tenant limits, configure -distributor.request-rate-limit and -distributor.request-burst-size, or contact your service administrator."

ts=2025-11-23T23:34:00.17263403Z caller=dedupe.go:112 agent=prometheus instance=91f035bae841b2d55355378392d4d29a component=remote level=info remote_name=91f035-dd0669 url=https://prometheus-prod-56-prod-us-east-2.grafana.net/api/prom/push msg="Retrying after duration specified by Retry-After header" duration=8s

ts=2025-11-23T23:34:08.172780786Z caller=dedupe.go:112 agent=prometheus instance=91f035bae841b2d55355378392d4d29a component=remote level=warn remote_name=91f035-dd0669 url=https://prometheus-prod-56-prod-us-east-2.grafana.net/api/prom/push msg="Failed to send batch, retrying" err="server returned HTTP status 429 Too Many Requests: the request has been rejected because the tenant exceeded the request rate limit, set to 75 requests/s across all distributors with a maximum allowed burst of 750 (err-mimir-tenant-max-request-rate). To adjust the related per-tenant limits, configure -distributor.request-rate-limit and -distributor.request-burst-size, or contact your service administrator."

ts=2025-11-24T00:03:59.66707281Z caller=wal.go:490 level=info agent=prometheus instance=91f035bae841b2d55355378392d4d29a msg="series GC completed" duration=1.415645ms

ts=2025-11-24T01:03:59.668725656Z caller=wal.go:490 level=info agent=prometheus instance=91f035bae841b2d55355378392d4d29a msg="series GC completed" duration=968.603µs

ts=2025-11-24T02:03:59.670811892Z caller=wal.go:490 level=info agent=prometheus instance=91f035bae841b2d55355378392d4d29a msg="series GC completed" duration=949.695µs

Solved · $10 Bounty

Pinned Solution

3 months ago

Hey! I see a WAL replay at the top followed by a bunch of remote_write requests with 429 HTTP statuses (Too Many Requests) in the logs. From what I can tell, this means your Grafana Agent service is:

1. restarting and replaying the WAL (write-ahead log), which holds the yet-to-be-sent metric records
2. sending too many metric records at once to Grafana Cloud
3. failing/restarting because it can't get any metrics out at all
4. going back to step 1 in an infinite restart loop

To resolve this, I would try a few things in your Grafana Agent config:

- reduce the number of series/labels you're sending over. You can do that in your application OR use metric_relabel_configs in the metrics section of your agent
- increase your scrape_interval: maybe move to 30s - 1m intervals in the metrics section of your Grafana Agent
- I've never had to do this, but you can tune the queue_config settings under remote_write in the metrics section to control how metrics are batched and sent out
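Put together, the suggestions above might look something like this in a static-mode Grafana Agent config. This is just a sketch: the job name `my-app`, the target, the drop regex, and the queue_config numbers are placeholders you'd adjust to your own setup.

```yaml
metrics:
  global:
    scrape_interval: 1m            # scrape less often to cut the request rate
  configs:
    - name: default
      scrape_configs:
        - job_name: my-app         # hypothetical scrape job
          static_configs:
            - targets: ['localhost:8080']
          metric_relabel_configs:
            # drop high-cardinality series you don't actually need
            - source_labels: [__name__]
              regex: 'go_gc_.*'
              action: drop
      remote_write:
        - url: https://prometheus-prod-56-prod-us-east-2.grafana.net/api/prom/push
          queue_config:
            # send fewer, larger batches to stay under the 75 req/s limit
            max_shards: 5
            max_samples_per_send: 2000
            batch_send_deadline: 15s
```

The queue_config fields here are the standard Prometheus remote_write ones; fewer shards and larger batches means fewer HTTP requests per second against the rate limit mentioned in the 429 error.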

TL;DR: your service is sending too many requests at once to Grafana Cloud, so they're being rejected. To fix it, you have to reduce the number of records you're sending at once, one way or another.

2 Replies


emilianoberonich
HOBBYOP

3 months ago

Thanks for the super quick answer. I'm just starting to configure the Grafana Agent and don't have much experience with it, so I hadn't spotted any actual errors myself. Thanks again for the helpful answer. I'll look into the problem.


Status changed to Solved by brody, 3 months ago

