P99 latency extremely high, sometimes 502, 522, and often timeouts

protelo

PRO • OP

12 days ago

You can see my original reply before I was told to make my own thread.

This post appears to be yet another user experiencing the same thing.

See the original reply for more details. I'm getting 502, 522, and timeouts sporadically across all my services in both US-EAST and US-WEST.

Here's another example:

Service c12a6666-d31e-4b09-86f7-e5920bc542ac - Cloudflare Proxy

Attachments

image.png

Awaiting Conductor Response • $20 Bounty

2 Replies

Railway

BOT

12 days ago

This thread has been opened as a bounty so the community can help solve it.

Status changed to Open Railway • 12 days ago

haiderali11866-ctrl

FREE

8 days ago

I've been seeing the same behavior across multiple services and regions (US-East and US-West), and after digging into it, I'm not convinced this is application-related.

The pattern is pretty consistent:

Requests are normally fast.

Suddenly latency jumps into the 20–40 second range.

Checks start failing with one of:

522 Connection Timed Out

502 Bad Gateway

client-side timeouts (~48s)

A minute later the exact same endpoint returns 200 OK again without any deployment or restart.
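
If anyone wants to tag their own uptime-check results against this pattern, here is a rough sketch of how I'd bucket them. The names (`Probe`, `classify`, `summarize`) are hypothetical and not part of any Railway or Cloudflare tooling, and the 20-second threshold is simply the low end of the spikes described above, so adjust it to your own baseline.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional

# Assumption: 20s is the low end of the 20-40s spikes described in this thread.
SLOW_THRESHOLD_S = 20.0


@dataclass
class Probe:
    status: Optional[int]  # HTTP status code, or None if the client gave up
    elapsed_s: float       # wall-clock time the check took, in seconds


def classify(probe: Probe) -> str:
    """Map one check result onto the failure modes listed above."""
    if probe.status is None:
        return "client_timeout"
    if probe.status == 522:
        return "522_connection_timed_out"
    if probe.status == 502:
        return "502_bad_gateway"
    if probe.elapsed_s >= SLOW_THRESHOLD_S:
        return "slow"
    if 200 <= probe.status < 300:
        return "ok"
    return "other"


def summarize(probes: Iterable[Probe]) -> Counter:
    """Count how many checks fall into each bucket."""
    return Counter(classify(p) for p in probes)
```

Counting the buckets per hour or per region makes it easier to show whether the spikes cluster in time across otherwise unrelated services.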

What's interesting is that these failures occur on services with very different workloads, which makes it harder to attribute to application code.

A few things worth checking:

Compare uptime checks against the Railway-provided domain and the Cloudflare-proxied domain. If both fail at the same time, Cloudflare is likely just reporting an upstream issue rather than causing it.

Look at Railway metrics during the spikes. In my case, the latency increases don't always correlate with CPU or memory pressure.

Check logs at the exact failure timestamps. If no request ever reaches the container during a 522 event, that points more toward the network/proxy path than the application itself.

Verify database pools aren't being exhausted. Connection starvation can sometimes look similar, although I'd expect more application-level errors if that were the primary cause.
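
The first check can be scripted in a few lines. This is only a sketch using the Python standard library: `probe` and `interpret` are names I made up, the 48-second timeout mirrors the client-side timeouts mentioned earlier, and the wording of the verdicts is my own reading of the logic above, not anything official.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Optional, Tuple


def probe(url: str, timeout_s: float = 48.0) -> Tuple[Optional[int], float]:
    """Issue one GET and return (status or None, elapsed seconds)."""
    start = time.monotonic()
    status: Optional[int]
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses (including 502 and 522) arrive as HTTPError.
        status = err.code
    except (OSError, http.client.HTTPException):
        # Timeouts, DNS failures, resets: no HTTP status was received.
        status = None
    return status, time.monotonic() - start


def is_ok(status: Optional[int]) -> bool:
    return status is not None and 200 <= status < 400


def interpret(railway_status: Optional[int], cloudflare_status: Optional[int]) -> str:
    """Compare the result for the Railway domain with the proxied domain."""
    railway_ok = is_ok(railway_status)
    cloudflare_ok = is_ok(cloudflare_status)
    if railway_ok and cloudflare_ok:
        return "both ok"
    if not railway_ok and not cloudflare_ok:
        return "both failing: Cloudflare is likely reporting an upstream issue"
    if railway_ok:
        return "only the Cloudflare-proxied domain is failing"
    return "only the Railway-provided domain is failing"
```

To use it, call `probe` on both of your own URLs back to back, pass the two status codes to `interpret`, and log the result with a timestamp. The two requests are sequential rather than truly simultaneous, so treat a single mismatch with caution and look for the pattern over several runs. The timestamps can then be lined up against the container logs for the third check.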

The graph attached shows response times climbing to tens of seconds before failures occur, then recovering immediately afterward. Since the service often returns to normal without intervention, it feels more like an intermittent routing/proxy issue than a crashing application.

Has anyone from Railway been able to correlate these reports with edge proxy metrics or networking events? There seem to be multiple users reporting very similar symptoms recently.

protelo

PRO • OP

3 hours ago

My last high-latency event was June 16th, and things have been much better since then. It seems their fixes (https://station.railway.com/questions/p99-response-latency-spikes-to-20-27s-1ca821d3#e79t) have resolved this.
