Intermittent 522 / “connection timed out” at the edge across all my services, ~weekly clusters

4 days ago

FWIW, AI wrote my post for me; I'm just trying to be fast.

Hi — looking for help identifying a recurring edge/proxy-level connectivity issue. It is not isolated to one service; it hits every service across multiple separate projects in my account simultaneously, which makes me think it's the HTTP proxy / networking layer rather than my apps.

Symptom

  • Requests to my public Railway domains intermittently fail for 1–6 minutes, then recover on their own with no action from me.
  • Through Cloudflare (proxied) the failures show as HTTP 522 (“connection timed out to origin”).
  • Probing the Railway origin directly (no Cloudflare) the same windows show “Timeout — no headers received.”
  • So the TCP/HTTP layer at Railway's edge stops responding; it's not an app 5xx.

Why I'm confident it's not my app

  • During these windows the containers never restart (process uptime is 40h+) and they keep processing background jobs/cron on schedule — the app is alive and the event loop is running; the edge just isn't reachable.
  • The failures hit all my projects at the same second — multiple independent services/projects that share nothing except Railway. They cannot self-correlate to the same minute unless the shared layer is at fault.
  • I run a copy of the same codebase on another platform behind the same Cloudflare setup, and it shows zero of these events in the same periods. That rules out my code and my CDN.

Pattern (UTC) — tight clusters, e.g.:

  • 2026-06-06 19:07–19:11
  • 2026-06-07 00:29
  • 2026-06-07 02:01–02:04
  • 2026-06-07 02:21–02:22
  • 2026-06-08 02:52–02:56
  • 2026-06-08 10:18
  • 2026-06-08 11:27–11:30

There's a recurring ~02:00–03:00 UTC cluster most nights, plus daytime occurrences, so it's not purely a maintenance window.
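To make the nightly clustering easier to see, the window start times listed above can be bucketed by UTC hour. This is only a minimal sketch: the function name `windows_per_hour` is illustrative, and the input is just the seven start times from the list in this post.

```python
from collections import Counter
from datetime import datetime

# Start times of the failure windows listed above (UTC).
WINDOW_STARTS = [
    "2026-06-06 19:07",
    "2026-06-07 00:29",
    "2026-06-07 02:01",
    "2026-06-07 02:21",
    "2026-06-08 02:52",
    "2026-06-08 10:18",
    "2026-06-08 11:27",
]

def windows_per_hour(starts):
    """Count how many windows begin in each UTC hour of the day."""
    hours = [datetime.strptime(s, "%Y-%m-%d %H:%M").hour for s in starts]
    return Counter(hours)

counts = windows_per_hour(WINDOW_STARTS)
for hour, n in sorted(counts.items()):
    print(f"{hour:02d}:00 UTC  {n}")
```

With only seven samples this is suggestive rather than conclusive, but the 02:00 hour is the only one that appears more than once.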

Environment

  • Region: EU West
  • Plan: Pro
  • Services affected: frontend + backend (Node), each fronted by Cloudflare proxy, origin on Railway.
  • Your public status page shows green throughout (I realise it only reports widespread incidents).

Questions

  1. Are there known edge/HTTP-proxy or regional networking events in my region in these windows that wouldn't show on the public status page?
  2. Is there anything on my side (proxy timeouts, keep-alive, healthcheck config, replica count) that could make my origins briefly unreachable at the edge without the container restarting?
  3. What diagnostics would help — I can share project IDs, service IDs, and exact deployment IDs for any of the timestamps above.

Happy to provide private details. Thanks!

Awaiting User Response

12 Replies

Railway
BOT

4 days ago

We can see your services across multiple projects are all running healthy with no restarts during the windows you reported, which is consistent with an edge-layer issue rather than an application problem. To trace this on our side, please share the X-Railway-Request-Id header from any failed requests during these windows, as that lets us follow the request through our proxy infrastructure. If you can also capture the X-Railway-Edge header from both successful and failed requests, that helps narrow whether a specific edge location is involved.


Status changed to Awaiting User Response Railway 4 days ago


4 days ago

Thanks — that matches what I'm seeing. Details below.

Edge location: every request is served by a single edge, according to the X-Railway-Edge header on successful requests:

x-railway-edge: railway/europe-west4-drams3a

Both my frontend and backend services resolve to europe-west4-drams3a, so if a specific edge node is flapping, that's the one to look at.

X-Railway-Request-Id (successful requests, just now):

api (api.essencecompetitions.co.uk/health): ySuplncXQeGuKYnJALIuDA

fe (essencecompetitions.co.uk/api/health): 4ceo_qlcRHmhDKdos_GTAg

On the failed requests, here's the key signal: during the failure windows the response carries no X-Railway-Request-Id at all. Through Cloudflare it's a clean HTTP 522 (CF “connection timed out to origin”), and probing the origin directly it's “timeout, no headers received.” In other words the request never reaches a Railway origin process to be assigned a request id: the edge at europe-west4-drams3a simply doesn't respond for 1–6 minutes, then recovers. That absence is consistent with an edge/proxy-layer fault rather than anything in my container (which stays up with 40h+ uptime and keeps processing background jobs throughout).

I've now got a capture running against both origins that logs full response headers continuously, so the next time a window hits I can give you the exact X-Railway-Edge/X-Railway-Request-Id (or confirm their absence) plus precise timestamps. I can also share project IDs, service IDs, and deployment IDs privately for any of the windows I listed; just let me know the best channel.
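For anyone wanting to run a similar capture, here is a rough sketch of one probe plus a classifier for the signals discussed in this thread. It is not the poster's actual tooling: the function names, the labels, and the 10-second timeout are assumptions, and it uses only the Python standard library.

```python
import urllib.error
import urllib.request
from datetime import datetime, timezone

def classify(status, headers):
    """Label one probe result.

    status  -- HTTP status code, or None if no response headers arrived
    headers -- dict of response headers with lower-cased names
    """
    if status is None:
        return "no-headers"        # timed out before any response headers
    if status == 522:
        return "cloudflare-522"    # Cloudflare could not connect to the origin
    has_id = "x-railway-request-id" in headers
    if status >= 500:
        return "5xx-with-request-id" if has_id else "5xx-without-request-id"
    return "ok"

def probe(url, timeout=10.0):
    """Fetch url once and return (utc_timestamp, status, headers)."""
    ts = datetime.now(timezone.utc).isoformat(timespec="seconds")
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status, raw = resp.status, resp.headers
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry headers, so keep them.
        status, raw = err.code, err.headers
    except OSError:
        # Timeout, connection refused, DNS failure: nothing came back.
        return ts, None, {}
    return ts, status, {k.lower(): v for k, v in raw.items()}
```

Calling `probe(url)` on a fixed interval and logging the timestamp, the `classify(...)` label, and the `x-railway-edge` / `x-railway-request-id` values (when present) would produce the three items support asked for.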


Status changed to Awaiting Railway Response Railway 4 days ago


dizzydes90
EMPLOYEE

4 days ago

Hi Kieron — thanks for the detailed write-up and the header captures, they're exactly what we needed.

We've verified on our side that your services are healthy throughout these windows — no restarts, no redeploys, and your containers keep serving and running cron — which lines up with what you're seeing. The key signal is the one you flagged: the failed requests carry no X-Railway-Request-Id at all, so they aren't reaching a proxy/origin process to be assigned one. Combined with your direct-to-origin probe also timing out (bypassing Cloudflare), this points squarely at our edge layer rather than your apps or your CDN.

This matches a class of edge-connectivity issue we're already actively investigating — brief windows where new connections to the edge fail to establish (surfacing as a Cloudflare 522 / "no headers" on a direct probe) before recovering on their own, which is exactly the 1–6 minute, self-clearing pattern you're describing. Because all of your services ingress through the same edge (europe-west4-drams3a), an event there shows up across every project at the same second — so it looks app-independent, and it is.

On your Q2 directly: there's nothing on your side — proxy timeouts, keep-alive, healthcheck config, or replica count — that would make a healthy, running container briefly unreachable at the edge like this. You don't need to change anything.

It would still genuinely help to keep your header-capture running. Next time a window hits, send us:

  • the exact UTC timestamp(s),
  • the X-Railway-Edge value, and
  • confirmation that X-Railway-Request-Id is absent on the failures,

so we can line your events up precisely against our edge telemetry for europe-west4-drams3a. Feel free to drop project/service/deployment IDs here too.

We're tracking this on our end and will follow up as we make progress. Appreciate your patience.


Status changed to Awaiting User Response Railway 4 days ago


4 days ago

thank you, here are some of my project/service IDs, all on europe-west4-drams3a:

Project │ Project ID │ Backend service ID │ Frontend service ID
Essence │ 17c48b90-8be8-44a3-ae4f-34146a0ad7ae │ c91e14e1-b22d-4a58-8bec-000cb98c2dc3 │ 2a18e841-b2c6-46e5-b5fa-79155b626597
Prizes Galore │ 470f1f4a-1984-4a8f-b6cb-d7e51b02e7f4 │ 3a119a92-a3d1-469a-9264-e3745783959c │ 978c8359-4988-482f-a59b-dd525b73235d

Other projects that fail in the same windows (same edge), if it helps you scope blast radius:

mrwish 85666c1e-b374-4cde-974a-1606771e3e34

fullbore 9f57af75-3dda-4c9f-978f-0e840677a77e

lowcost b7614abb-3bc2-4a13-9b5a-ff5685d20806

playatom 61bcf3a8-fd74-4708-bc94-998e47ec4b5a

demo 1aa84591-503c-440e-98a6-1b352ed92d91

Recent windows where every one of these failed within the same ~90s (UTC):

2026-06-07 02:01–02:04

2026-06-07 02:21–02:22

2026-06-08 02:52–02:56

2026-06-08 10:18

2026-06-08 11:27–11:30

Successful-request baseline (so you can see the edge serving normally between windows):

  • X-Railway-Edge: railway/europe-west4-drams3a
  • X-Railway-Request-Id: ySuplncXQeGuKYnJALIuDA (Essence api), 4ceo_qlcRHmhDKdos_GTAg (Essence frontend)

I've got a header capture polling both origins continuously now, so as soon as the next window lands I'll reply with the exact UTC timestamp, the X-Railway-Edge value, and confirmation that X-Railway-Request-Id is absent on the failures. Thanks for chasing it down.


Status changed to Awaiting Railway Response Railway 4 days ago


4 days ago

Following up with the technical data you asked for.

Failed request carrying a Railway request id (the FE container was up and responded, but its server-side call to the backend hung ~5s and returned degraded):

2026-06-08T12:00:23Z  essencecompetitions.co.uk/api/health → HTTP 503
x-railway-request-id: iwqB49YsS9aZINGYjUJq2g
x-railway-edge:       railway/europe-west4-drams3a
x-hikari-trace:       ber1.mmj8

External monitoring incidents — confirmed from all four regions (EU, US, Asia, AU), 30s cadence. Essence (api/frontend services), last ~36h, all via edge europe-west4-drams3a:


Start (UTC)          End       Duration  Result
2026-06-07 00:29:48  00:31:24  (1m36s)   FE 503
2026-06-07 02:02:42  02:06:47  (4m05s)   FE 503
2026-06-07 02:21:07  02:26:32  (5m25s)   FE 503
2026-06-08 02:52:06  02:57:29  (5m23s)   FE 503
2026-06-08 10:18:17  10:19:23  (1m06s)   FE 503
2026-06-08 11:29:11  11:31:55  (2m44s)   FE 503

The api monitor logs the same windows as "Timeout — no headers received" (request hung with no response headers inside a 10s check). Failures are binary — between windows connection is rock-steady (4-region connect ≈1.3ms, TLS ≈24ms, transfer ≈2ms); there's no latency creep before a window, it drops to a total stall and self-clears.
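As an illustration of how per-check results at a fixed cadence turn into windows like the ones above, here is a minimal sketch. It is not how the external monitor computes its incidents (that is not stated here); the function name and the rule "a window ends at the first passing check" are assumptions.

```python
from datetime import datetime, timedelta

def outage_windows(samples):
    """Collapse consecutive failed checks into (start, end, duration) windows.

    samples -- iterable of (datetime, ok) pairs in chronological order,
               where ok is True when the check passed
    """
    windows = []
    start = last = None
    for ts, ok in samples:
        if not ok:
            if start is None:
                start = ts          # first failure opens a window
            last = ts               # remember the latest failure seen
        elif start is not None:
            windows.append((start, last, last - start))
            start = None            # a passing check closes the window
    if start is not None:           # still failing at the end of the data
        windows.append((start, last, last - start))
    return windows

# Hypothetical 30s-cadence samples around one short window.
base = datetime(2026, 6, 8, 10, 17, 47)
step = timedelta(seconds=30)
samples = [(base + i * step, ok) for i, ok in enumerate([True, False, False, False, True])]
windows = outage_windows(samples)
```

Because the window is measured from first to last failed check, its duration is a lower bound: the real outage may have started up to one cadence interval earlier and ended up to one interval later.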

One signature worth highlighting — it looks like egress/hairpin, not just inbound: at 2026-06-08T12:00:23Z a direct external request to the backend origin succeeded (HTTP 200, 191ms) at the same moment the frontend container's outbound request to that same backend (out through Cloudflare → back into the edge) hung 5s and timed out. So in at least some windows, inbound to the origin is fine while a Railway container's outbound call to a sibling service through the edge stalls.

Project / service IDs (all on europe-west4-drams3a):

Essence project 17c48b90-8be8-44a3-ae4f-34146a0ad7ae

  • backend c91e14e1-b22d-4a58-8bec-000cb98c2dc3
  • frontend 2a18e841-b2c6-46e5-b5fa-79155b626597

Prizes Galore project 470f1f4a-1984-4a8f-b6cb-d7e51b02e7f4

  • backend 3a119a92-a3d1-469a-9264-e3745783959c
  • frontend 978c8359-4988-482f-a59b-dd525b73235d

Other projects that fail in the same windows on the same edge:


mrwish 85666c1e-b374-4cde-974a-1606771e3e34

fullbore 9f57af75-3dda-4c9f-978f-0e840677a77e

lowcost b7614abb-3bc2-4a13-9b5a-ff5685d20806

playatom 61bcf3a8-fd74-4708-bc94-998e47ec4b5a

Happy to pull deployment IDs for any specific timestamp if useful.


nealimekenna
PRO

3 days ago

I wanted to chime in that we have been experiencing the exact same issues over the past few days. We noticed our Instatus health monitor throwing more and more downtime notifications without any errors or noticeable traffic on the Railway dashboard. Any health check would fail for between 1 and 4 minutes, giving the following response without any headers attached:

{
  "error": "UNKNOWN_ERROR",
  "message": "ETIMEDOUT"
}

If there is any information I could provide from our side to help speed up fixing this issue, let me know. In the meantime we have relaxed our outage reporting to reduce false positives caused by the proxy/edge failing to receive traffic. The following are some observations from our own investigation:

  • No services/containers/projects report any actual downtime or errors correlating to the perceived outage windows.
  • Windows are typically small, lasting no more than a minute. Longest outlier we had was almost 4 minutes.
  • Outage is reported over multiple projects and containers at once at the exact same time, but never all containers.
  • Outage is usually resolved for all affected containers at the exact same time as well.

These observations are made based on single endpoint health checks automatically made and logged by Instatus.

We also have several private containers that ping health to Instatus via a cron; these never experience any of these outages.

It would be beneficial to have insight through the dashboard into routing and connection traffic relating to our projects if possible.
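The post above does not say how outage reporting was relaxed. One common approach, shown here purely as an assumption, is to alert only after several consecutive failed checks. The function name and the default threshold of 3 are illustrative.

```python
def should_alert(recent_results, threshold=3):
    """Return True once the last `threshold` checks have all failed.

    recent_results -- list of booleans, oldest first; True means the check passed
    threshold      -- number of consecutive failures required before alerting
    """
    if len(recent_results) < threshold:
        return False
    # Alert only if none of the most recent `threshold` checks passed.
    return not any(recent_results[-threshold:])
```

The trade-off is that blips shorter than `threshold` check intervals are suppressed entirely, so this reduces noise at the cost of hiding the short windows that are useful evidence for a report like this one. Logging every failure while alerting only on sustained ones keeps both.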


3 days ago

Just caught one live (window opened 2026-06-08 12:50:06 UTC, edge europe-west4-drams3a). Got a clean same-second comparison this time:


12:50:48  backend  (direct hit)   → 200   conn=15ms tls=30ms ttfb=58ms

12:50:48  frontend (/api/health)  → 503   conn=15ms tls=40ms ttfb=5109ms

Connect + TLS to the edge were instant for both, and the direct hit to the backend was totally fine (~58ms) — but at that exact moment the frontend container's outbound call to that same backend hung 5s and timed out. So it's the egress/hairpin path stalling, not inbound. Failed FE request IDs from this window if you want to trace them:


nvHMCKoVS9-yfu3A8Hk9cQ

zml2E08EQvi9Crd655So4g

OePUXBIWTUOwcl_07fhULg   (hung 7.3s)

Hit prizesgalore, essence, playatom and demo all in the same minute again.
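The reasoning behind that comparison (fast connect and TLS, slow time to first byte) can be written down as a small rule of thumb. This is a sketch only: the function name, the labels, and the 1000 ms threshold are assumptions chosen for illustration, not values from Railway.

```python
def diagnose(conn_ms, tls_ms, ttfb_ms, status, slow_ms=1000):
    """Guess which leg stalled from one probe's phase timings.

    conn_ms, tls_ms, ttfb_ms -- phase timings in milliseconds, or None if
                                that phase never completed
    status                   -- HTTP status code of the response
    slow_ms                  -- illustrative threshold for "slow"
    """
    if conn_ms is None or tls_ms is None:
        return "edge unreachable: connect/TLS never completed"
    if conn_ms >= slow_ms or tls_ms >= slow_ms:
        return "slow handshake with the edge"
    if ttfb_ms is None or ttfb_ms >= slow_ms:
        return "edge reachable, stall after the handshake"
    if status >= 500:
        return "fast error response"
    return "healthy"

# The two same-second samples from the 12:50:48 capture above.
backend = diagnose(15, 30, 58, 200)
frontend = diagnose(15, 40, 5109, 503)
```

Applied to the two samples, the backend hit comes out healthy while the frontend hit lands in the "stall after the handshake" bucket, which matches the reading that the edge accepted the connection and the delay came from the frontend's own outbound call.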


dizzydes90
EMPLOYEE

3 days ago

Kieron — really useful narrowing, thank you. Your 12:50:48 capture is the key signal: a direct hit to the backend returned 200 in 58ms while, at the very same second, the frontend's call to that same backend hung ~5s and returned 503. On our side we can see your frontend reaches the backend over its public Cloudflare-fronted domain, so that call goes out to Cloudflare and back into our edge (a "hairpin") rather than staying inside Railway. The leg that fails during these windows is the return into the europe-west4-drams3a edge — which is exactly why a brief edge blip shows up as your frontend's outbound call stalling, even when direct inbound to the backend is healthy at the same moment.

Two parts to this:

1. Something you can do right now to remove the frontend 503s. For service-to-service calls within the same project (frontend → backend, and any sibling calls), use Railway's private network instead of the public domain — e.g. call http://backend.railway.internal:<port> rather than https://api.essencecompetitions.co.uk. That keeps the traffic entirely inside Railway, never touching Cloudflare or the public edge, so it sidesteps the hairpin completely and should clear those internal 503s even while we work the underlying issue. (It's consistent with what you — and another affected user — are seeing: private/internal pings don't fail during these windows.) Note it's http:// over the private network, and both services need to be in the same project/environment, which yours are.

2. The underlying edge issue. The brief europe-west4-drams3a windows that hit genuine external inbound traffic (the 522 / "no headers" cases) are a platform-side edge problem, not anything on your end. We're actively working on it and tracking it internally with the timestamps, request IDs, and multi-project blast radius you've provided — that detail is exactly what we needed, so please keep the captures coming as new windows land.

Private networking won't change the external-user 522s — that part is on us — but it will take your frontend→backend failures out of the picture immediately. Thanks again for the thorough diagnostics.
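A rough sketch of what that switch can look like in application config, written in Python only for illustration (the same idea applies in Node). The environment variable names `BACKEND_INTERNAL_URL` and `BACKEND_PUBLIC_URL` are hypothetical, and the port in the internal URL is whatever the backend actually listens on.

```python
import os

def backend_base_url(server_side, env=os.environ):
    """Pick the backend base URL for a call.

    server_side -- True for code running inside the frontend container,
                   False for code running in the visitor's browser
    env         -- mapping of environment variables (hypothetical names)
    """
    if server_side and "BACKEND_INTERNAL_URL" in env:
        # e.g. http://backend.railway.internal:<port>
        # Plain http, and only resolvable from inside Railway's private network.
        return env["BACKEND_INTERNAL_URL"]
    # Browser-side calls, or no internal URL configured: use the public domain.
    return env.get("BACKEND_PUBLIC_URL", "https://api.essencecompetitions.co.uk")
```

Only calls made from inside the container can use the internal hostname; anything executed in the browser still has to go through the public domain and therefore still depends on the edge.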


Status changed to Awaiting User Response Railway 3 days ago


3 days ago

Thanks, good call on the internal networking; I will do that. Not sure why I have it set to use the public domain anyway. Some old scenario I can't remember, I'm sure!


Status changed to Awaiting Railway Response Railway 3 days ago


3 days ago

Remembered why I am using the public domain... I have 351 client-side call sites across composables, so I can't always use the internal address, not without a refactor anyway.


protelo
PRO

3 days ago

Wanted to chime in that I'm seeing what I believe to be the same issue, across both US-EAST and US-WEST. Times are in US/Eastern.

Service c12a6666-d31e-4b09-86f7-e5920bc542ac - Behind Cloudflare - US-WEST

Fails with timeout or 522.

image.png

Service 1b349406-996a-492c-9751-c26304afa76d - Direct to Railway - US-EAST

Notice that the direct connection gets 502 or timeout rather than 522

image.png

Service cb92147c-9afd-485a-ad98-81b56625c63b - Behind Cloudflare - US-WEST

image.png

I have several other services. All show similar issues, with timeouts / 522 or 502 occasionally occurring. The same uptime monitoring service also checks some services hosted on AWS and by a 3rd party provider, both have no issues. My uptime monitoring solution is hosted on Railway (US-WEST).

I previously posted about this in another thread on p99 latency issues.


2 days ago

Hey! For anyone else having this issue, can you create your own thread with links to services and any headers/traces you're able to share? Appreciate it!

As for the root cause, we have recently implemented some changes which should be fixing this now. If you're still running into this, can you let me know?


Status changed to Awaiting User Response Railway 2 days ago

