Intermittent 503s on frontend public domain for proxied /api/v1 routes

kyhoon

HOBBY · OP

3 months ago

We’re seeing intermittent 503 responses on our dev frontend public domain when hitting proxied API routes.

Project: planar

Environment: dev

Affected service: frontend/web

Symptom:

Requests to the frontend public domain sometimes return 503, especially:

- /api/v1/auth/providers

This breaks our login page because it occasionally cannot load auth providers and shows:

“Authentication providers are currently unavailable.”

What we verified:

- Direct backend requests for the same route are healthy.

- From inside the frontend container, the same route is healthy.

- The issue appears only on the public frontend domain path.
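One way to make the comparison above repeatable is to send the same request many times against each path and tally the status codes. This is only a rough sketch: `tallyStatuses` is a hypothetical helper, and the sender is passed in so that the public frontend domain, the backend, or an in-container URL can each be probed the same way.

```typescript
// Hypothetical probe: calls `send` repeatedly and counts how often each
// HTTP status code comes back. Nothing here is Railway-specific.
async function tallyStatuses(
  send: () => Promise<{ status: number }>,
  attempts: number,
): Promise<Map<number, number>> {
  const counts = new Map<number, number>();
  for (let i = 0; i < attempts; i++) {
    const { status } = await send();
    counts.set(status, (counts.get(status) ?? 0) + 1);
  }
  return counts;
}
```

For example, `tallyStatuses(() => fetch(url), 50)` with `url` set to the frontend's public /api/v1/auth/providers address would show roughly how often a 503 appears on that path compared with the others.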

Example request IDs:

- xzEF6wHcTn6CM9fLxtoGcA

- YVvvt6unSY646OsdwoOzXw

- JcqX8OkeTKSLxTDSxtoGcA

Example timestamps (UTC):

- 2026-03-26 09:18:48

- 2026-03-26 09:29:09

- 2026-03-26 09:30:10

Could you investigate whether there is an edge / routing issue affecting the frontend public domain in this environment?

$10 Bounty

2 Replies

Status changed to Open Railway • 3 months ago

andreahlert

PRO

3 months ago

Hi Kyhoon, thanks for reaching out with all the details.

The 503s only happening on the public domain while direct backend and internal container requests work fine points to an issue with how the frontend proxies API requests to the backend over Railway's private network.

The most likely cause is intermittent DNS resolution failure for the internal service URL (e.g. *.railway.internal). This is a known pattern on Railway's private networking, especially in dev environments.

Another possibility is auto-sleep being enabled on the dev environment, where the backend takes a moment to wake up and the frontend proxy times out before it responds.

A few questions to narrow it down:

- How does the frontend proxy /api/v1/* to the backend? (Next.js rewrites, nginx, custom server?)

- What internal URL is it using? (e.g. http://service.railway.internal:PORT or a Railway variable reference?)

- Is auto-sleep enabled on this dev environment?

- How many replicas are running for the frontend service?

If it is the internal DNS issue, a common workaround is adding retry logic to the proxy layer or switching to the public backend URL as the proxy target in dev.

How to do it:

1. Retry logic means configuring your proxy layer to retry the request once or twice before returning a 503 to the client. For example, in nginx you'd add proxy_next_upstream error timeout http_502 http_503 together with proxy_next_upstream_tries to cap the number of attempts.

2. In Next.js rewrites there's no built-in retry, so you'd need a custom server or middleware that catches failed fetch calls and retries them.
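A minimal sketch of the retry idea for a custom route handler, assuming the proxied request is safe to repeat (a GET such as /api/v1/auth/providers). `withRetry` is a hypothetical helper, not a Next.js API, and the status list, retry count, and delay are illustrative defaults.

```typescript
// Hypothetical helper; the name and defaults are illustrative, not a Next.js API.
// Statuses that suggest the upstream was only temporarily unavailable.
const RETRYABLE_STATUSES = new Set([502, 503, 504]);

async function withRetry<T extends { status: number }>(
  send: () => Promise<T>,
  retries = 2,
  delayMs = 200,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      const res = await send();
      // Return unless the status is retryable and attempts remain.
      if (!RETRYABLE_STATUSES.has(res.status) || attempt >= retries) return res;
    } catch (err) {
      // Network-level failures (for example a DNS lookup error) land here.
      if (attempt >= retries) throw err;
    }
    // Linear backoff before the next attempt.
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs * (attempt + 1)));
  }
}
```

Inside a route handler this could wrap the upstream call, e.g. `withRetry(() => fetch(targetUrl))`, where `targetUrl` is whatever backend URL the handler already uses. Requests with side effects (POST, PUT) should generally not be retried this way.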

Switching to the public backend URL means replacing http://backend.railway.internal:PORT with the backend's public Railway domain (e.g. https://backend-dev.up.railway.app) as the proxy target. This bypasses the internal DNS entirely. The tradeoff is slightly higher latency since traffic goes through the public edge instead of staying internal, but for a dev environment that's usually acceptable and eliminates the intermittent DNS failures.
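As a rough illustration of switching the proxy target, the backend origin could be chosen from environment variables so that dev uses the public domain while other environments stay on the private network. All variable names here (PROXY_VIA_PUBLIC, BACKEND_PUBLIC_DOMAIN, BACKEND_PRIVATE_DOMAIN, BACKEND_PORT) are hypothetical; they are assumed to be set on the frontend service, for instance through Railway reference variables pointing at the backend.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical helper: picks the backend origin the proxy should target.
// All variable names are assumptions, not variables Railway provides.
function resolveBackendOrigin(env: Env): string {
  if (env.PROXY_VIA_PUBLIC === "true") {
    if (!env.BACKEND_PUBLIC_DOMAIN) {
      throw new Error("BACKEND_PUBLIC_DOMAIN is not set");
    }
    // Public route: goes through the edge over HTTPS.
    return `https://${env.BACKEND_PUBLIC_DOMAIN}`;
  }
  if (!env.BACKEND_PRIVATE_DOMAIN || !env.BACKEND_PORT) {
    throw new Error("BACKEND_PRIVATE_DOMAIN and BACKEND_PORT must be set");
  }
  // Private route: plain HTTP to the internal hostname and port.
  return `http://${env.BACKEND_PRIVATE_DOMAIN}:${env.BACKEND_PORT}`;
}
```

In a Next.js app this would typically be called as `resolveBackendOrigin(process.env)` when building the rewrite destination or the URL used in the route handler.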

If this workaround doesn't help, I'm happy to dig further. If you have more questions, please let me know.


kyhoon

HOBBY · OP

3 months ago

Thank you andreahlert, for your detailed response!

Yeah, I was also thinking this may be due to DNS failures on the private network. To answer your questions:

How does the frontend proxy /api/v1/* to the backend? (Next.js rewrites, nginx, custom server?)

Next.js rewrites, using a custom route handler

What internal URL is it using? (e.g. http://service.railway.internal:PORT or a Railway variable reference?)

It is using a Railway variable reference (RAILWAY_PUBLIC_DOMAIN, as you suggested) to the backend service.

Is auto-sleep enabled on this dev environment?

No, auto-sleep is not enabled for this frontend deployment. However, there are some other services within this dev environment that have auto-sleep enabled.

How many replicas are running for the frontend service?

I am using a single replica in the US West region

To unblock myself I have regenerated the public domain for this frontend service, and it seems to work at the moment :)

I still want to know what exactly happened here but this is now a low priority, thank you for looking into this!
