2 months ago
This summary was suggested by my AI overlord so I didn't have to type everything:
dennis
Environment:
- 4 services communicating via private networking (.railway.internal)
- Environment created late December 2025 (should have dual-stack IPv4/IPv6)
- Services: control-api → dependent-api-1 → dependent-api-2 → dependent-web-app (deployed in dependency order)
Problem:
Services were up and running fine. But when redeploying services in sequence, downstream services intermittently fail to connect to upstream services via their .railway.internal hostnames. The connection eventually succeeds if the service is redeployed after a delay (estimated 5-30 minutes).
Example scenario:
1. Deploy control-api (succeeds, health check passes)
2. Deploy dependent-api-1 (initially fails to reach control-api.railway.internal, then succeeds many minutes later)
3. Continue with remaining services
Observations:
- The issue is transient: connections work after a delay (redeploying the same service later connects successfully)
- Same configuration worked previously; no code changes required to resolve
- Health checks use IPv4 (binding to 0.0.0.0:8080)
- I attempted binding to :: for IPv6 but this broke the health check (health checker couldn't reach the app)
Suspected cause:
Internal DNS propagation delay when a service's container restarts with a new internal IP address.
Questions:
1. Is this DNS propagation delay expected behavior?
2. Is there a recommended deployment methodology for interdependent services to avoid this issue?
3. Is there any configuration we're missing to improve private networking reliability during deployments?
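One rough way to check whether the suspected DNS delay is real is to poll the internal hostname from inside the environment (e.g. via railway ssh) right after a redeploy and log when it becomes resolvable. This is my own sketch, not a Railway tool; the hostname below is a placeholder for one of your services:

```typescript
// Hedged sketch: poll internal DNS to measure how long a redeployed service's
// .railway.internal record takes to become resolvable. The hostname in the
// example at the bottom is hypothetical; substitute your own service's name.
import { lookup } from "node:dns/promises";

export async function resolveOrNull(
  hostname: string,
  family: 4 | 6 = 6, // Railway private networking is IPv6
): Promise<string | null> {
  try {
    const { address } = await lookup(hostname, { family });
    return address;
  } catch {
    return null; // e.g. ENOTFOUND while the record is still propagating
  }
}

export async function probeUntilResolved(
  hostname: string,
  intervalMs = 5000,
): Promise<string> {
  for (;;) {
    const addr = await resolveOrNull(hostname);
    console.log(new Date().toISOString(), hostname, "->", addr ?? "not resolvable yet");
    if (addr !== null) return addr;
    await new Promise((r) => setTimeout(r, intervalMs));
  }
}

// Example (hypothetical hostname):
// probeUntilResolved("control-api.railway.internal").then((a) => console.log("resolved:", a));
```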
4 Replies
8 days ago
I am seeing something similar, I think, but can't fully diagnose it. I don't think it's DNS propagation delay: I tried creating a dummy route to see whether the DNS resolved, and it did.
8 days ago
Yeah this is expected behavior unfortunately. When a service redeploys it gets a new container/IP, and the internal DNS takes a bit to catch up. During that window, downstream services trying to resolve the hostname will either get a stale record or nothing.
For the DNS propagation question specifically — it's not just TTL expiry. Even after the new container is up and healthy, there's a short window before the DNS record is updated across the mesh. That's the 5-30 min gap you're seeing.
A few things that help:
Retry logic in the app is probably the most important one. If your services fail hard on startup when they can't reach an upstream host, they'll just die. If they retry with backoff for a couple minutes, they'll usually pick it up fine once DNS propagates. This alone fixes most of the pain.
Stagger your deploys — wait for the upstream service's healthcheck to pass, then add an extra 30-60s before triggering the downstream deploy. The healthcheck passing doesn't mean DNS has fully propagated yet.
On the IPv6 thing — don't try to bind to ::; the healthchecker is IPv4-only, so you'll break it. Stick with 0.0.0.0.
Also worth noting: when upstream redeploys, any existing TCP connections from downstream to the old container get severed. So even if DNS is fine, you need connection retry/reconnect logic anyway for the ECONNRESET you'll get.
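To make the retry advice concrete, here's a rough backoff sketch. The timings and the `cache: "no-store"` choice are my own assumptions, nothing Railway-specific; tune attempts/base delay so the total retry window covers the propagation gap you're seeing:

```typescript
// Hedged sketch: exponential-backoff retry around fetch() so a downstream
// service rides out the DNS window (and ECONNRESET from severed connections)
// instead of dying on startup. All timings are illustrative.

export function backoffDelays(maxAttempts: number, baseMs: number): number[] {
  // e.g. 1000, 2000, 4000, ...: doubles each attempt
  return Array.from({ length: maxAttempts }, (_, i) => baseMs * 2 ** i);
}

export async function fetchWithRetry(
  url: string,
  maxAttempts = 8,
  baseMs = 1000,
): Promise<Response> {
  let lastError: unknown;
  for (const delay of backoffDelays(maxAttempts, baseMs)) {
    try {
      // no-store: never serve a cached response (or cached failure) on retry
      return await fetch(url, { cache: "no-store" });
    } catch (err) {
      lastError = err; // ENOTFOUND / ECONNREFUSED / ECONNRESET all land here
      await new Promise((r) => setTimeout(r, delay));
    }
  }
  throw lastError;
}
```

With the defaults above the total window is roughly four minutes, which comfortably covers "retry with backoff for a couple minutes."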
7 days ago
My setup:
1. Webapp running a next.js server with bun
2. Some FastAPI server.
Scenario:
- both are asleep.
- I wake the webapp which in turn hits the API and wakes the API.
- The API wakes up quickly enough, ~10s (as tested both by hitting its public endpoint and by trying its private endpoint using railway ssh from the webapp).
- Even though the API wakes up very quickly, the webapp repeatedly tries to hit the API for 3 minutes with no success.
- Weirdly, accessing the public endpoint does work within a few seconds of the API waking up.
3 days ago
Update: found the root cause and fix, in case it's helpful for someone else:
It was NOT a DNS or Railway networking issue. The problem was Next.js’s patched fetch() caching ConnectionRefused errors.
When the webapp’s first fetch() call to the API gets ConnectionRefused (because the API is still waking up), Next.js caches that error by URL. All subsequent retry attempts return the cached error in 0ms without ever making a real TCP connection — even after the API is fully up and serving requests.
The fix has two parts:
1. Pass `cache: "no-store"` to `fetch()` — this bypasses Next.js’s response cache, forcing every retry to make a real network call.
2. Pass a URL string, not a `Request` object — when `fetch()` receives a `Request` object as the first argument, the init’s `cache` flag is silently ignored. You must call `fetch(url_string, { cache: "no-store" })`, not `fetch(new Request(url), { cache: "no-store" })`. Not sure why this is tbh.
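For reference, a minimal sketch of the working call shape (the internal base URL is hypothetical, not from this thread):

```typescript
// Hedged sketch of the fix described above. The base URL is hypothetical.
// Key points: pass a URL *string* plus cache: "no-store", so Next.js's
// patched fetch() cannot serve a cached ConnectionRefused error on retries.

export function buildFetchArgs(
  base: string,
  path: string,
): [string, RequestInit] {
  return [`${base}${path}`, { cache: "no-store" }];
}

export async function callApi(path: string): Promise<Response> {
  // hypothetical internal hostname and port
  const [url, init] = buildFetchArgs("http://api.railway.internal:8000", path);
  return fetch(url, init);
  // Wrong shape: fetch(new Request(url), { cache: "no-store" })
  // (per the post above, the init's cache flag is silently ignored then)
}
```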
We confirmed this by observing that retry attempts were completing in 0ms (cached) while `http.get()` from `node:http` to the same endpoint returned 200 immediately. Adding `cache: "no-store"` with a URL string brought cold-start time from 30s+ of failures down to ~3.4s.
This likely affects anyone running Next.js with server-side fetch() calls to Railway private networking endpoints that may be temporarily unavailable (mostly cold starts I imagine). The default fetch caching behaviour silently prevents retries from working.