Private networking completely broken between services after deployment - connection timeouts despite healthy services
florianpreusner
PRO · OP

4 months ago

Environment

  • Region: europe-west4
  • Runtime: V2
  • Metal Edge: Disabled

Problem

After deploying any service, inter-service communication via *.railway.internal times out. All services are healthy (receiving Railway health checks from 100.64.0.x), but cannot reach each other.

This is NOT DNS caching: DNS returns fresh IPs after redeploys, but connections still time out.

Setup

  • service-a: service-a.railway.internal, port 8000
  • service-b: service-b.railway.internal, port 8001
  • service-c: service-c.railway.internal, port 8080

Stack: Python 3.12, httpx 0.28.1, FastAPI, Uvicorn

All services bind to 0.0.0.0

How We Make Requests

import httpx

transport = httpx.AsyncHTTPTransport(local_address="0.0.0.0")  # Force IPv4

async with httpx.AsyncClient(timeout=10.0, transport=transport) as client:
    response = await client.get("http://service-b.railway.internal:8001/health")
    # ^^^ Times out after 10 seconds

Evidence

service-b logs (healthy, receiving Railway health checks):

INFO: Uvicorn running on http://0.0.0.0:8001

INFO: 100.64.0.2:44241 - "GET /health HTTP/1.1" 200 OK

service-a logs (healthy, but cannot connect out):

INFO: 100.64.0.2:54803 - "GET /health HTTP/1.1" 200 OK <-- Railway check works

ERROR - service-b health check failed: Connection timeout (DNS: {
    'resolved': ['fd12:xxxx:xxxx:x:xxxx:8001', '10.191.97.36:8001'],
    'time_ms': 147.49, 'error': None
})

DNS resolves correctly to both IPv6 and IPv4. IPs update after redeploys. Connections still time out.
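
For anyone wanting to repeat this resolution check from inside a service, a minimal standard-library sketch follows. The helper name resolved_families is hypothetical (it is not part of our codebase or of httpx), and it only reports what the resolver returns and in what order; it does not attempt a connection.

```python
import socket


def resolved_families(host: str, port: int) -> list[tuple[str, str]]:
    """Return (family, address) pairs in the order the resolver yields them."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    names = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return [
        (names.get(family, str(family)), sockaddr[0])
        for family, _type, _proto, _canon, sockaddr in infos
    ]
```

Calling resolved_families("service-b.railway.internal", 8001) inside the private network should list each resolved address with its family, which makes it easy to see whether an IPv6 address comes first.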

What We Tried

  • Redeployed all services → Still broken
  • Disabled Metal Edge → Still broken
  • Verified fresh DNS IPs after redeploy → Still broken
  • Forced IPv4 binding → Still broken
  • Retry with exponential backoff → Still broken
  • Waited 15+ minutes → Still broken

Timeline Pattern

  1. Services work fine
  2. Deploy any service
  3. Inter-service communication breaks immediately
  4. All services show healthy in dashboard
  5. Railway health checks succeed (from 100.64.0.x)
  6. *.railway.internal calls time out indefinitely
  7. No automatic recovery

Key Observation

Railway's health checks from 100.64.0.x reach all services. Only service-to-service communication fails. This suggests a WireGuard mesh routing issue, not a service configuration problem.

Questions

  1. Known issues with private networking in europe-west4?
  2. Way to force-refresh private network routing?
  3. Should we try a different region?

Solved · $20 Bounty

Pinned Solution

The issue is likely caused by a conflict between the IPv6 internal DNS and your forced IPv4 binding in httpx. If DNS returns an IPv6 address and httpx attempts to connect to it from a socket explicitly bound to 0.0.0.0 (IPv4), the connection fails.

Your logs show the IPv6 address is resolved first (['fd12:xxxx:xxxx:x:xxxx:8001', '10.191.97.36:8001']), so httpx defaults to it.
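
The family mismatch can be reproduced at the socket level with the standard library alone. This is only a sketch of the general idea, not httpx's actual connection path: the helper name can_connect_from_ipv4_socket is hypothetical, and the destination used with it is a placeholder.

```python
import socket


def can_connect_from_ipv4_socket(dest_host: str, dest_port: int) -> bool:
    """Try to reach dest from a socket bound to 0.0.0.0 (IPv4 only)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(2.0)
        sock.bind(("0.0.0.0", 0))  # same effect as forcing an IPv4 local address
        try:
            sock.connect((dest_host, dest_port))
        except OSError:
            # Covers address-family errors, refusals, and timeouts alike.
            return False
        return True
```

Passing an IPv6 literal such as "::1" as dest_host should fail regardless of whether anything is listening, because an AF_INET socket cannot address an IPv6 destination.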

The fix:

Stop forcing IPv4. Instead, force or allow IPv6 so httpx can use the IPv6 address returned by DNS: change local_address="0.0.0.0" to local_address="::".

You might also need to update the service binding. Binding to :: usually accepts both IPv6 and IPv4 traffic.

It would be something like this:

uvicorn main:app --host :: --port 8001

2 Replies


florianpreusner
PRO · OP

4 months ago

Confirmed, this was the issue.

Removing the forced IPv4 binding and allowing IPv6 resolved the inter-service communication immediately.

Changing local_address to :: and binding Uvicorn to --host :: fixed the timeouts. Everything is working as expected now.

Thanks for the clear explanation and quick help, @darseen


Status changed to Solved brody 4 months ago

