Private networking completely broken between services after deployment - connection timeouts despite healthy services
florianpreusner
PRO · OP

a month ago

Environment

  • Region: europe-west4

  • Runtime: V2

  • Metal Edge: Disabled

Problem

After deploying any service, inter-service communication via *.railway.internal times out. All services are healthy (receiving Railway health checks from 100.64.0.x), but cannot reach each other.

This is NOT DNS caching - DNS returns fresh IPs after redeploys, but connections still time out.

Setup

  • service-a (service-a.railway.internal, port 8000)

  • service-b (service-b.railway.internal, port 8001)

  • service-c (service-c.railway.internal, port 8080)

Stack: Python 3.12, httpx 0.28.1, FastAPI, Uvicorn
All services bind to 0.0.0.0.

How We Make Requests

import httpx

transport = httpx.AsyncHTTPTransport(local_address="0.0.0.0")  # Force IPv4

async with httpx.AsyncClient(timeout=10.0, transport=transport) as client:
    response = await client.get("http://service-b.railway.internal:8001/health")
    # ^^^ Times out after 10 seconds

Evidence

service-b logs (healthy, receiving Railway health checks):

INFO: Uvicorn running on http://0.0.0.0:8001

INFO: 100.64.0.2:44241 - "GET /health HTTP/1.1" 200 OK

service-a logs (healthy, but cannot connect out):

INFO: 100.64.0.2:54803 - "GET /health HTTP/1.1" 200 OK <-- Railway check works

ERROR - service-b health check failed: Connection timeout (DNS: {

'resolved': ['fd12:xxxx:xxxx:x:xxxx:8001', '10.191.97.36:8001'],

'time_ms': 147.49, 'error': None

})

DNS resolves correctly to both IPv6 and IPv4, and the IPs update after redeploys, yet connections still time out.
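The family mismatch behind this can be reproduced in isolation with the standard library (a minimal sketch; fd12::1 is a placeholder ULA address, not the real Railway one): a socket explicitly bound to 0.0.0.0 is IPv4-only and cannot address an IPv6 destination at all, no matter what DNS returned.

```python
import socket

# Mirror httpx's local_address="0.0.0.0": this creates an IPv4-only socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("0.0.0.0", 0))
try:
    # fd12::1 is a placeholder IPv6 ULA standing in for the
    # address Railway's internal DNS returns
    s.connect(("fd12::1", 8001))
    result = "connected"
except OSError:
    # An AF_INET socket cannot even represent an IPv6 destination
    result = "failed: IPv4 socket cannot reach an IPv6 address"
finally:
    s.close()
print(result)
```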

What We Tried

  • Redeployed all services → Still broken

  • Disabled Metal Edge → Still broken

  • Verified fresh DNS IPs after redeploy → Still broken

  • Forced IPv4 binding → Still broken

  • Retry with exponential backoff → Still broken

  • Waited 15+ minutes → Still broken

Timeline Pattern

  1. Services work fine

  2. Deploy any service

  3. Inter-service communication breaks immediately

  4. All services show healthy in dashboard

  5. Railway health checks succeed (from 100.64.0.x)

  6. *.railway.internal calls time out indefinitely

  7. No automatic recovery

Key Observation

Railway's health checks from 100.64.0.x reach all services; only service-to-service communication fails. This suggests a WireGuard mesh routing issue, not a service configuration problem.

Questions

  1. Known issues with private networking in europe-west4?

  2. Way to force-refresh private network routing?

  3. Should we try a different region?

Solved · $20 Bounty

Pinned Solution

darseen
HOBBY · Top 1% Contributor

a month ago

The issue is likely a conflict between the IPv6 internal DNS and the forced IPv4 binding in httpx. When DNS returns an IPv6 address and httpx attempts to connect to it using a socket explicitly bound to 0.0.0.0 (an IPv4 address), the connection fails.

Your logs show the IPv6 address is resolved first (['fd12:xxxx:xxxx:x:xxxx:8001', '10.191.97.36:8001']), so httpx defaults to it.

The fix:
Stop forcing IPv4. Instead, force or allow IPv6 so httpx can use the IPv6 address returned by DNS: change local_address="0.0.0.0" to local_address="::".

You might also need to update the service binding, since binding to :: usually accepts both IPv6 and IPv4 traffic. It would be something like this: uvicorn main:app --host :: --port 8001
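To sanity-check the corrected binding outside Railway, here is a self-contained sketch with only the standard library (it assumes the host has an IPv6 loopback, ::1): a client socket bound to the IPv6 wildcard :: can reach an IPv6 listener, which is what local_address="::" lets httpx do against the fd12:... addresses.

```python
import socket
import threading

# Throwaway IPv6 listener standing in for service-b
server = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
server.bind(("::1", 0))            # any free port on the IPv6 loopback
server.listen(1)
port = server.getsockname()[1]

def accept_once():
    conn, _ = server.accept()
    conn.close()

threading.Thread(target=accept_once, daemon=True).start()

# Client bound to the IPv6 wildcard, as with local_address="::"
client = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
client.bind(("::", 0))
client.connect(("::1", port))       # succeeds: IPv6 socket, IPv6 destination
print("connected over IPv6")
client.close()
server.close()
```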

2 Replies



florianpreusner
PRO · OP

a month ago

Confirmed, this was the issue.

Removing the forced IPv4 binding and allowing IPv6 resolved the inter-service communication immediately.

Changing local_address to :: and binding Uvicorn to --host :: fixed the timeouts. Everything is working as expected now.

Thanks for the clear explanation and quick help, @darseen 🙏


Status changed to Solved by brody, about 1 month ago

