Private networking broken between services after Postgres crash recovery
liamdefty
PRO OP

13 days ago

Subject: Private networking broken between services after Postgres restart

Our PostgreSQL service went down and after bringing it back up, the private network connection between our app service and Postgres service is broken. Our production app is currently down and we are unable to deploy a fix because the build queue is backed up.

Details:

  • Project ID: 9293b7ee-f9f4-4a66-a22e-ca6e96a32fa4

  • Environment ID: 61f5a445-b1d4-4ede-acdf-43d62490813d

  • App service ID: a2aeffdd-13a6-496c-8102-6ef0ae870f8d

  • The app cannot reach Postgres at its private network address using the template variable

  • The database is healthy and reachable via the public url

  • We have restarted both the app and Postgres services — private networking still does not recover

  • We need to switch DATABASE_URL to the public proxy as a workaround, but the deploy is stuck in the build queue

Could you either restore private networking between these services or prioritise our deploy in the build queue so we can get back online?

$20 Bounty

5 Replies

liamdefty
PRO OP

13 days ago

Once this issue is resolved I'll need to understand exactly why this happened: why didn't the database reconnect the first time, and why did it cause private networking to fail between these services? Was it something we've misconfigured?

I'm a little concerned about using this platform and will have to have a difficult conversation with stakeholders after this incident, but I hope to keep using it as it makes my life so much easier.


liamdefty
PRO OP

13 days ago

Update: the build queue has cleared and we're back online with the temporary workaround using the public database URL.

Any help getting to the bottom of this would be appreciated.


Railway
BOT

13 days ago

This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.

Status changed to Open by Railway 13 days ago


liamdefty
PRO OP

13 days ago

Update: I'm looking at implementing this in our production environment: https://github.com/railwayapp-templates/postgres-ha

Anyone else have any experience?


liamdefty
PRO OP

13 days ago

Thanks so much for your help, really appreciate the thorough explanation. I've gone ahead and implemented a more robust solution: an adaptive connection pool that automatically falls back to the public database URL if the private network is unreachable, and periodically re-probes the private URL to switch back when it recovers. This ensures the app self-heals without needing a redeploy.

However, even after the database has fully restarted and the app service has been restarted, the private network URL still isn't working — the app is consistently falling back to the public URL. The fallback is great to have in place for when this happens, but we don't want to be running through the public proxy long term for obvious reasons (cost and latency).

Does it sound like this specific case needs someone at Railway to take a look? It seems like the private network route between these two services just isn't recovering on its own, even with fresh restarts on both sides.
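For anyone curious what such a fallback pool can look like, here's a minimal Python sketch of the idea described above. All names (`AdaptivePool`, `reprobe_seconds`, etc.) are illustrative, not from the actual implementation, and the connector is injected so the logic is database-agnostic (in production it would be something like `psycopg2.connect`):

```python
import time


class AdaptivePool:
    """Prefer the private URL; fall back to the public URL when the private
    network is unreachable, and periodically re-probe the private URL so we
    switch back once it recovers. Illustrative sketch, not the OP's code."""

    def __init__(self, private_url, public_url, connect, reprobe_seconds=30):
        self.private_url = private_url
        self.public_url = public_url
        self.connect = connect          # injected, e.g. psycopg2.connect
        self.reprobe_seconds = reprobe_seconds
        self.using_public = False
        self.last_probe = 0.0

    def get_connection(self):
        now = time.monotonic()
        if self.using_public:
            # While on the public fallback, retry the private URL at most
            # once per re-probe interval.
            if now - self.last_probe >= self.reprobe_seconds:
                self.last_probe = now
                try:
                    conn = self.connect(self.private_url)
                    self.using_public = False  # private network recovered
                    return conn
                except OSError:
                    pass
            return self.connect(self.public_url)
        try:
            return self.connect(self.private_url)
        except OSError:
            # Private network unreachable: fall back and start probing.
            self.using_public = True
            self.last_probe = now
            return self.connect(self.public_url)
```

The key design point is that the fallback state lives in the pool, so recovery needs no redeploy: the next `get_connection()` after the re-probe interval tries the private URL again on its own.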


pavankumar2812
FREE

9 days ago

One possibility is stale DNS resolution for the private service hostname after the Postgres container restarted.

On Railway, private networking between services relies on internal DNS. When a service crashes or is redeployed, the underlying container IP can change. If the application runtime or connection pool cached the previous IP, it will keep attempting to connect to the old address even though the database is healthy again.

This would explain why:

• the database works via the public proxy

• restarting services did not immediately restore connectivity

• switching DATABASE_URL to the public endpoint worked

Some runtimes (Node, Java, Go connection pools, etc.) cache DNS results longer than expected, especially inside long-lived connection pools.

You can confirm this by resolving the hostname from inside the app container:

getent hosts <postgres-private-host>

If the IP differs from what the app originally connected to, the issue is likely stale DNS.

Using a pool that periodically refreshes DNS or setting a lower DNS TTL usually prevents this after container restarts.
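To act on this from application code, a fresh lookup (the programmatic equivalent of the `getent hosts` check above) can be compared against the address a long-lived pool last connected to. A sketch in Python; the port and the helper name are my assumptions, and Railway private hostnames typically resolve to IPv6 addresses on the internal network:

```python
import socket


def resolve_host(hostname, port=5432):
    """Freshly resolve a hostname to its current set of IP addresses,
    bypassing any application-level cache. If the result differs from the
    address the pool originally connected to, the pool is holding a stale
    DNS entry and its connections should be recycled."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address string.
    return sorted({info[4][0] for info in infos})
```

A pool health check could call this on each re-probe and drop idle connections whenever the address set changes.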

