Private networking broken between services after Postgres crash recovery
liamdefty
PRO OP

13 days ago

Subject: Private networking broken between services after Postgres restart

Our PostgreSQL service went down and after bringing it back up, the private network connection between our app service and Postgres service is broken. Our production app is currently down and we are unable to deploy a fix because the build queue is backed up.

Details:

  • Project ID: 9293b7ee-f9f4-4a66-a22e-ca6e96a32fa4

  • Environment ID: 61f5a445-b1d4-4ede-acdf-43d62490813d

  • App service ID: a2aeffdd-13a6-496c-8102-6ef0ae870f8d

  • The app cannot reach Postgres at its private network address using the template variable

  • The database is healthy and reachable via the public url

  • We have restarted both the app and Postgres services — private networking still does not recover

  • We need to switch DATABASE_URL to the public proxy as a workaround, but the deploy is stuck in the build queue

Could you either restore private networking between these services or prioritise our deploy in the build queue so we can get back online?

$20 Bounty

5 Replies

liamdefty
PRO OP

13 days ago

Once this issue is resolved I'll need to understand exactly why this happened: why didn't the database reconnect the first time, and why did it cause private networking to fail between these services? Was it something we've misconfigured?

I'm a little concerned about using this platform and will have to have a difficult conversation with stakeholders after this incident, but I hope to keep using it as it makes my life so much easier.


liamdefty
PRO OP

13 days ago

Update: the build queue has cleared and we're back online with the temporary workaround using the public database URL.

Any help getting to the bottom of this would be appreciated.


Railway
BOT

13 days ago

This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.

Status changed to Open by Railway 13 days ago


liamdefty
PRO OP

13 days ago

Update: I'm looking at implementing this in our production environment: https://github.com/railwayapp-templates/postgres-ha

Anyone else have any experience?


liamdefty
PRO OP

13 days ago

Thanks so much for your help, really appreciate the thorough explanation. I've gone ahead and implemented a more robust solution: an adaptive connection pool that automatically falls back to the public database URL if the private network is unreachable, and periodically re-probes the private URL to switch back when it recovers. This ensures the app self-heals without needing a redeploy.

However, even after the database has fully restarted and the app service has been restarted, the private network URL still isn't working — the app is consistently falling back to the public URL. The fallback is great to have in place for when this happens, but we don't want to be running through the public proxy long term for obvious reasons (cost and latency).

Does it sound like this specific case needs someone at Railway to take a look? It seems like the private network route between these two services just isn't recovering on its own, even with fresh restarts on both sides.
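For anyone curious what such a fallback pool can look like, here's a minimal Python sketch of the idea described above. All names (`AdaptivePool`, `reprobe_seconds`, etc.) are illustrative, not from the actual implementation, and the connector is injected so the logic is database-agnostic (in production it would be something like `psycopg2.connect`):

```python
import time


class AdaptivePool:
    """Prefer the private URL; fall back to the public URL when the private
    network is unreachable, and periodically re-probe the private URL so we
    switch back once it recovers. Illustrative sketch, not the OP's code."""

    def __init__(self, private_url, public_url, connect, reprobe_seconds=30):
        self.private_url = private_url
        self.public_url = public_url
        self.connect = connect          # injected, e.g. psycopg2.connect
        self.reprobe_seconds = reprobe_seconds
        self.using_public = False
        self.last_probe = 0.0

    def get_connection(self):
        now = time.monotonic()
        if self.using_public:
            # While on the public fallback, retry the private URL at most
            # once per re-probe interval.
            if now - self.last_probe >= self.reprobe_seconds:
                self.last_probe = now
                try:
                    conn = self.connect(self.private_url)
                    self.using_public = False  # private network recovered
                    return conn
                except OSError:
                    pass
            return self.connect(self.public_url)
        try:
            return self.connect(self.private_url)
        except OSError:
            # Private network unreachable: fall back and start probing.
            self.using_public = True
            self.last_probe = now
            return self.connect(self.public_url)
```

The key design point is that the fallback state lives in the pool, so recovery needs no redeploy: the next `get_connection()` after the re-probe interval tries the private URL again on its own.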


pavankumar2812
FREE

9 days ago

One possibility is stale DNS resolution for the private service hostname after the Postgres container restarted.

On Railway, private networking between services relies on internal DNS. When a service crashes or is redeployed, the underlying container IP can change. If the application runtime or connection pool cached the previous IP, it will keep attempting to connect to the old address even though the database is healthy again.

This would explain why:

• the database works via the public proxy

• restarting services did not immediately restore connectivity

• switching DATABASE_URL to the public endpoint worked

Some runtimes (Node, Java, Go connection pools, etc.) cache DNS results longer than expected, especially inside long-lived connection pools.

You can confirm this by resolving the hostname from inside the app container:

getent hosts <postgres-private-host>

If the IP differs from what the app originally connected to, the issue is likely stale DNS.

Using a pool that periodically refreshes DNS or setting a lower DNS TTL usually prevents this after container restarts.
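To act on this from application code, a fresh lookup (the programmatic equivalent of the `getent hosts` check above) can be compared against the address a long-lived pool last connected to. A sketch in Python; the port and the helper name are my assumptions, and Railway private hostnames typically resolve to IPv6 addresses on the internal network:

```python
import socket


def resolve_host(hostname, port=5432):
    """Freshly resolve a hostname to its current set of IP addresses,
    bypassing any application-level cache. If the result differs from the
    address the pool originally connected to, the pool is holding a stale
    DNS entry and its connections should be recycled."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address string.
    return sorted({info[4][0] for info in infos})
```

A pool health check could call this on each re-probe and drop idle connections whenever the address set changes.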

