a month ago
Hi,
I am facing a critical issue with my production app where the backend becomes unresponsive after a period of time, requiring a manual redeploy to fix.
The Stack:
Frontend: Next.js
Backend: Python
Database: Postgres
The Symptoms:
The app (summachat.com) works fine after a fresh deploy.
After some time (hours/days), the frontend gets stuck on a loading state.
The backend service shows as "Online" in the Railway dashboard (no crash reported).
The Fix: If I manually redeploy the exact same backend commit, the app immediately starts working again.
Project ID: 5f090606-c2e6-456d-b71e-9fd698cf176b
Could someone please check the service metrics/health to see why it hangs without crashing?
Thanks!
6 Replies
a month ago
This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.
Status changed to Open Railway • 29 days ago
Status changed to Solved shahin-behzadrad • 29 days ago
Status changed to Open ray-chen • 29 days ago
0x5b62656e5d
Are you using serverless?
a month ago
No, the app runs as containerized services on Railway (Next.js frontend + FastAPI backend). They’re long-running processes, not serverless functions.
a month ago
Not a serverless function.
I meant this:
Attachments
a month ago
No, it's not serverless.
Attachments
a month ago
Thank you so much for the detailed breakdown, this was incredibly helpful!
I've implemented all of your suggestions:
Health check — Updated my /health endpoint to actually test DB connectivity with SELECT 1 instead of just returning {"status": "ok"}. Configured it as the healthcheck path in Railway.
Gunicorn with async workers — Switched from raw uvicorn to gunicorn -w 4 -k uvicorn.workers.UvicornWorker --timeout 90. The worker timeout alone should prevent the silent hang: if a worker gets stuck, Gunicorn will kill and restart it automatically.
Connection pool tuning — Reduced from pool_size=50 + max_overflow=50 (100 total) down to pool_size=5 + max_overflow=10 per worker. With 4 gunicorn workers, the old config could have opened up to 400 connections against Railway Postgres — almost certainly the root cause of the exhaustion.
Pool recycle — Reduced from 30 minutes to 5 minutes since Railway Postgres can drop idle connections sooner.
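The worst-case connection math behind the pool change, assuming 4 Gunicorn workers each holding its own independent pool:

```python
workers = 4  # gunicorn -w 4

# Old config: pool_size=50 + max_overflow=50 per worker
old_max_connections = workers * (50 + 50)

# New config: pool_size=5 + max_overflow=10 per worker
new_max_connections = workers * (5 + 10)

print(old_max_connections)  # 400
print(new_max_connections)  # 60
```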
Deploying to staging first to validate, then rolling out to production. Really appreciate the thorough response, saved me a lot of debugging time!
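For anyone landing on this thread later, the health-check and pool changes together look roughly like this. This is a minimal sketch, not the poster's actual code: it assumes SQLAlchemy with a synchronous engine and a FastAPI app, DATABASE_URL is a placeholder environment variable, and pool_pre_ping is an extra setting not mentioned in the thread that pairs well with pool_recycle.

```python
import os

from fastapi import FastAPI, Response
from sqlalchemy import create_engine, text

# Placeholder: read the Postgres URL from the environment.
DATABASE_URL = os.environ["DATABASE_URL"]

# Per-worker pool: with 4 gunicorn workers this caps out at
# 4 * (5 + 10) = 60 connections instead of the old 400.
engine = create_engine(
    DATABASE_URL,
    pool_size=5,
    max_overflow=10,
    pool_recycle=300,    # recycle connections after 5 minutes
    pool_pre_ping=True,  # extra: test connections before reuse (assumption, not in the thread)
)

app = FastAPI()

@app.get("/health")
def health(response: Response):
    # Actually touch the database instead of returning a static payload,
    # so Railway's healthcheck fails when the pool is wedged.
    try:
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))
    except Exception:
        response.status_code = 503
        return {"status": "unhealthy"}
    return {"status": "ok"}
```

With this wired up as the Railway healthcheck path, a wedged connection pool surfaces as a failing health probe instead of a service that looks "Online" while hanging.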
Status changed to Open brody • 27 days ago