a month ago
Hi,
I am facing a critical issue with my production app where the backend becomes unresponsive after a period of time, requiring a manual redeploy to fix.
The Stack:
Frontend: Next.js
Backend: Python
Database: Postgres
The Symptoms:
The app (summachat.com) works fine after a fresh deploy.
After some time (hours/days), the frontend gets stuck on a loading state.
The backend service shows as "Online" in the Railway dashboard (no crash reported).
The Fix: If I manually redeploy the exact same backend commit, the app immediately starts working again.
Project ID: 5f090606-c2e6-456d-b71e-9fd698cf176b
Could someone please check the service metrics/health to see why it hangs without crashing?
Thanks!
6 Replies
a month ago
This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.
Status changed to Open Railway • 29 days ago
Status changed to Solved shahin-behzadrad • 29 days ago
Status changed to Open ray-chen • 29 days ago
0x5b62656e5d
Are you using serverless?
a month ago
No, the app runs as containerized services on Railway (Next.js frontend + FastAPI backend). They’re long-running processes, not serverless functions.
a month ago
Not a serverless function.
I meant this:
Attachments
a month ago
No, it's not serverless.
Attachments
a month ago
Thank you so much for the detailed breakdown, this was incredibly helpful!
I've implemented all of your suggestions:
Health check — Updated my /health endpoint to actually test DB connectivity with SELECT 1 instead of just returning {"status": "ok"}. Configured it as the healthcheck path in Railway.
Gunicorn with async workers — Switched from raw uvicorn to gunicorn -w 4 -k uvicorn.workers.UvicornWorker --timeout 90. The worker timeout alone should prevent the silent hang: if a worker gets stuck, Gunicorn will kill and restart it automatically.
Connection pool tuning — Reduced from pool_size=50 + max_overflow=50 (100 total) down to pool_size=5 + max_overflow=10 per worker. With 4 gunicorn workers, the old config could have opened up to 400 connections against Railway Postgres — almost certainly the root cause of the exhaustion.
Pool recycle — Reduced from 30 minutes to 5 minutes since Railway Postgres can drop idle connections sooner.
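The worst-case connection math behind the pool change, assuming 4 Gunicorn workers each holding its own independent pool:

```python
workers = 4  # gunicorn -w 4

# Old config: pool_size=50 + max_overflow=50 per worker
old_max_connections = workers * (50 + 50)

# New config: pool_size=5 + max_overflow=10 per worker
new_max_connections = workers * (5 + 10)

print(old_max_connections)  # 400
print(new_max_connections)  # 60
```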
Deploying to staging first to validate, then rolling out to production. Really appreciate the thorough response, saved me a lot of debugging time!
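For anyone landing on this thread later, the health-check and pool changes together look roughly like this. This is a minimal sketch, not the poster's actual code: it assumes SQLAlchemy with a synchronous engine and a FastAPI app, DATABASE_URL is a placeholder environment variable, and pool_pre_ping is an extra setting not mentioned in the thread that pairs well with pool_recycle.

```python
import os

from fastapi import FastAPI, Response
from sqlalchemy import create_engine, text

# Placeholder: read the Postgres URL from the environment.
DATABASE_URL = os.environ["DATABASE_URL"]

# Per-worker pool: with 4 gunicorn workers this caps out at
# 4 * (5 + 10) = 60 connections instead of the old 400.
engine = create_engine(
    DATABASE_URL,
    pool_size=5,
    max_overflow=10,
    pool_recycle=300,    # recycle connections after 5 minutes
    pool_pre_ping=True,  # extra: test connections before reuse (assumption, not in the thread)
)

app = FastAPI()

@app.get("/health")
def health(response: Response):
    # Actually touch the database instead of returning a static payload,
    # so Railway's healthcheck fails when the pool is wedged.
    try:
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))
    except Exception:
        response.status_code = 503
        return {"status": "unhealthy"}
    return {"status": "ok"}
```

With this wired up as the Railway healthcheck path, a wedged connection pool surfaces as a failing health probe instead of a service that looks "Online" while hanging.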
Status changed to Open brody • 27 days ago