a month ago
Edge routing degraded for healthy container — pr-agency-v2 (worthy-reflection)
Project: worthy-reflection
Service: pr-agency-v2 (production environment)
Custom domain: pairpr.ai
Railway-issued domain: pr-agency-v2-production.up.railway.app
Latest deployment (Active): commit 19a2120b (May 20, 06:32 AM CDT)
Summary
Our service has been unreachable for ~15 hours, despite the container being healthy at
the application layer. Symptoms have shifted across multiple states, all pointing to
edge/internal-network degradation specific to this service or project — not to our
code. Railway's public status page currently reports "Fully Operational" for Edge
Network, but our reality contradicts that.
Timeline (UTC)
- May 19 22:29 UTC — Original platform incident began (acknowledged on status page).
- May 19 23:20 UTC — Successful deploy of commit dc3fda3; container healthy at boot but
pairpr.ai began returning HTTP 502 with response header x-railway-fallback: true.
- May 20 ~11:00 UTC — TCP proxy for Postgres (junction.proxy.rlwy.net:31568) returns
"server closed the connection unexpectedly" on every external connection attempt.
Internal Postgres works (migrations ran successfully from inside the container).
- May 20 11:43 UTC — Redeployed after setting RATELIMIT_STORAGE_URI=memory:// to bypass
Redis dependency (Flask-Limiter was blocking worker boot). Deploy logs confirm
gunicorn listening on 0.0.0.0:8080, scheduler started with 55 jobs, workers booted, no
crash loop.
- May 20 ~12:00 UTC — HTTP Logs (Deploy tab) show traffic IS reaching the container but
every request returns HTTP 499 with ~9s response time. One request hung 8m21s. This
pattern indicates worker hangs, likely on pool_pre_ping DB roundtrips to
postgres.railway.internal (suggesting internal networking is slow or degraded).
- May 20 ~12:40 UTC — Edge behavior changed again: TLS handshake on pairpr.ai succeeds
but no HTTP response follows — connections hang indefinitely until client timeout. No
x-railway-fallback header anymore. The auto-generated
pr-agency-v2-production.up.railway.app URL shows the same behavior.
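For anyone trying to reproduce the states in the timeline above, a small probe can separate "connected but no HTTP response" from "got a 502 with the fallback header". This is only a sketch using the Python standard library; the function name `probe` and the shape of the returned dict are my own choices, not anything Railway provides.

```python
import http.client
import socket
import time
from urllib.parse import urlsplit


def probe(url, timeout=10.0):
    """Send one GET and report the response, or the stage at which it stalled."""
    parts = urlsplit(url)
    conn_cls = (
        http.client.HTTPSConnection
        if parts.scheme == "https"
        else http.client.HTTPConnection
    )
    conn = conn_cls(parts.hostname, parts.port, timeout=timeout)
    started = time.monotonic()
    stage = "connect"  # TCP connect, plus the TLS handshake for https URLs
    try:
        conn.connect()
        stage = "response"  # connected; now waiting for the HTTP response
        conn.request("GET", parts.path or "/")
        resp = conn.getresponse()
        resp.read()
        fallback = (resp.getheader("x-railway-fallback") or "").lower() == "true"
        return {
            "outcome": "response",
            "status": resp.status,
            "fallback": fallback,
            "seconds": time.monotonic() - started,
        }
    except socket.timeout:
        # Timed out either while connecting or while waiting for a reply.
        return {
            "outcome": "timeout",
            "stage": stage,
            "seconds": time.monotonic() - started,
        }
    except (OSError, http.client.HTTPException) as exc:
        return {"outcome": "error", "stage": stage, "detail": repr(exc)}
    finally:
        conn.close()
```

Calling `probe("https://pairpr.ai/")` and `probe("https://pr-agency-v2-production.up.railway.app/")` side by side would show whether both domains stall at the same stage. A timeout with stage "response" corresponds to the May 20 ~12:40 UTC state, where the handshake succeeds but nothing follows.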
Confirmed-healthy evidence (container side)
- Deploy logs show: [INFO] Starting gunicorn 26.0.0, [INFO] Listening at:
http://0.0.0.0:8080, [INFO] Booting worker with pid: 6/7, [Scheduler] Started with 55
jobs.
- Networking config in Service Settings: pairpr.ai → Port 8080 (matches gunicorn bind).
- flask db upgrade completes successfully — migration tree is single-head.
- No exceptions or crash messages after worker boot.
Things we've tried
- Multiple redeploys
- Restart via dashboard
- Cleared stale alembic_version orphan rows via dashboard Data tab
- Added RATELIMIT_STORAGE_URI=memory:// to bypass Redis
- Verified port match between gunicorn bind and networking config
What we need
Please investigate (1) why our edge is not forwarding HTTP responses to clients despite
a healthy container, and (2) whether our service's internal Postgres + Redis
networking is degraded relative to the rest of the platform (which would explain the
9-second hangs we saw in HTTP Logs).
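To put a number on question (2), one option is to time a bare TCP connect to the internal hosts from inside the container. The sketch below uses only the standard library and measures DNS resolution plus the TCP handshake, not a full Postgres or Redis round trip, so it is a rough signal at best. The helper name `tcp_connect_seconds` is hypothetical, and `redis.railway.internal:6379` in the usage note is an assumed hostname and port (only `postgres.railway.internal` appears in this post).

```python
import socket
import time


def tcp_connect_seconds(host, port, timeout=5.0):
    """Return seconds taken to open a TCP connection, or None on failure."""
    started = time.monotonic()
    try:
        # create_connection resolves the host and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - started
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return None
```

Running `tcp_connect_seconds("postgres.railway.internal", 5432)` a few times from a shell in the container (and the same against the assumed Redis host) would show whether connects take milliseconds or seconds. A result of None means the connect failed or exceeded the timeout.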
Project ID and service ID available on request.
Urgency: Production outage during active demo prep. Multiple paying client workflows
blocked.
5 Replies
a month ago
Try removing your custom domain then add it back. Update DNS records if necessary.
0x5b62656e5d
a month ago
Unfortunately this did not work.
pragencypro
a month ago
- Break the DB connection hang (most critical)
Your workers are blocking on pool_pre_ping against a degraded internal Postgres. Add this env var in Railway dashboard → Service → Variables:
SQLALCHEMY_ENGINE_OPTIONS={"pool_pre_ping": false, "pool_recycle": 300, "connect_args": {"connect_timeout": 5}}
Note that Flask-SQLAlchemy reads SQLALCHEMY_ENGINE_OPTIONS from app.config, not from the environment, so this only takes effect if your app loads the variable into its config. If you configure the engine directly in code, change it to:
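from sqlalchemy import create_engine

engine = create_engine(
    DATABASE_URL,
    pool_pre_ping=False,  # turn pre-ping off; it is what hangs on a stalled network
    pool_recycle=300,  # recycle connections older than 5 minutes
    pool_size=2,  # reduce: at most 2 connections per gunicorn sync worker
    max_overflow=0,
    connect_args={"connect_timeout": 5},  # hard timeout so connection attempts don't block
)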
The connect_timeout=5 is the key: without it, a connection attempt over a stalled internal network can block a worker indefinitely (you saw one request hang for 8m21s).
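If you go the env var route, the app has to parse the JSON itself, since Flask-SQLAlchemy looks for engine options under the app config key SQLALCHEMY_ENGINE_OPTIONS rather than in the environment. A minimal sketch of that loading step follows; the helper name `load_engine_options` and the fallback defaults are my assumptions, chosen to mirror the values above.

```python
import json
import os


def load_engine_options(environ=os.environ, name="SQLALCHEMY_ENGINE_OPTIONS"):
    """Parse engine options from a JSON env var, with conservative defaults."""
    raw = environ.get(name)
    if not raw:
        # Defaults mirror the settings suggested in this reply.
        return {
            "pool_pre_ping": False,
            "pool_recycle": 300,
            "connect_args": {"connect_timeout": 5},
        }
    options = json.loads(raw)
    if not isinstance(options, dict):
        raise ValueError(name + " must be a JSON object")
    return options
```

In a Flask app this would be wired in before the database extension is initialised, for example `app.config["SQLALCHEMY_ENGINE_OPTIONS"] = load_engine_options()`. Invalid JSON raises at boot, which is arguably better than silently running with the old pool settings.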
- Verify you're using the private Postgres host
In your env vars, confirm DATABASE_URL uses postgres.railway.internal — not the TCP proxy (junction.proxy.rlwy.net:31568). The external TCP proxy is already broken per your timeline. The internal hostname should work since your migrations ran fine.
DATABASE_URL=postgresql://postgres:@postgres.railway.internal:5432/railway
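A quick way to check this from inside the container is to parse the URL and look at the hostname. The snippet below is a sketch using the standard library; `uses_private_host` is a hypothetical helper name, and the `.railway.internal` suffix is taken from the hostname mentioned in this thread.

```python
from urllib.parse import urlsplit


def uses_private_host(database_url, suffix=".railway.internal"):
    """Return True if the URL's hostname ends with the private-network suffix."""
    host = urlsplit(database_url).hostname or ""
    return host.endswith(suffix)
```

Something like `uses_private_host(os.environ["DATABASE_URL"])` at startup (after `import os`) would flag a deployment that still points at the TCP proxy. Passwords containing unescaped special characters can confuse URL parsing, so treat a surprising False as a prompt to inspect the value by hand.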
- Add a fast, DB-free health check endpoint
Railway's edge proxy sends health check requests to the root path / by default; if that path takes too long or doesn't return 2xx quickly, the proxy assumes the app is down and returns 502.
Add this to your Flask app and set it as the health check path in Railway service settings:
@app.route("/healthz")
def healthz():
    return {"status": "ok"}, 200  # NO db query here
Then in Railway: Service Settings → Health Check Path → /healthz
- Reduce gunicorn workers and add a hard timeout
With only 2 sync workers and a stalled DB, your entire capacity is consumed. Set these env vars:
WEB_CONCURRENCY=2
GUNICORN_TIMEOUT=30
And in your start command:
gunicorn app:app --bind 0.0.0.0:$PORT --workers 2 --timeout 30 --worker-class sync
The --timeout 30 makes gunicorn's master process kill and restart any worker that has been silent for more than 30 seconds, rather than letting it hang forever.
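One caveat on the env vars: as far as I know gunicorn picks up WEB_CONCURRENCY and PORT on its own, but not a variable named GUNICORN_TIMEOUT, so that value needs to be wired in somewhere. A gunicorn.conf.py in the working directory is one way to do it, since gunicorn config files are plain Python. The sketch below assumes that file name and location; `int_env` is a hypothetical helper, and explicit command-line flags still take precedence over anything set here.

```python
import os


def int_env(name, default, environ=os.environ):
    """Read an integer setting from the environment, falling back to a default."""
    try:
        return int(environ.get(name, ""))
    except ValueError:
        return default


# Standard gunicorn settings; the names below are read by gunicorn itself.
bind = "0.0.0.0:" + os.environ.get("PORT", "8080")
workers = int_env("WEB_CONCURRENCY", 2)
timeout = int_env("GUNICORN_TIMEOUT", 30)  # seconds of silence before a worker is killed
worker_class = "sync"
```

With this file in place the start command could shrink to `gunicorn app:app`, and changing the timeout becomes a variable edit in the dashboard rather than a change to the command.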
- Force a clean redeploy after the above changes
After setting all env vars, trigger a new deploy. The previous deployment's workers are in a hung state that a restart alone won't fix if the DB connection pool is still poisoned.
Hope it helps.
a month ago
Thankfully we are back online now. This helped a ton directionally. Happy you all are back online.
pragencypro
a month ago
Glad to hear it. Please mark the bounty as solved if your issue is resolved.
Cheers
Status changed to Open brody • 28 days ago
Status changed to Solved pragencypro • 27 days ago