Edge routing degraded for healthy container — pr-agency-v2 (worthy-reflection)
pragencypro
PROOP

a month ago


Project: worthy-reflection

Service: pr-agency-v2 (production environment)

Custom domain: pairpr.ai

Railway-issued domain: pr-agency-v2-production.up.railway.app

Latest deployment (Active): commit 19a2120b (May 20, 06:32 AM CDT)

Summary

Our service has been unreachable for ~15 hours, despite the container being healthy at the application layer. Symptoms have shifted across multiple states, all pointing to edge/internal-network degradation specific to this service or project, not to our code. Railway's public status page currently reports "Fully Operational" for Edge Network, but our reality contradicts that.

Timeline (UTC)

  • May 19 22:29 UTC: Original platform incident began (acknowledged on status page).
  • May 19 23:20 UTC: Successful deploy of commit dc3fda3; container healthy at boot but pairpr.ai began returning HTTP 502 with response header x-railway-fallback: true.
  • May 20 ~11:00 UTC: TCP proxy for Postgres (junction.proxy.rlwy.net:31568) returns "server closed the connection unexpectedly" on every external connection attempt. Internal Postgres works (migrations ran successfully from inside the container).
  • May 20 11:43 UTC: Redeployed after setting RATELIMIT_STORAGE_URI=memory:// to bypass Redis dependency (Flask-Limiter was blocking worker boot). Deploy logs confirm gunicorn listening on 0.0.0.0:8080, scheduler started with 55 jobs, workers booted, no crash loop.
  • May 20 ~12:00 UTC: HTTP Logs (Deploy tab) show traffic IS reaching the container but every request returns HTTP 499 with ~9s response time. One request hung 8m21s. This pattern indicates worker hangs, likely on pool_pre_ping DB roundtrips to postgres.railway.internal (suggesting internal networking is slow or degraded).
  • May 20 ~12:40 UTC: Edge behavior changed again: TLS handshake on pairpr.ai succeeds but no HTTP response follows; connections hang indefinitely until client timeout. No x-railway-fallback header anymore. The auto-generated pr-agency-v2-production.up.railway.app URL shows the same behavior.
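The client-visible states in this timeline (502 with x-railway-fallback, TLS succeeding with no HTTP response) can be told apart from outside with a small probe. The following is a minimal sketch using only the Python standard library; the function names and state labels are made up for illustration, and the 499s are not covered because they are recorded on the proxy side rather than returned to the client.

```python
import http.client
import socket
import ssl


def classify_response(status, headers):
    """Map one HTTP response to a state label (headers: lowercase keys)."""
    fallback = headers.get("x-railway-fallback", "").lower() == "true"
    if status == 502 and fallback:
        return "edge-fallback"  # the edge answered on the app's behalf
    if status >= 500:
        return "server-error"
    return "ok"


def probe(host, path="/", timeout=10.0):
    """Make one HTTPS request and return a state label."""
    conn = http.client.HTTPSConnection(host, timeout=timeout)
    try:
        conn.connect()  # TCP connect plus TLS handshake
    except (OSError, ssl.SSLError):
        return "connect-or-tls-failed"
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        headers = {k.lower(): v for k, v in resp.getheaders()}
        return classify_response(resp.status, headers)
    except socket.timeout:
        # Handshake succeeded but no response arrived before the timeout.
        return "tls-ok-no-http-response"
    except (OSError, http.client.HTTPException):
        return "connection-dropped"
    finally:
        conn.close()
```

Calling probe("pairpr.ai") and probe("pr-agency-v2-production.up.railway.app") every minute or so would give a timestamped record of when the edge behavior changes.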

Confirmed-healthy evidence (container side)

  • Deploy logs show: [INFO] Starting gunicorn 26.0.0, [INFO] Listening at: http://0.0.0.0:8080, [INFO] Booting worker with pid: 6/7, [Scheduler] Started with 55 jobs.

  • Networking config in Service Settings: pairpr.ai → Port 8080 (matches gunicorn bind).
  • flask db upgrade completes successfully — migration tree is single-head.
  • No exceptions or crash messages after worker boot.

Things we've tried

  1. Multiple redeploys
  2. Restart via dashboard
  3. Cleared stale alembic_version orphan rows via dashboard Data tab
  4. Added RATELIMIT_STORAGE_URI=memory:// to bypass Redis
  5. Verified port match between gunicorn bind and networking config

What we need

Please investigate (1) why our edge is not forwarding HTTP responses to clients despite a healthy container, and (2) whether our service's internal Postgres + Redis networking is degraded relative to the rest of the platform (which would explain the 9-second hangs we saw in HTTP Logs).
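One way to gather evidence for point (2) is to time plain TCP connects to the internal hosts from inside the container. This is a rough sketch using only the standard library; the function names are illustrative, and a fast TCP connect does not by itself prove that queries are fast.

```python
import socket
import time


def time_tcp_connect(host, port, timeout=5.0):
    """Return seconds taken to open a TCP connection, or None on failure."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.perf_counter() - start
    except OSError:
        return None


def summarize(samples):
    """Count failed attempts and report the slowest successful connect."""
    ok = [s for s in samples if s is not None]
    return {"failed": len(samples) - len(ok), "max_s": max(ok) if ok else None}
```

For example, summarize([time_tcp_connect("postgres.railway.internal", 5432) for _ in range(5)]) run from a shell in the service; port 5432 is the Postgres default and is an assumption here, as is whichever hostname and port the Redis service uses.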

Project ID and service ID available on request.

Urgency: Production outage during active demo prep. Multiple paying client workflows blocked.

Solved ($20 Bounty)

5 Replies

Railway
BOT

a month ago

This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.

Status changed to Open Railway 28 days ago



0x5b62656e5d

Try removing your custom domain then add it back. Update DNS records if necessary.

pragencypro
PROOP

a month ago

Unfortunately this did not work.


aayankali
FREE

a month ago

  1. Break the DB connection hang (most critical)

Your workers appear to be blocking on pool_pre_ping against a degraded internal Postgres. Add this env var in Railway dashboard → Service → Variables (it only takes effect if your app loads it into the Flask-SQLAlchemy SQLALCHEMY_ENGINE_OPTIONS config):

SQLALCHEMY_ENGINE_OPTIONS={"pool_pre_ping": false, "pool_recycle": 300, "connect_args": {"connect_timeout": 5}}
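Flask-SQLAlchemy takes its engine options from the SQLALCHEMY_ENGINE_OPTIONS key of the Flask config, not straight from the process environment, so a JSON env var like the one above has to be parsed by the app. A minimal sketch of that, assuming the variable holds a JSON object; the helper name and the fallback defaults are illustrative:

```python
import json
import os

# Fallback used when the variable is unset (values taken from this thread).
DEFAULT_ENGINE_OPTIONS = {
    "pool_pre_ping": False,
    "pool_recycle": 300,
    "connect_args": {"connect_timeout": 5},
}


def engine_options_from_env(environ=os.environ, name="SQLALCHEMY_ENGINE_OPTIONS"):
    """Parse a JSON object from an env var, falling back to the defaults."""
    raw = environ.get(name)
    if not raw:
        return dict(DEFAULT_ENGINE_OPTIONS)
    options = json.loads(raw)  # JSON false/true become Python False/True
    if not isinstance(options, dict):
        raise ValueError(f"{name} must be a JSON object")
    return options


# In the app setup, before the database extension is initialised
# (app is your Flask instance):
# app.config["SQLALCHEMY_ENGINE_OPTIONS"] = engine_options_from_env()
```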

Or if you configure the engine directly in code, change it to:

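    import os

    from sqlalchemy import create_engine

    DATABASE_URL = os.environ["DATABASE_URL"]

    engine = create_engine(
        DATABASE_URL,
        pool_pre_ping=False,  # disable pre-ping; the ping is what hangs
        pool_recycle=300,     # recycle connections older than 5 minutes
        pool_size=2,          # reduce: at most 2 connections per sync worker
        max_overflow=0,       # never open connections beyond pool_size
        connect_args={"connect_timeout": 5},  # give up connecting after 5s
    )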

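The connect_timeout=5 is the key: without it, a connection attempt over a stalled internal network can block a worker for minutes (you saw 8m21s). Note that it only bounds connection setup, not queries on an already-open connection.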

  2. Verify you're using the private Postgres host

In your env vars, confirm DATABASE_URL uses postgres.railway.internal — not the TCP proxy (junction.proxy.rlwy.net:31568). The external TCP proxy is already broken per your timeline. The internal hostname should work since your migrations ran fine.

DATABASE_URL=postgresql://postgres:@postgres.railway.internal:5432/railway
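To check this without eyeballing the variable, the host can be parsed out of DATABASE_URL at startup. A small sketch with the standard library; the helper names are made up, and the .railway.internal suffix is taken from the hostname mentioned in this thread:

```python
from urllib.parse import urlsplit


def database_host(url):
    """Return (hostname, port) parsed from a database URL."""
    parts = urlsplit(url)
    return parts.hostname, parts.port


def uses_private_network(url, suffix=".railway.internal"):
    """True if the URL points at a private-network hostname."""
    host, _ = database_host(url)
    return host is not None and host.endswith(suffix)
```

Logging uses_private_network(os.environ["DATABASE_URL"]) once at boot makes it obvious in the deploy logs if the app is still going through the TCP proxy.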

  3. Add a fast, DB-free health check endpoint

As far as I know, if a health check path is configured, Railway waits for it to return HTTP 200 before switching traffic to a new deployment. A path that touches the database can stall or fail that check while Postgres is degraded, so keep it free of DB calls.

Add this to your Flask app and set it as the health check path in Railway service settings:

    @app.route("/healthz")
    def healthz():
        return {"status": "ok"}, 200  # no DB query here

Then in Railway: Service Settings → Health Check Path → /healthz

  4. Reduce gunicorn workers and add a hard timeout

With only 2 sync workers and a stalled DB, your entire capacity is consumed. Set these env vars:

WEB_CONCURRENCY=2

GUNICORN_TIMEOUT=30

And in your start command:

gunicorn app:app --bind 0.0.0.0:$PORT --workers 2 --timeout 30 --worker-class sync

The --timeout 30 makes the gunicorn master kill and restart any worker that has been silent for more than 30 seconds, rather than letting it hang forever.

  5. Force a clean redeploy after the above changes

After setting all env vars, trigger a new deploy. The previous deployment's workers are in a hung state that a restart alone won't fix if the DB connection pool is still poisoned.

hope it helps


pragencypro
PROOP

a month ago

Thankfully we are back online now. This helped a ton directionally. Happy you all are back online.


aayankali
FREE

a month ago

glad to hear it please mark the bounty solved if your issue is resolved

cheers


Status changed to Open brody 28 days ago


Status changed to Solved pragencypro 27 days ago

