a month ago
Project: ATXMats-Ai-System
Project ID: ef99aec1-3fe6-41f9-9654-ef9ef81ae5bd
Environment: production
Region: us-west2
Plan: Pro
Affected services:
- Postgres (the managed database service in this project)
- atx-agent-ensemble (service ID e0baebcc-5c62-461c-92ba-05cc7d54032c),
which depends on Postgres via DATABASE_URL over private networking
Symptom:
Since approximately 11:30 UTC on 2026-04-09, the Postgres service in this
project has stopped accepting new connections from any client.
- Direct connection via DATABASE_PUBLIC_URL (junction.proxy.rlwy.net:59560)
returns "connection timeout expired". This was consistent across 15
consecutive attempts, one every 60s over a 16-minute window.
- Earlier attempts (before the consistent timeout) returned
"server closed the connection unexpectedly ... server terminated
abnormally before or while processing the request."
- The private-network URL (postgres.railway.internal:5432) is equally
unreachable — four app service redeploys attempted today all hung in
the FastAPI startup event at engine.connect(), which eventually caused
Railway to mark the deploys FAILED.
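For anyone reproducing the connection checks above, here is a minimal sketch of a repeated reachability probe. It uses only a plain TCP connect from the Python standard library rather than psycopg, so an "open" result shows only that the proxy accepts sockets, not that Postgres is healthy. The function names and default values are illustrative assumptions, not the script used for the results in this post.

```python
import socket
import time


def tcp_probe(host: str, port: int, timeout: float = 5.0) -> str:
    """Try one TCP connect and classify the outcome as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all within the deadline (packets dropped or host hung).
        return "timeout"
    except OSError as exc:
        # Covers refused connections, DNS failures, unreachable networks.
        return f"error: {exc.__class__.__name__}"


def probe_repeatedly(host: str, port: int, attempts: int = 15,
                     interval: float = 60.0, timeout: float = 5.0) -> list:
    """Probe several times, sleeping `interval` seconds between attempts."""
    results = []
    for i in range(attempts):
        results.append(tcp_probe(host, port, timeout))
        if i < attempts - 1:
            time.sleep(interval)
    return results
```

Calling `probe_repeatedly("junction.proxy.rlwy.net", 59560)` would mirror the 15 attempts at 60s intervals described above. A run of "timeout" results rather than refused connections suggests the problem sits at or behind the proxy rather than in the client.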
Failed atx-agent-ensemble deployment IDs from today:
- e2333dc8-4a29-47fb-8fdf-4f8b7ab98816
- 02bce2c8-1e82-4977-b312-07a453c0c8c4
- 5a1bd68d-b4ab-459a-adaf-85d4624055da
- 50bb7159-82f3-40e2-87d9-1c948c59cac9
The currently-active app deployment (ffde9c2c-3972-4a63-9d15-998d00a4b1d8,
from 2026-04-09 01:27 UTC) still returns 200 on /health (no DB call), but
any endpoint that queries Postgres times out. The container's pool appears
to be either stale or blocked waiting on a dead socket.
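Because /health makes no DB call, it cannot reveal this kind of outage. A separate readiness check that exercises the database under a deadline would. The sketch below is framework-agnostic and uses hypothetical names (`run_with_deadline`, `readiness`, `db_check`); it is not the app's actual code, and `db_check` stands in for whatever callable runs a trivial query.

```python
import threading
from typing import Callable, Dict, Tuple


def run_with_deadline(check: Callable[[], None], seconds: float) -> str:
    """Run check() in a daemon thread; return 'ok', 'failed', or 'timeout'."""
    outcome: Dict[str, str] = {}

    def worker() -> None:
        try:
            check()
            outcome["status"] = "ok"
        except Exception:
            outcome["status"] = "failed"

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(seconds)
    if thread.is_alive():
        # The check is still blocked (for example on a dead socket).
        # The thread is not cancelled; it is simply no longer waited on.
        return "timeout"
    return outcome["status"]


def readiness(db_check: Callable[[], None],
              seconds: float = 2.0) -> Tuple[int, Dict[str, str]]:
    """Map the database check outcome to an HTTP-style status and body."""
    status = run_with_deadline(db_check, seconds)
    return (200 if status == "ok" else 503), {"database": status}
```

With a check like this, a blocked pool would surface as a 503 with `{"database": "timeout"}` instead of a healthy-looking 200.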
Dashboard status:
The Railway dashboard reports the Postgres service's deployment status as
SUCCESS (last successful deployment
4f577c1f-3e0d-4093-90b0-d75632a2e20a, from 2026-04-05), but the actual DB
process is clearly not accepting connections.
Things I've tried:
1. railway redeploy -y on atx-agent-ensemble (4 attempts, all FAILED)
2. git push with an empty commit to trigger a clean deploy (FAILED)
3. git push with a code fix that adds lock_timeout/statement_timeout
guards around Base.metadata.create_all() — build succeeded, startup
still hangs (FAILED)
4. railway restart -s Postgres --yes — the CLI command itself hung
indefinitely and had to be killed from the client side
5. Direct psycopg connection to the public proxy URL — 15/15 timeouts
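For context on attempt 3, the sketch below shows roughly what lock_timeout/statement_timeout guards around a DDL step can look like. It is not the actual fix that was pushed. It assumes a psycopg 3 style connection whose `execute()` accepts a SQL string and that is already inside a transaction; with SQLAlchemy the strings would be wrapped in `text()`. The helper names and default values are hypothetical.

```python
from typing import Callable


def apply_timeout_guards(conn, lock_timeout_ms: int = 5000,
                         statement_timeout_ms: int = 30000) -> None:
    """Limit lock waits and statement runtime for the current transaction."""
    # Bare integers are interpreted by PostgreSQL as milliseconds.
    # int() keeps anything but a number out of the SQL string.
    conn.execute(f"SET LOCAL lock_timeout = {int(lock_timeout_ms)}")
    conn.execute(f"SET LOCAL statement_timeout = {int(statement_timeout_ms)}")


def guarded_ddl(conn, run_ddl: Callable[[], None],
                lock_timeout_ms: int = 5000,
                statement_timeout_ms: int = 30000) -> None:
    """Apply the guards, then run the DDL callable (e.g. a create_all wrapper)."""
    apply_timeout_guards(conn, lock_timeout_ms, statement_timeout_ms)
    run_ddl()
```

These settings only take effect once a session exists, so they cannot bound a hang that happens while the connection is still being established, which is likely why the startup in attempt 3 still hung. For that phase the relevant knob is a connect-level timeout such as libpq's `connect_timeout` parameter.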
Postgres service logs (latest entries from railway logs --service Postgres):
2026-04-09 01:28:31.023 UTC [14508] LOG: SSL error: unexpected eof while reading
2026-04-09 01:28:31.023 UTC [14508] LOG: could not receive data from client: Connection reset by peer
2026-04-09 01:28:45.887 UTC [82] LOG: checkpoint starting: time
2026-04-09 01:28:49.222 UTC [82] LOG: checkpoint complete
2026-04-09 01:33:45.319 UTC [82] LOG: checkpoint starting: time
2026-04-09 01:33:45.437 UTC [82] LOG: checkpoint complete
2026-04-09 11:41:27.999 UTC [15354] LOG: could not receive data from client: Connection timed out
There are no entries after 11:41:27 UTC, which is consistent with the
Postgres process being unable to log, accept new clients, or respond to
the control plane.
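To make the silence in the logs easy to quantify, here is a small sketch that parses the timestamps from the excerpt above and reports the largest gap between consecutive entries. It assumes every line starts with a `YYYY-MM-DD HH:MM:SS.mmm` timestamp, as in the pasted output; the function names are illustrative.

```python
from datetime import datetime, timedelta

LOG_LINES = [
    "2026-04-09 01:28:31.023 UTC [14508] LOG: SSL error: unexpected eof while reading",
    "2026-04-09 01:28:31.023 UTC [14508] LOG: could not receive data from client: Connection reset by peer",
    "2026-04-09 01:28:45.887 UTC [82] LOG: checkpoint starting: time",
    "2026-04-09 01:28:49.222 UTC [82] LOG: checkpoint complete",
    "2026-04-09 01:33:45.319 UTC [82] LOG: checkpoint starting: time",
    "2026-04-09 01:33:45.437 UTC [82] LOG: checkpoint complete",
    "2026-04-09 11:41:27.999 UTC [15354] LOG: could not receive data from client: Connection timed out",
]


def parse_ts(line: str) -> datetime:
    """Parse the leading 23-character timestamp of a Postgres log line."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")


def largest_gap(lines):
    """Return (gap, start, end) for the longest pause between adjacent entries."""
    stamps = [parse_ts(line) for line in lines]
    gaps = [(later - earlier, earlier, later)
            for earlier, later in zip(stamps, stamps[1:])]
    return max(gaps)


def silence_since_last(lines, now: datetime) -> timedelta:
    """How long the log has been quiet as of `now` (same timezone as the log)."""
    return now - parse_ts(lines[-1])
```

On the excerpt as pasted, `largest_gap(LOG_LINES)` picks out the interval between the 01:33:45.437 checkpoint entry and the 11:41:27.999 entry, and `silence_since_last` gives the time elapsed since that final line.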
Status page:
The public status page (status.railway.com) shows only "Degraded
Dashboard Performance" (slug QHUNBGIO) for today — no database or
infrastructure incident that matches this symptom.
Request:
Please restart the Postgres service from the control plane side (the
railway restart CLI hung, so I suspect the service is stuck in a state
my client can't dislodge). If a simple restart doesn't help, please
check whether the underlying DB instance has crashed or hit a resource
limit. This is blocking a production Stripe webhook deployment — the
webhook endpoint is configured and signed events are being sent, but
the app can't load the secret without a successful redeploy.
Business impact:
- Live production Stripe webhook is receiving events against a stale
container that can't process them (they return 200 but aren't posted
to the ledger).
- Amazon SP-API settlement ingest (scheduled Celery task) will fail
on the next run (6:15 AM CT tomorrow if not before).
- All ledger / accounting endpoints are down for internal operators.
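On the first impact above: Stripe treats a non-2xx response as a failed delivery and retries it later, so acknowledging with 200 only after the ledger write succeeds would keep events from being dropped during an outage like this. The following is a hedged sketch with hypothetical names (`handle_webhook`, `post_to_ledger`). Signature verification is omitted, and this is not the app's actual handler.

```python
from typing import Callable, Tuple


def handle_webhook(event: dict,
                   post_to_ledger: Callable[[dict], None]) -> Tuple[int, str]:
    """Return (status_code, body); acknowledge only after the write succeeds."""
    try:
        post_to_ledger(event)
    except Exception as exc:
        # A 503 signals a failed delivery so the sender can retry later.
        return 503, f"ledger unavailable: {exc.__class__.__name__}"
    return 200, "ok"
```

Combined with a bounded database timeout, a handler shaped like this fails fast with a 503 rather than returning 200 for an event that never reached the ledger.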
Happy to provide any additional diagnostic output. Thank you.
1 Reply
Status changed to Awaiting Railway Response Railway • about 1 month ago
a month ago
This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.
Status changed to Open Railway • about 1 month ago
a month ago
When you redeploy your Postgres service, does it ever say something like "Ready to accept connections"?
Or are there errors?