ATXMATS-Ai-System Help

Question

Project: ATXMats-Ai-System

Project ID: ef99aec1-3fe6-41f9-9654-ef9ef81ae5bd

Environment: production

Region: us-west2

Plan: Pro

Affected services:

- Postgres (the managed database service in this project)

- atx-agent-ensemble (service ID e0baebcc-5c62-461c-92ba-05cc7d54032c), which depends on Postgres via DATABASE_URL over private networking

Symptom:

Since approximately 11:30 UTC on 2026-04-09, the Postgres service in this project has stopped accepting new connections from any client.

- Direct connection via DATABASE_PUBLIC_URL (junction.proxy.rlwy.net:59560) returns "connection timeout expired". This was consistent across 15 consecutive attempts, one every 60 s, over a 16-minute window.

- Earlier attempts (before the timeouts became consistent) returned "server closed the connection unexpectedly ... server terminated abnormally before or while processing the request."

- The private-network URL (postgres.railway.internal:5432) is equally unreachable: all four app service redeploys attempted today hung in the FastAPI startup event at engine.connect(), which eventually caused Railway to mark the deploys FAILED.
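For anyone wanting to reproduce the reachability check without a Postgres driver, here is a minimal sketch of a TCP-level probe loop. It is not the exact script I ran (that used psycopg). The function names `probe_tcp` and `probe_repeatedly` are illustrative, and the defaults simply mirror the 15 attempts at 60 s intervals described above. It only tests whether a TCP connection can be opened, so an "ok" here does not prove that Postgres itself completes a handshake.

```python
import socket
import time


def probe_tcp(host: str, port: int, timeout: float = 10.0) -> str:
    """Attempt one TCP connection and classify the outcome as a short label."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.timeout:
        # No SYN-ACK within the timeout: packets are being dropped or the
        # endpoint is not answering at all.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # DNS failure, unreachable network, and similar errors.
        return f"error: {exc.__class__.__name__}"


def probe_repeatedly(host, port, attempts=15, interval=60.0,
                     timeout=10.0, sleep=time.sleep):
    """Probe `attempts` times, pausing `interval` seconds between tries."""
    results = []
    for i in range(attempts):
        results.append(probe_tcp(host, port, timeout))
        if i < attempts - 1:
            sleep(interval)
    return results
```

Calling `probe_repeatedly("junction.proxy.rlwy.net", 59560)` would exercise the public proxy endpoint from this ticket. The private hostname only resolves from inside the project's network, so that probe would have to run from a service in the same environment.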

Failed atx-agent-ensemble deployment IDs from today:

\- e2333dc8-4a29-47fb-8fdf-4f8b7ab98816

\- 02bce2c8-1e82-4977-b312-07a453c0c8c4

\- 5a1bd68d-b4ab-459a-adaf-85d4624055da

\- 50bb7159-82f3-40e2-87d9-1c948c59cac9

The currently active app deployment (ffde9c2c-3972-4a63-9d15-998d00a4b1d8, from 2026-04-09 01:27 UTC) still returns 200 on /health (which makes no DB call), but any endpoint that queries Postgres times out. The container's connection pool appears to be either stale or blocked waiting on a dead socket.

Status dashboard claims:

The Postgres service's deployment status is reported as SUCCESS in the Railway dashboard (last successful deployment 4f577c1f-3e0d-4093-90b0-d75632a2e20a, from 2026-04-05), but the actual DB process is clearly not accepting connections.

Things I've tried:

1. railway redeploy -y on atx-agent-ensemble (4 attempts, all FAILED)

2. git push with an empty commit to trigger a clean deploy (FAILED)

3. git push with a code fix that adds lock_timeout/statement_timeout guards around Base.metadata.create_all(). The build succeeded, but startup still hangs (FAILED)

4. railway restart -s Postgres --yes. The CLI command itself hung indefinitely and had to be killed from the client side

5. Direct psycopg connection to the public proxy URL: 15/15 timeouts
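A note on item 3: lock_timeout and statement_timeout only take effect once a session exists, so they cannot bound a hang in connection establishment itself. Below is a minimal sketch of one way to make startup fail fast instead of hanging, assuming the blocking connect can be wrapped in a zero-argument callable. The name `connect_with_deadline` and the `connect` parameter are hypothetical and not part of the deployed code. In the app, `connect` might be something like `lambda: engine.connect().close()`.

```python
import threading
from typing import Callable


def connect_with_deadline(connect: Callable[[], object],
                          deadline_s: float = 15.0) -> str:
    """Run a blocking connect() in a daemon thread; give up after deadline_s.

    Returns "ok", "timeout", or "error: <ExceptionName>".
    """
    outcome = {}

    def worker():
        try:
            connect()
            outcome["status"] = "ok"
        except Exception as exc:
            outcome["status"] = f"error: {exc.__class__.__name__}"

    # A daemon thread cannot keep the process alive if connect() never returns.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(deadline_s)

    if thread.is_alive():
        # connect() is still blocked; report it instead of waiting forever.
        return "timeout"
    return outcome.get("status", "error: unknown")
```

With this shape, a startup hook could log the returned label and exit non-zero on anything other than "ok", so a failed deploy shows an explicit reason rather than a silent hang. The abandoned thread is not cancelled, only left behind. A driver-level connect timeout, where the driver supports one, is the cleaner fix.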

Postgres service logs (latest entries from railway logs --service Postgres):

2026-04-09 01:28:31.023 UTC [14508] LOG: SSL error: unexpected eof while reading

2026-04-09 01:28:31.023 UTC [14508] LOG: could not receive data from client: Connection reset by peer

2026-04-09 01:28:45.887 UTC [82] LOG: checkpoint starting: time

2026-04-09 01:28:49.222 UTC [82] LOG: checkpoint complete

2026-04-09 01:33:45.319 UTC [82] LOG: checkpoint starting: time

2026-04-09 01:33:45.437 UTC [82] LOG: checkpoint complete

2026-04-09 11:41:27.999 UTC [15354] LOG: could not receive data from client: Connection timed out

There are no entries after 11:41:27 UTC, which is consistent with the Postgres process being unable to log, accept new clients, or respond to the control plane.

Status page:

The public status page ([status.railway.com](http://status.railway.com)) shows only "Degraded Dashboard Performance" (slug QHUNBGIO) for today, with no database or infrastructure incident that matches this symptom.

Request:

Please restart the Postgres service from the control plane side (the railway restart CLI hung, so I suspect the service is stuck in a state my client can't dislodge). If a simple restart doesn't help, please check whether the underlying DB instance has crashed or hit a resource limit. This is blocking a production Stripe webhook deployment: the webhook endpoint is configured and signed events are being sent, but the app can't load the secret without a successful redeploy.

Business impact:

- The live production Stripe webhook is receiving events against a stale container that can't process them (requests return 200, but the events aren't posted to the ledger).

- The Amazon SP-API settlement ingest (a scheduled Celery task) will fail on its next run (6:15 AM CT tomorrow, if not before).

- All ledger / accounting endpoints are down for internal operators.

Happy to provide any additional diagnostic output. Thank you.