ATXMATS-Ai-System Help
Anonymous
PRO, OP

a month ago

Project: ATXMats-Ai-System

Project ID: ef99aec1-3fe6-41f9-9654-ef9ef81ae5bd

Environment: production

Region: us-west2

Plan: Pro

Affected services:

- Postgres (the managed database service in this project)

- atx-agent-ensemble (service ID e0baebcc-5c62-461c-92ba-05cc7d54032c), which depends on Postgres via DATABASE_URL over private networking

Symptom:

Since approximately 11:30 UTC on 2026-04-09, the Postgres service in this project has stopped accepting new connections from any client.

- Direct connection via DATABASE_PUBLIC_URL (junction.proxy.rlwy.net:59560) returns "connection timeout expired". This was consistent across 15 consecutive attempts, one every 60s, over a 16-minute window.

- Earlier attempts (before the consistent timeout) returned "server closed the connection unexpectedly ... server terminated abnormally before or while processing the request."

- The private-network URL (postgres.railway.internal:5432) is equally unreachable: four app service redeploys attempted today all hung in the FastAPI startup event at engine.connect(), which eventually caused Railway to mark the deploys FAILED.
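
One way to keep a deploy from hanging at this point is to bound the startup connection check, so the process fails fast with a clear error instead of waiting on a dead socket. Below is a minimal Python sketch, assuming the blocking call can be wrapped in a plain function. call_with_deadline is a hypothetical helper, not part of FastAPI or SQLAlchemy.

```python
import threading


def call_with_deadline(fn, timeout_s):
    """Run fn() in a daemon thread and stop waiting after timeout_s seconds.

    Hypothetical helper: the worker is not cancelled, the caller just
    stops waiting for it and raises TimeoutError.
    """
    outcome = {}

    def worker():
        try:
            outcome["value"] = fn()
        except BaseException as exc:  # hand any failure back to the caller
            outcome["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)

    if thread.is_alive():
        raise TimeoutError(f"startup check exceeded {timeout_s}s")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


# Hypothetical usage inside the startup event, where `engine` is the
# app's SQLAlchemy engine:
#   call_with_deadline(lambda: engine.connect().close(), timeout_s=10)
```

This only changes how quickly the failure surfaces. It does not fix an unreachable database, and a stuck worker thread stays around until the process exits.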

Failed atx-agent-ensemble deployment IDs from today:

- e2333dc8-4a29-47fb-8fdf-4f8b7ab98816

- 02bce2c8-1e82-4977-b312-07a453c0c8c4

- 5a1bd68d-b4ab-459a-adaf-85d4624055da

- 50bb7159-82f3-40e2-87d9-1c948c59cac9

The currently-active app deployment (ffde9c2c-3972-4a63-9d15-998d00a4b1d8, from 2026-04-09 01:27 UTC) still returns 200 on /health (no DB call), but any endpoint that queries Postgres times out. The container's pool appears to be either stale or blocked waiting on a dead socket.

Status dashboard claims:

The Postgres service's deployment status is reported as SUCCESS in the Railway dashboard (last successful deployment 4f577c1f-3e0d-4093-90b0-d75632a2e20a from 2026-04-05). But the actual DB process is clearly not accepting connections.

Things I've tried:

1. railway redeploy -y on atx-agent-ensemble (4 attempts, all FAILED)

2. git push with an empty commit to trigger a clean deploy (FAILED)

3. git push with a code fix that adds lock_timeout/statement_timeout guards around Base.metadata.create_all(): build succeeded, startup still hangs (FAILED)

4. railway restart -s Postgres --yes: the CLI command itself hung indefinitely and had to be killed from the client side

5. Direct psycopg connection to the public proxy URL: 15/15 timeouts
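
For anyone who wants to reproduce the probe in item 5, a loop along these lines is enough. This is a sketch rather than the exact script used: probe is a hypothetical helper, and the connection function is passed in so the loop itself needs nothing beyond the standard library.

```python
import time


def probe(connect, attempts=15, interval_s=60.0, sleep=time.sleep):
    """Call connect() repeatedly and record the outcome of each attempt.

    Returns a list of (attempt_number, succeeded, detail) tuples.
    """
    results = []
    for i in range(attempts):
        try:
            conn = connect()
        except Exception as exc:
            results.append((i + 1, False, f"{type(exc).__name__}: {exc}"))
        else:
            conn.close()
            results.append((i + 1, True, "ok"))
        if i < attempts - 1:
            sleep(interval_s)  # wait between attempts, not after the last one
    return results


# Hypothetical usage with psycopg 3, reading the public URL from the
# environment (connect_timeout is the libpq connection parameter):
#   import os, psycopg
#   url = os.environ["DATABASE_PUBLIC_URL"]
#   for row in probe(lambda: psycopg.connect(url, connect_timeout=10)):
#       print(row)
```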

Postgres service logs (latest entries from railway logs --service Postgres):

2026-04-09 01:28:31.023 UTC [14508] LOG: SSL error: unexpected eof while reading

2026-04-09 01:28:31.023 UTC [14508] LOG: could not receive data from client: Connection reset by peer

2026-04-09 01:28:45.887 UTC [82] LOG: checkpoint starting: time

2026-04-09 01:28:49.222 UTC [82] LOG: checkpoint complete

2026-04-09 01:33:45.319 UTC [82] LOG: checkpoint starting: time

2026-04-09 01:33:45.437 UTC [82] LOG: checkpoint complete

2026-04-09 11:41:27.999 UTC [15354] LOG: could not receive data from client: Connection timed out

There are no entries after 11:41:27 UTC, which is consistent with the Postgres process being unable to log, receive new clients, or respond to the control plane.
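
To make the silence in the log concrete, the timestamps in the excerpt above can be diffed to find the longest stretch with no entries. A small sketch, assuming every line starts with a "YYYY-MM-DD HH:MM:SS.mmm" timestamp as in the excerpt; parse_ts and largest_gap are hypothetical helper names.

```python
from datetime import datetime

# Log lines copied from the excerpt above.
LOG_LINES = [
    "2026-04-09 01:28:31.023 UTC [14508] LOG: SSL error: unexpected eof while reading",
    "2026-04-09 01:28:31.023 UTC [14508] LOG: could not receive data from client: Connection reset by peer",
    "2026-04-09 01:28:45.887 UTC [82] LOG: checkpoint starting: time",
    "2026-04-09 01:28:49.222 UTC [82] LOG: checkpoint complete",
    "2026-04-09 01:33:45.319 UTC [82] LOG: checkpoint starting: time",
    "2026-04-09 01:33:45.437 UTC [82] LOG: checkpoint complete",
    "2026-04-09 11:41:27.999 UTC [15354] LOG: could not receive data from client: Connection timed out",
]


def parse_ts(line):
    """Parse the leading 23-character timestamp of a Postgres log line."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")


def largest_gap(lines):
    """Return (start, end) of the longest interval between consecutive entries."""
    stamps = sorted(parse_ts(line) for line in lines)
    return max(zip(stamps, stamps[1:]), key=lambda pair: pair[1] - pair[0])


if __name__ == "__main__":
    start, end = largest_gap(LOG_LINES)
    print(f"{start} -> {end} ({end - start})")
```

On this excerpt the longest gap is the roughly ten hours between the 01:33:45 checkpoint and the 11:41:27 entry. That may simply reflect an idle database rather than a fault, but it could be worth checking alongside the missing entries after 11:41:27.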

Status page:

The public status page (status.railway.com) shows only "Degraded Dashboard Performance" (slug QHUNBGIO) for today, with no database or infrastructure incident that matches this symptom.

Request:

Please restart the Postgres service from the control plane side (the railway restart CLI hung, so I suspect the service is stuck in a state my client can't dislodge). If a simple restart doesn't help, please check whether the underlying DB instance has crashed or hit a resource limit. This is blocking a production Stripe webhook deployment: the webhook endpoint is configured and signed events are being sent, but the app can't load the secret without a successful redeploy.

Business impact:

- Live production Stripe webhook is receiving events against a stale container that can't process them (they return 200 but aren't posted to the ledger).

- Amazon SP-API settlement ingest (scheduled Celery task) will fail on the next run (6:15 AM CT tomorrow if not before).

- All ledger / accounting endpoints are down for internal operators.
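
On the webhook point above: Stripe retries deliveries that do not receive a 2xx response, so one option is to return a non-2xx status whenever the ledger write fails, instead of acknowledging events that can't be processed. A minimal sketch, where handle_webhook and post_to_ledger are hypothetical names and signature verification is assumed to have happened already:

```python
def handle_webhook(event, post_to_ledger):
    """Return the HTTP status code to send back for a verified event.

    post_to_ledger is a hypothetical callable that persists the event
    and raises if the database is unreachable.
    """
    try:
        post_to_ledger(event)
    except Exception:
        # Not acknowledged, so the sender can retry the delivery later.
        return 503
    return 200
```

This doesn't help while the app can't redeploy, but it would keep events from being silently dropped the next time the database is unavailable.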

Happy to provide any additional diagnostic output. Thank you.

$10 Bounty

1 Reply

Status changed to Awaiting Railway Response by Railway, about 1 month ago


Railway
BOT

a month ago

This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.

Status changed to Open by Railway, about 1 month ago


When you redeploy your Postgres service, does it ever say something like "Ready to accept connections"?

Or are there errors?

