
Postgres Service won't restart - Railway Central Station

Postgres Service won't restart

russriser

PRO · OP

a month ago

Account: jrconsole@gmail.com

Project: Campus Pulse

Service: Postgres DB

Plan: Pro

Region: US East

Started: 2026-05-28 ~20:13 UTC (restart action), problems pre-existed for hours

────────────────────────────────────────────

CURRENT STATE (production fully down)

────────────────────────────────────────────

After clicking Restart on our production Postgres service at ~20:05 UTC, the service has not come back up. The dashboard reports "service isn't running or is in an unexpected state" and our app cannot connect.

We have not modified or deleted the volume. ~150 MB of data lives there and we need it preserved.

────────────────────────────────────────────

WHY WE RESTARTED — pre-existing severe I/O degradation

────────────────────────────────────────────

For hours before the restart, this Postgres instance had been exhibiting severe storage-layer slowness on writes:

- Database is ~150 MB total. Volume usage shows $0.10. Nowhere near capacity.
- A trivial single-row UPDATE on a small table consistently takes 4–6 minutes to complete, reproducible on every write.
- pg_stat_activity for the running UPDATE shows the session sitting on:
  1. LWLock / WALWrite (~90 seconds)
  2. IO / WALWrite
  3. IO / WALSync (until completion at ~5 minutes)
  i.e., 100% of the time is in WAL durability operations, not query execution.
- pg_database_size() and pg_ls_waldir() (pure filesystem stat() calls) time out, a strong indicator of filesystem-level unresponsiveness.
- Your own /* railway:dataui */ dashboard queries are also timing out at the 30s statement_timeout against this DB. Examples in our logs from 20:01–20:07 UTC on 2026-05-28:
  - 20:01:58 UTC pid 55706: pg_stat_activity / pg_stat_database aggregation
  - 20:02:54 UTC pid 55774: pg_stat_user_indexes / pg_stat_user_tables query
  - 20:03:26 UTC pid 55778: pg_database_size() / pg_ls_waldir() / pg_class size query
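For anyone wanting to reproduce a wait-event breakdown like the one above: one approach is to sample pg_stat_activity for the stuck backend at a fixed interval and tally where the samples land. The following is only a sketch under assumptions. The `summarize_waits` and `wal_share` helpers, the `%s` placeholder style, and the sample data are illustrative and are not the captures from this incident.

```python
from collections import Counter

# Query one might run about once per second for the backend executing the UPDATE.
# The %s placeholder is an assumption (psycopg-style); other drivers differ.
SAMPLE_QUERY = """
SELECT wait_event_type, wait_event
FROM pg_stat_activity
WHERE pid = %s
"""

# Wait events that correspond to writing or syncing WAL.
WAL_EVENTS = {("LWLock", "WALWrite"), ("IO", "WALWrite"), ("IO", "WALSync")}


def summarize_waits(samples):
    """Return {(wait_event_type, wait_event): fraction of samples}.

    A sample of (None, None) means the backend was not waiting at that instant.
    """
    counts = Counter(samples)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {event: n / total for event, n in counts.items()}


def wal_share(summary):
    """Fraction of samples spent in WAL write/sync wait events."""
    return sum(frac for event, frac in summary.items() if event in WAL_EVENTS)


# Hypothetical one-per-second samples shaped like the pattern described above.
samples = (
    [("LWLock", "WALWrite")] * 90
    + [("IO", "WALWrite")] * 60
    + [("IO", "WALSync")] * 150
)
summary = summarize_waits(samples)
print(f"WAL share of samples: {wal_share(summary):.0%}")
```

Sampling only approximates time spent, so a short wait between samples can be missed; a tighter interval narrows that gap at the cost of more queries against an already slow instance.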

────────────────────────────────────────────

WHAT WE'VE ALREADY RULED OUT

────────────────────────────────────────────

- No replication configured: synchronous_standby_names is empty, pg_stat_replication returns 0 rows.
- No archiving configured: pg_stat_archiver counters are 0/null.
- No lock contention: pg_stat_activity shows no Lock waits during hangs; no idle-in-transaction or prepared-transaction holds.
- No application-side issue: this is reproducible from any client; staging and other environments running the same application code are unaffected.
- CPU and memory metrics on the DB service are normal and low.
- DB is tiny (150 MB), so this is not a capacity, bloat, or query-plan issue.

The pattern is consistent with underlying volume / host I/O degradation specific to this instance.
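The database-side checks in the list above can be scripted so they are easy to re-run and attach to a ticket. A minimal sketch, assuming some `run_query(sql)` callable that returns a list of row tuples: the `CHECKS` table, `run_checks`, and the canned `fake_run_query` are hypothetical names for illustration, while the views and columns queried are standard PostgreSQL ones.

```python
# Each entry: (description, SQL, predicate that is True when the cause is ruled out).
CHECKS = [
    ("no synchronous standbys configured",
     "SHOW synchronous_standby_names",
     lambda rows: rows[0][0] == ""),
    ("no replication connections",
     "SELECT count(*) FROM pg_stat_replication",
     lambda rows: rows[0][0] == 0),
    ("no WAL archiving activity",
     "SELECT archived_count, failed_count FROM pg_stat_archiver",
     lambda rows: tuple(rows[0]) == (0, 0)),
    ("no sessions waiting on heavyweight locks",
     "SELECT count(*) FROM pg_stat_activity WHERE wait_event_type = 'Lock'",
     lambda rows: rows[0][0] == 0),
    ("no idle-in-transaction sessions",
     "SELECT count(*) FROM pg_stat_activity WHERE state = 'idle in transaction'",
     lambda rows: rows[0][0] == 0),
    ("no prepared transactions",
     "SELECT count(*) FROM pg_prepared_xacts",
     lambda rows: rows[0][0] == 0),
]


def run_checks(run_query):
    """Run every check through run_query(sql) and return {description: bool}."""
    return {desc: bool(ok(run_query(sql))) for desc, sql, ok in CHECKS}


def fake_run_query(sql):
    """Stand-in for a real database call; returns canned, hypothetical rows."""
    if sql.startswith("SHOW"):
        return [("",)]
    if "pg_stat_archiver" in sql:
        return [(0, 0)]
    return [(0,)]


for desc, passed in run_checks(fake_run_query).items():
    print(f"{'ok  ' if passed else 'FAIL'} {desc}")
```

In real use, `fake_run_query` would be replaced by a function that executes the SQL through whatever driver is at hand. On a degraded instance these queries may themselves hit the statement timeout, which is worth recording too.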

────────────────────────────────────────────

WHAT WE NEED

────────────────────────────────────────────

1. Recover the service. We strongly suspect Postgres may be stuck in slow WAL recovery on the degraded volume; please confirm whether that's the case and whether it's progressing.
2. Investigate the underlying volume / host. The pre-restart wait-event pattern and your own dashboard queries timing out point at the storage layer rather than Postgres itself.
3. If the volume / host is unhealthy, please migrate the existing volume to a healthy host rather than recreating it from scratch. We need the ~150 MB of data preserved.
4. Please do NOT have us redeploy or take further actions on the service until you've had a chance to look; interrupting WAL recovery on a sick disk can compound the damage.
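On the question of whether Postgres is in WAL recovery: the server log usually says so directly. Below is a rough sketch that maps standard PostgreSQL startup log messages to a coarse state. The `latest_state` helper and the sample log lines (including the LSN) are made up for illustration and are not from this service's logs.

```python
# Substrings of standard PostgreSQL server log messages, mapped to a coarse state.
MARKERS = [
    ("database system is ready to accept connections", "ready"),
    ("redo done at", "recovery: WAL replay finished"),
    ("redo starts at", "recovery: replaying WAL"),
    ("automatic recovery in progress", "recovery: started"),
]


def classify(line):
    """Return the state implied by one log line, or None if it is not a marker."""
    for needle, state in MARKERS:
        if needle in line:
            return state
    return None


def latest_state(log_lines):
    """State implied by the most recent recognizable line, else 'unknown'."""
    for line in reversed(log_lines):
        state = classify(line)
        if state is not None:
            return state
    return "unknown"


# Hypothetical excerpt; the message wording is PostgreSQL's, the LSN is invented.
log_lines = [
    "LOG:  database system was not properly shut down; automatic recovery in progress",
    "LOG:  redo starts at 0/1000028",
    "FATAL:  the database system is starting up",
]
print(latest_state(log_lines))
```

A state of "recovery: replaying WAL" that persists only shows that replay began, not how fast it is moving, so it cannot by itself distinguish slow recovery from a stalled disk. That still needs someone with host-level visibility.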

────────────────────────────────────────────

CONTACT

────────────────────────────────────────────

Best contact: russ@shmoodyapp.com

Available now and through resolution.

Thanks. Happy to provide additional pg_stat_activity captures, full logs, or read access to the project as needed.

Solved

4 Replies

Status changed to Awaiting Railway Response Railway • 26 days ago

mykal

EMPLOYEE

a month ago

We identified a NIC issue on the host your Postgres database was running on and have since resolved it. Your service should be recovering now.

Going forward, if you want to protect against single-host failures like this, we offer a [high-availability Postgres](https://docs.railway.com/databases/postgresql-ha) option that can convert your existing service into a multi-node cluster backed by Patroni, etcd, and HAProxy with automatic failover. You can set it up from your Postgres service under Settings > High Availability.

Status changed to Awaiting User Response Railway • 26 days ago

Status changed to Solved mykal • 26 days ago

russriser

PRO · OP

a month ago

Hi there, the service is still down and I'm wondering if I just need to restart, or if that might interrupt the recovery.

Status changed to Awaiting Railway Response Railway • 26 days ago

mykal

EMPLOYEE

a month ago

Hello!

It looked healthy when I took a peek, but I redeployed your service just to be safe. I can see in the logs that it's ready to accept connections now. Please let me know if you run into any other problems.

Status changed to Awaiting User Response Railway • 26 days ago

Status changed to Solved mykal • 26 days ago

russriser

PRO · OP

a month ago

Looks good now! Thanks!

Status changed to Awaiting Railway Response Railway • 26 days ago

Status changed to Solved russriser • 26 days ago
