a month ago
Account: jrconsole@gmail.com
Project: Campus Pulse
Service: Postgres DB
Plan: Pro
Region: US East
Started: 2026-05-28 ~20:13 UTC (restart action), problems pre-existed for hours
────────────────────────────────────────────
CURRENT STATE (production fully down)
────────────────────────────────────────────
After clicking Restart on our production Postgres service at ~20:05 UTC, the
service has not come back up. The dashboard reports "service isn't running or is
in an unexpected state" and our app cannot connect.
We have not modified or deleted the volume. ~150 MB of data lives there and we
need it preserved.
────────────────────────────────────────────
WHY WE RESTARTED — pre-existing severe I/O degradation
────────────────────────────────────────────
For hours before the restart, this Postgres instance had been exhibiting
severe storage-layer slowness on writes:
- Database is ~150 MB total. Volume usage shows $0.10. Nowhere near capacity.
- A trivial single-row UPDATE on a small table consistently takes 4–6 minutes
  to complete, reproducible on every write.
- pg_stat_activity for the running UPDATE shows the session sitting on:
    - LWLock / WALWrite (~90 seconds)
    - IO / WALWrite
    - IO / WALSync (until completion at ~5 minutes)
  i.e., 100% of the time is in WAL durability operations, not query execution.
- pg_database_size() and pg_ls_waldir() (pure filesystem stat() calls) time
  out, a strong indicator of filesystem-level unresponsiveness.
- Your own /* railway:dataui */ dashboard queries are also timing out at the
  30s statement_timeout against this DB. Examples in our logs from
  20:01–20:07 UTC on 2026-05-28:
    - 20:01:58 UTC, pid 55706: pg_stat_activity / pg_stat_database aggregation
    - 20:02:54 UTC, pid 55774: pg_stat_user_indexes / pg_stat_user_tables query
    - 20:03:26 UTC, pid 55778: pg_database_size() / pg_ls_waldir() / pg_class size query
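For anyone wanting to reproduce the wait-event breakdown above, here is a minimal sketch of how periodic pg_stat_activity samples could be tallied. It assumes the session running the UPDATE is polled (for example once per second) with the query shown; the helper name wal_wait_share and the polling approach are illustrative assumptions, not the exact method used for this ticket.

```python
from collections import Counter

# Query to poll while the UPDATE is in flight. wait_event_type and wait_event
# are standard pg_stat_activity columns; %(pid)s is a placeholder for the
# backend pid of the slow session.
SAMPLE_SQL = """
SELECT wait_event_type, wait_event
FROM pg_stat_activity
WHERE pid = %(pid)s
"""

# The three WAL-related waits named in the report above.
WAL_EVENTS = {("LWLock", "WALWrite"), ("IO", "WALWrite"), ("IO", "WALSync")}


def wal_wait_share(samples):
    """Return (fraction of samples spent in WAL waits, per-event counts).

    samples is a list of (wait_event_type, wait_event) tuples, one per poll.
    """
    counts = Counter(samples)
    total = sum(counts.values())
    if total == 0:
        return 0.0, counts
    wal = sum(n for event, n in counts.items() if event in WAL_EVENTS)
    return wal / total, counts
```

A share close to 1.0 across the whole statement would match the pattern described above, where the session is waiting on WAL write and sync rather than executing the query.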
────────────────────────────────────────────
WHAT WE'VE ALREADY RULED OUT
────────────────────────────────────────────
- No replication configured: synchronous_standby_names is empty,
  pg_stat_replication returns 0 rows.
- No archiving configured: pg_stat_archiver counters are 0/null.
- No lock contention: pg_stat_activity shows no Lock waits during hangs;
  no idle-in-transaction or prepared-transaction holds.
- No application-side issue: this is reproducible from any client; staging
  and other environments running the same application code are unaffected.
- CPU and memory metrics on the DB service are normal and low.
- DB is tiny (150 MB), so this is not a capacity, bloat, or query-plan issue.
The pattern is consistent with underlying volume / host I/O degradation
specific to this instance.
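The replication, archiving, and lock checks above can be expressed as a small set of catalog queries. The following is a rough sketch under stated assumptions: the views and settings are standard PostgreSQL ones, but the CHECKS table, the run_checks helper, and the fetch_one callable are hypothetical names introduced only for illustration.

```python
# Each entry: (what it rules out, SQL to run, predicate on the first result row).
CHECKS = [
    ("synchronous replication",
     "SHOW synchronous_standby_names",
     lambda row: row[0] == ""),
    ("streaming replicas",
     "SELECT count(*) FROM pg_stat_replication",
     lambda row: row[0] == 0),
    ("WAL archiving",
     "SELECT archived_count, failed_count FROM pg_stat_archiver",
     lambda row: (row[0] or 0) == 0 and (row[1] or 0) == 0),
    ("heavyweight lock waits",
     "SELECT count(*) FROM pg_stat_activity WHERE wait_event_type = 'Lock'",
     lambda row: row[0] == 0),
    ("idle-in-transaction sessions",
     "SELECT count(*) FROM pg_stat_activity WHERE state = 'idle in transaction'",
     lambda row: row[0] == 0),
    ("prepared transactions",
     "SELECT count(*) FROM pg_prepared_xacts",
     lambda row: row[0] == 0),
]


def run_checks(fetch_one):
    """Run every check and return {label: True if that cause is ruled out}.

    fetch_one(sql) must return the first result row as a tuple; how it talks
    to the database (driver, connection) is left to the caller.
    """
    return {label: bool(ok(fetch_one(sql))) for label, sql, ok in CHECKS}
```

With a DB-API cursor, fetch_one could be a two-line wrapper that calls cursor.execute(sql) and returns cursor.fetchone(). The lock-wait check is only meaningful if it is run while a write is actually hanging.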
────────────────────────────────────────────
WHAT WE NEED
────────────────────────────────────────────
- Recover the service. We strongly suspect Postgres may be stuck in slow WAL
  recovery on the degraded volume; please confirm whether that's the case
  and whether it's progressing.
- Investigate the underlying volume / host. The pre-restart wait-event
  pattern and your own dashboard queries timing out point at the storage
  layer rather than Postgres itself.
- If the volume / host is unhealthy, please migrate the existing volume to
  a healthy host rather than recreating it from scratch; we need the
  ~150 MB of data preserved.
- Please do NOT have us redeploy or take further actions on the service
  until you've had a chance to look. Interrupting WAL recovery on a sick
  disk can compound the damage.
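On the question of whether WAL recovery is progressing: while Postgres is in crash recovery it refuses client connections, so the server log is the practical place to look. Below is a minimal sketch that classifies the startup state from log lines. The matched phrases are standard PostgreSQL startup messages; the recovery_state helper and its state labels are hypothetical, and the exact log line layout on this platform is an assumption.

```python
# Standard PostgreSQL startup messages, matched as substrings.
# Later markers in a log indicate further progress through recovery.
MARKERS = [
    ("automatic recovery in progress", "crash recovery started"),
    ("redo starts at", "replaying WAL"),
    ("redo done at", "WAL replay finished"),
    ("database system is ready to accept connections", "ready"),
]


def recovery_state(log_lines):
    """Return the state implied by the most recent recognised log message."""
    state = "unknown"
    for line in log_lines:
        for needle, label in MARKERS:
            if needle in line:
                state = label
    return state
```

If the log stays at "replaying WAL" for a long time with no "redo done at" line, recovery has started but not finished, which would be consistent with slow replay on a degraded volume.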
────────────────────────────────────────────
CONTACT
────────────────────────────────────────────
Best contact: russ@shmoodyapp.com
Available now and through resolution.
Thanks — happy to provide additional pg_stat_activity captures, full logs,
or read access to the project as needed.
4 Replies
Status changed to Awaiting Railway Response Railway • 26 days ago
a month ago
We identified a NIC issue on the host your Postgres database was running on and have since resolved it. Your service should be recovering now.
Going forward, if you want to protect against single-host failures like this, we offer a high-availability Postgres option (https://docs.railway.com/databases/postgresql-ha) that can convert your existing service into a multi-node cluster backed by Patroni, etcd, and HAProxy with automatic failover. You can set it up from your Postgres service under Settings > High Availability.
Status changed to Awaiting User Response Railway • 26 days ago
Status changed to Solved mykal • 26 days ago
a month ago
Hi there, the service is still down and I'm wondering if I just need to restart, or if that might interrupt the recovery.
Status changed to Awaiting Railway Response Railway • 26 days ago
a month ago
Hello!
It looked healthy when I took a peek, but I redeployed your service just to be safe. I can see in the logs that it's ready to accept connections now. Please let me know if you run into any other problems.
Status changed to Awaiting User Response Railway • 26 days ago
Status changed to Solved mykal • 26 days ago
a month ago
Looks good now! Thanks!
Status changed to Awaiting Railway Response Railway • 26 days ago
Status changed to Solved russriser • 26 days ago
