Postgres Service won't restart
russriser
PRO · OP

a month ago

Account: jrconsole@gmail.com

Project: Campus Pulse

Service: Postgres DB

Plan: Pro

Region: US East

Started: 2026-05-28 ~20:13 UTC (restart action), problems pre-existed for hours

────────────────────────────────────────────

CURRENT STATE (production fully down)

────────────────────────────────────────────

After clicking Restart on our production Postgres service at ~20:05 UTC, the
service has not come back up. The dashboard reports "service isn't running or is
in an unexpected state" and our app cannot connect.

We have not modified or deleted the volume. ~150 MB of data lives there and we
need it preserved.

────────────────────────────────────────────

WHY WE RESTARTED — pre-existing severe I/O degradation

────────────────────────────────────────────

For hours before the restart, this Postgres instance had been exhibiting
severe storage-layer slowness on writes:

  • Database is ~150 MB total. Volume usage shows $0.10. Nowhere near capacity.

  • A trivial single-row UPDATE on a small table consistently takes 4–6 minutes
    to complete, reproducible on every write.

  • pg_stat_activity for the running UPDATE shows the session sitting on:

    1. LWLock / WALWrite (~90 seconds)
    2. IO / WALWrite
    3. IO / WALSync (until completion at ~5 minutes)

    i.e., 100% of the time is in WAL durability operations, not query execution.

  • pg_database_size() and pg_ls_waldir() (pure filesystem stat() calls) time
    out, a strong indicator of filesystem-level unresponsiveness.

  • Your own /* railway:dataui */ dashboard queries are also timing out at the
    30s statement_timeout against this DB. Examples in our logs from
    20:01–20:07 UTC on 2026-05-28:

    • 20:01:58 UTC pid 55706: pg_stat_activity / pg_stat_database aggregation
    • 20:02:54 UTC pid 55774: pg_stat_user_indexes / pg_stat_user_tables query
    • 20:03:26 UTC pid 55778: pg_database_size() / pg_ls_waldir() / pg_class size query
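
For anyone who wants to reproduce the wait-event breakdown described above, here is a minimal Python sketch. It assumes you poll pg_stat_activity about once per second for the pid running the UPDATE and collect the rows yourself. The names WAIT_SAMPLE_SQL, summarize_waits and wal_share are hypothetical helpers, not part of any library, and the %s placeholder assumes a psycopg-style driver.

```python
# Query to run repeatedly against the backend executing the slow UPDATE.
# Returns (epoch seconds, wait_event_type, wait_event) for one pid.
WAIT_SAMPLE_SQL = """
SELECT extract(epoch FROM clock_timestamp()), wait_event_type, wait_event
FROM pg_stat_activity
WHERE pid = %s
"""


def summarize_waits(samples):
    """Sum the seconds spent in each (wait_event_type, wait_event) pair.

    `samples` is a time-ordered list of (seconds, wait_event_type, wait_event).
    Each interval between two consecutive samples is attributed to the wait
    event observed at the start of that interval.
    """
    totals = {}
    for (t0, wtype, wevent), (t1, _, _) in zip(samples, samples[1:]):
        key = (wtype, wevent)
        totals[key] = totals.get(key, 0.0) + (t1 - t0)
    return totals


def wal_share(totals):
    """Fraction of sampled time spent in the WALWrite or WALSync wait events."""
    total = sum(totals.values())
    if total == 0:
        return 0.0
    wal = sum(
        seconds
        for (_, wevent), seconds in totals.items()
        if wevent in ("WALWrite", "WALSync")
    )
    return wal / total
```

A wal_share close to 1.0 would correspond to the "100% of the time is in WAL durability operations" observation. Coarse sampling only approximates the real time split, so treat the numbers as indicative.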

────────────────────────────────────────────

WHAT WE'VE ALREADY RULED OUT

────────────────────────────────────────────

  • No replication configured: synchronous_standby_names is empty,
    pg_stat_replication returns 0 rows.

  • No archiving configured: pg_stat_archiver counters are 0/null.

  • No lock contention: pg_stat_activity shows no Lock waits during hangs;
    no idle-in-transaction or prepared-transaction holds.

  • No application-side issue: this is reproducible from any client; staging
    and other environments running the same application code are unaffected.

  • CPU and memory metrics on the DB service are normal and low.

  • DB is tiny (150 MB), so this is not a capacity, bloat, or query-plan issue.

The pattern is consistent with underlying volume / host I/O degradation
specific to this instance.
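
The database-side checks in the list above can be sketched as a small Python routine. This is only an illustration: CHECKS and run_checks are hypothetical names, and `cursor` is assumed to be any DB-API style cursor with execute() and fetchone(). The SQL uses standard PostgreSQL views and settings. The host metrics and cross-environment comparison are not covered here.

```python
# Each entry: (label, SQL, predicate on the first row that means "ruled out").
CHECKS = [
    ("no synchronous standbys",
     "SHOW synchronous_standby_names",
     lambda row: row[0] == ""),
    ("no replication clients",
     "SELECT count(*) FROM pg_stat_replication",
     lambda row: row[0] == 0),
    ("no WAL archiving activity",
     "SELECT archived_count, failed_count FROM pg_stat_archiver",
     lambda row: (row[0] or 0) == 0 and (row[1] or 0) == 0),
    ("no heavyweight-lock waits",
     "SELECT count(*) FROM pg_stat_activity WHERE wait_event_type = 'Lock'",
     lambda row: row[0] == 0),
    ("no idle-in-transaction sessions",
     "SELECT count(*) FROM pg_stat_activity WHERE state = 'idle in transaction'",
     lambda row: row[0] == 0),
    ("no prepared transactions",
     "SELECT count(*) FROM pg_prepared_xacts",
     lambda row: row[0] == 0),
]


def run_checks(cursor):
    """Run every check and return {label: True if that cause is ruled out}."""
    results = {}
    for label, sql, ruled_out in CHECKS:
        cursor.execute(sql)
        results[label] = bool(ruled_out(cursor.fetchone()))
    return results
```

On an instance in the state described here, these statistics queries may themselves be slow, so it is probably wise to run them with a statement timeout set.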

────────────────────────────────────────────

WHAT WE NEED

────────────────────────────────────────────

  1. Recover the service. We strongly suspect Postgres may be stuck in slow WAL
    recovery on the degraded volume; please confirm whether that's the case
    and whether it's progressing.

  2. Investigate the underlying volume / host. The pre-restart wait-event
    pattern and your own dashboard queries timing out point at the storage
    layer rather than Postgres itself.

  3. If the volume / host is unhealthy, please migrate the existing volume to
    a healthy host rather than recreating it from scratch; we need the
    ~150 MB of data preserved.

  4. Please do NOT have us redeploy or take further actions on the service
    until you've had a chance to look, since interrupting WAL recovery on a
    sick disk can compound the damage.
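
For the first item, one way to judge from the server log whether crash recovery is progressing is to look for the usual startup milestones. The sketch below is a rough illustration in Python. It assumes the log contains stock PostgreSQL message wording, and STAGES and latest_stage are hypothetical names introduced for this example.

```python
# Substrings of stock PostgreSQL server log messages, in the order they
# normally appear during startup after an unclean shutdown.
STAGES = [
    ("crash_detected", "automatic recovery in progress"),
    ("redo_started", "redo starts at"),
    ("redo_done", "redo done at"),
    ("ready", "database system is ready to accept connections"),
]


def latest_stage(log_lines):
    """Return the stage of the last recognised marker in the log, or None."""
    stage = None
    for line in log_lines:
        for name, marker in STAGES:
            if marker in line:
                stage = name
    return stage
```

If the result stays at "redo_started" for a long time, that would be consistent with recovery still replaying WAL on a slow volume. It does not by itself prove whether recovery is advancing or stalled.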

────────────────────────────────────────────

CONTACT

────────────────────────────────────────────

Best contact: russ@shmoodyapp.com

Available now and through resolution.

Thanks. Happy to provide additional pg_stat_activity captures, full logs,
or read access to the project as needed.

Solved

4 Replies

Status changed to Awaiting Railway Response Railway 26 days ago


a month ago

We identified a NIC issue on the host your Postgres database was running on and have since resolved it. Your service should be recovering now.

Going forward, if you want to protect against single-host failures like this, we offer a [high-availability Postgres](https://docs.railway.com/databases/postgresql-ha) option that can convert your existing service into a multi-node cluster backed by Patroni, etcd, and HAProxy with automatic failover. You can set it up from your Postgres service under Settings > High Availability.


Status changed to Awaiting User Response Railway 26 days ago


Status changed to Solved mykal 26 days ago

russriser
PRO · OP

a month ago

Hi there, the service is still down and I'm wondering if I just need to restart, or if that might interrupt the recovery.


Status changed to Awaiting Railway Response Railway 26 days ago


a month ago

Hello!

It looked healthy when I took a peek, but I redeployed your service just to be safe. I can see in the logs that it's ready to accept connections now. Please let me know if you run into any other problems.


Status changed to Awaiting User Response Railway 26 days ago


Status changed to Solved mykal 26 days ago



russriser
PRO · OP

a month ago

Looks good now! Thanks!


Status changed to Awaiting Railway Response Railway 26 days ago


Status changed to Solved russriser 26 days ago

