Persistent "FATAL: the database system is starting up" error on PostgreSQL service
vince13
HOBBYOP

13 days ago

Hello Railway Team,

I am reaching out regarding a recurring issue with my PostgreSQL database service.

Frequently, the database becomes unavailable and throws the following error:

psql: error: connection to server at "localhost" (::1), port 5432 failed: FATAL: the database system is starting up

I have checked my resource usage, and it does not appear to be a storage issue—we are currently using less than 1GB of the 5GB allocated volume. Despite having plenty of space, the database seems to enter a recovery/startup loop unexpectedly.

This is causing downtime for my application. Could you please investigate if there are underlying infrastructure issues or suggest a permanent configuration fix to prevent the database from frequently entering this "starting up" state?

Looking forward to your recommendation for a long-term solution.

Best regards,

$10 Bounty

4 Replies

Status changed to Open Railway 13 days ago


This happens after the database crashes and Postgres starts replaying the write-ahead log (WAL). When you commit data to a Postgres database, the only thing guaranteed to be flushed to disk right away is the WAL. The table changes themselves are applied to in-memory buffers and are not guaranteed to reach disk until the next checkpoint.

So, when your database crashes for some reason, and everything in memory is lost, the next time it starts up, it needs to resort to replaying the WAL in order to get the tables back to the correct state.
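
To make the idea concrete, here is a toy model of the commit/checkpoint/crash/replay cycle. It is only a sketch of the concept: the class and method names are made up for illustration and this is not how Postgres is implemented internally.

```python
class ToyDatabase:
    """Toy model of WAL-based crash recovery (illustrative only)."""

    def __init__(self):
        self.wal = []            # "on disk": survives a crash
        self.disk_table = {}     # "on disk": table state as of the last checkpoint
        self.checkpoint_pos = 0  # WAL position covered by the last checkpoint
        self.buffers = {}        # "in memory": lost on a crash

    def commit(self, key, value):
        self.wal.append((key, value))  # the WAL record is made durable first
        self.buffers[key] = value      # the table change lives in memory for now

    def checkpoint(self):
        # Flush in-memory changes to the table on disk and remember the WAL position.
        self.disk_table.update(self.buffers)
        self.checkpoint_pos = len(self.wal)

    def crash(self):
        self.buffers = {}  # everything in memory is gone

    def recover(self):
        # Start from the on-disk table, then replay WAL records written after the checkpoint.
        self.buffers = dict(self.disk_table)
        replayed = 0
        for key, value in self.wal[self.checkpoint_pos:]:
            self.buffers[key] = value
            replayed += 1
        return replayed


db = ToyDatabase()
db.commit("a", 1)
db.checkpoint()
db.commit("b", 2)   # committed, but not yet checkpointed
db.crash()
replayed = db.recover()
```

While `recover()` is running, a real server is in the state that produces the "starting up" error; the more WAL there is to replay since the last checkpoint, the longer that window lasts.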

Since the volume isn't running out of space, the crash could be caused by an OOM kill, or you might have serverless enabled, which can shut the database down abruptly. Can you share your database metrics?


Status changed to Awaiting User Response Railway 13 days ago


i-smuglov
FREETop 5% Contributor

13 days ago

Please check whether it is in serverless mode. If it is, your app probably tried to connect to Postgres before it was ready to accept connections, which is a classic race condition during cold starts or database restarts.

I have made a small async function to wake it up before write, just let me know if you need it.


Status changed to Awaiting Railway Response Railway 13 days ago


Status changed to Awaiting User Response Railway 13 days ago


13 days ago

Hey Vince,

That error is usually a symptom, not the root cause.

FATAL: the database system is starting up means Postgres is currently in its startup/recovery phase and is not ready to accept connections yet. The important question is why the Postgres service keeps getting back into that state.

Since you already confirmed that volume storage is below the limit, I would check these in this order:

1. Check whether the PostgreSQL service is restarting or crashing

Open the Postgres service → Deployments/Logs and look around the exact time the app goes down.

Search for messages like:

- database system was interrupted

- automatic recovery in progress

- redo starts / redo done

- database system is ready to accept connections

- received fast shutdown request

- out of memory / killed / OOM

- container restarted / crashed

If you see repeated recovery messages, then the starting up error is only what your app sees while Postgres is recovering. The actual issue is the restart/crash that happened just before it.
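
If you export the logs as text, a small script can pull out each recovery event together with the lines just before it. This is a rough sketch: the function name and the `context` parameter are my own, and it only matches two of the messages listed above.

```python
RECOVERY_MARKERS = (
    "database system was interrupted",
    "automatic recovery in progress",
)


def find_recovery_events(log_lines, context=3):
    """Return each recovery log line plus the `context` lines that preceded it."""
    events = []
    for i, line in enumerate(log_lines):
        if any(marker in line for marker in RECOVERY_MARKERS):
            start = max(0, i - context)
            events.append({"recovery_line": line, "preceding": log_lines[start:i]})
    return events


# Illustrative input only; real log lines will carry timestamps and more detail.
sample = [
    "LOG: checkpoint starting: time",
    "LOG: checkpoint complete",
    "LOG: database system was not properly shut down; automatic recovery in progress",
    "LOG: database system is ready to accept connections",
]
events = find_recovery_events(sample, context=2)
```

If the preceding lines never show a shutdown request, that points to the process being killed from outside rather than stopping on its own.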

2. Check memory and CPU, not only disk usage

Storage being under 1GB is good, but this can still happen if the Postgres container is being killed because of memory pressure or heavy queries/connections.

In the Metrics tab, check the time window where the error happened and look for:

- memory spikes near the plan limit

- CPU spikes

- restarts

- too many open connections

- long-running queries or sudden traffic bursts

If memory is spiking, the fix may be reducing connection count, adding pooling, optimizing heavy queries, or upgrading resources.

3. Check if Serverless/Sleep mode is enabled

If the database or dependent app is sleeping, the app can try to connect while Postgres is still waking up or recovering. That creates a cold-start race condition where the first connection attempt fails.

For a production database that needs stable uptime, I would avoid serverless/sleep mode and keep the database running continuously.

4. Double-check the connection host

Your error shows localhost (::1):5432.

That is fine only if you are running psql from inside the PostgreSQL container itself or through a local Railway command/proxy.

But if your application service is trying to connect to localhost, then it may be misconfigured. In Railway, the app should normally connect using the Postgres service variables, usually:

DATABASE_URL=${{Postgres.DATABASE_URL}}

or the individual variables:

PGHOST, PGPORT, PGUSER, PGPASSWORD, PGDATABASE.

After changing variables, make sure the app service is redeployed so the new values are applied.
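
A quick sanity check you could run at app startup is to parse the connection string and warn if it points at the local machine. This is a minimal sketch using only the standard library; the function name and the example hostnames are placeholders, not Railway values.

```python
from urllib.parse import urlparse

LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def points_to_localhost(database_url):
    """Return True if the URL's host is a loopback name or address."""
    host = urlparse(database_url).hostname  # strips brackets from IPv6 literals
    return host in LOCAL_HOSTS


# Hypothetical URLs for illustration.
local = points_to_localhost("postgresql://user:secret@localhost:5432/app")
remote = points_to_localhost("postgresql://user:secret@db.example.internal:5432/app")
```

In the app you would pass `os.environ["DATABASE_URL"]` and log a warning when the result is true.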

5. Add retry/backoff in the application

Even with the correct Railway configuration, the app should not permanently fail just because Postgres is unavailable for a few seconds during restart/recovery.

On startup and during connection acquisition, add retry logic with backoff, for example:

- retry every 2–5 seconds

- continue for 60–120 seconds

- only fail after the retries are exhausted

- log each failed attempt clearly

This prevents short recovery windows from becoming full app downtime.
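
As a rough sketch of that retry loop: `connect` below is whatever zero-argument function opens your database connection (for example a wrapper around your driver's connect call), and the function name and defaults are my own, chosen to match the ranges above.

```python
import logging
import time


def connect_with_retry(connect, delay=2.0, max_delay=5.0, timeout=120.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Call connect() until it succeeds or `timeout` seconds have passed."""
    deadline = clock() + timeout
    attempt = 0
    while True:
        attempt += 1
        try:
            return connect()
        except Exception as exc:  # in real code, narrow this to your driver's connection error
            remaining = deadline - clock()
            if remaining <= 0:
                raise  # retries exhausted: surface the last error
            wait = min(delay, remaining)
            logging.warning("DB connection attempt %d failed: %s; retrying in %.1fs",
                            attempt, exc, wait)
            sleep(wait)
            delay = min(delay * 1.5, max_delay)  # back off, capped at max_delay
```

The `sleep` and `clock` parameters exist only so the loop can be tested without real waiting; in the app you would leave them at their defaults.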

6. Add an application healthcheck that depends on DB readiness

If this is an API service, add a /health or /ready endpoint that only returns success when the app can actually connect to Postgres. Then configure Railway’s healthcheck path for the app service.

That way the app is not treated as ready before the database connection is usable.
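
Here is a minimal sketch of such an endpoint using only Python's standard library. `check_db` is a placeholder for your own function that runs something cheap like `SELECT 1` and raises if the database is not reachable; in a real app you would add the same route in whatever web framework you already use.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_health_handler(check_db):
    """Build a handler whose /health returns 200 only if check_db() succeeds."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/health":
                self.send_error(404)
                return
            try:
                check_db()  # should raise while Postgres is starting up
                status, body = 200, b"ok"
            except Exception:
                status, body = 503, b"database not ready"
            self.send_response(status)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep healthcheck polling out of the logs

    return HealthHandler
```

You would serve it with something like `HTTPServer(("0.0.0.0", port), make_health_handler(check_db)).serve_forever()`, where `port` is whatever port your service listens on.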

Recommended long-term fix:

- Disable serverless/sleep mode for the database if enabled.

- Make sure the app is using Railway’s Postgres DATABASE_URL or referenced Postgres variables, not hardcoded localhost.

- Check Postgres logs for the event immediately before recovery starts.

- Check Metrics for memory/CPU spikes and restarts.

- Add connection retry/backoff in the app.

- If memory is the issue, reduce connection count or add pooling.

- If logs show repeated recovery/crashes without memory pressure, high CPU, or serverless sleep, then this likely needs Railway staff to inspect the underlying service/volume/host.

So the main thing I would investigate is not the "FATAL: the database system is starting up" line itself. I would investigate what is forcing Postgres to restart or recover repeatedly.


Status changed to Awaiting Railway Response Railway 13 days ago


Status changed to Awaiting User Response Railway 13 days ago


vince13
HOBBYOP

12 days ago

Hello,

Following up with the logs you requested. You can see at 2026-05-09 14:32:20.960 UTC that the database reports:

"database system was not properly shut down; automatic recovery in progress"

The logs show frequent checkpoints but no recorded "smart" or "fast" shutdown signals, implying the process is being terminated externally.

I have attached the Log and Metrics screenshots for your review:

Can you confirm if this is being caused by an OOM (Out of Memory) kill or if the Serverless feature is forcing a shutdown during periods of perceived inactivity?

Best regards,


Status changed to Awaiting Railway Response Railway 12 days ago

