Unexplained Instance Terminations
apophenist
PROOP

2 months ago


Summary

Our Node.js/Next.js app experiences ~2 unexplained terminations/day. Process killed externally with no error logged. Correlated with PostgreSQL connection resets at exact timestamps.

Environment

- Service: polymode-server (Next.js custom server with Socket.io)

- Region: Amsterdam EU

- Memory Limit: 2GB, since increased to 6GB

- Observed Memory Usage: 60-70MB heap, 130-155MB RSS (well under limit)

- Database: Railway PostgreSQL (same project)

Issue Description

Server instances are terminated without any error, warning, or signal logged. We have comprehensive error handling including process.on('uncaughtException') and process.on('unhandledRejection') handlers, plus memory logging every 5 minutes. None fire before termination, suggesting the process is killed externally (SIGKILL).
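
For context, periodic memory logging of the kind described can be sketched in a few lines. This is an illustrative sketch, not the author's actual code -- the 5-minute interval matches the logs below, but the helper name and exact format are assumptions:

```javascript
// Sketch: log heap and RSS every 5 minutes in the
// "[Memory] Heap: X / Y, RSS: Z" shape seen in the logs.
const MB = 1024 * 1024;

function formatMemory() {
  const { heapUsed, heapTotal, rss } = process.memoryUsage();
  return `[Memory] Heap: ${Math.round(heapUsed / MB)}MB / ` +
         `${Math.round(heapTotal / MB)}MB, RSS: ${Math.round(rss / MB)}MB`;
}

// unref() so the timer alone doesn't keep the process alive.
setInterval(() => console.log(formatMemory()), 5 * 60 * 1000).unref();
```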

Evidence

Crash #1: 2026-01-17 ~10:01 UTC

Application Logs (before crash):

09:24:16 [Socket.io] Cleaned up resources for socket: HWHVqDviEUdhlZgpAAAD

09:28:46 [Memory] Heap: 67MB / 69MB, RSS: 149MB

09:33:46 [Memory] Heap: 67MB / 69MB, RSS: 149MB

09:38:47 [Memory] Heap: 67MB / 70MB, RSS: 150MB

09:43:46 [Memory] Heap: 67MB / 70MB, RSS: 150MB

09:48:46 [Memory] Heap: 67MB / 70MB, RSS: 150MB

09:53:47 [Memory] Heap: 67MB / 70MB, RSS: 150MB

09:58:47 [Memory] Heap: 67MB / 70MB, RSS: 150MB

10:01:47 [Socket Module] Initial io value: false <-- NEW INSTANCE STARTING

10:01:47 Prisma schema loaded from prisma/schema.prisma

Key observations:

- Memory completely stable at 67MB heap, 150MB RSS

- Server idle for ~37 minutes (last socket disconnect at 09:24)

- No error logged between last memory log (09:58) and restart (10:01)

PostgreSQL Logs (correlated timestamp):

10:01:34 [59141] LOG: could not receive data from client: Connection reset by peer

10:01:34 [59143] LOG: could not receive data from client: Connection reset by peer

10:01:34 [59139] LOG: could not accept SSL connection: EOF detected

10:01:34 [59138] LOG: could not receive data from client: Connection reset by peer

10:01:34 [59142] LOG: could not receive data from client: Connection reset by peer

10:01:34 [59140] LOG: could not receive data from client: Connection reset by peer

10:01:34 [59066] LOG: could not receive data from client: Connection reset by peer

10:01:34 [59137] LOG: could not receive data from client: Connection reset by peer

10:01:34 [59136] LOG: could not receive data from client: Connection reset by peer

10:01:34 [59067] LOG: could not receive data from client: Connection reset by peer

Timeline: DB sees 10 connections drop at 10:01:34 -> New server instance starts 10:01:47 (~13 seconds later)

Crash #2: 2026-01-17 ~08:53 UTC

PostgreSQL Logs:

08:53:31 [59048] LOG: could not receive data from client: Connection reset by peer

(14 connection resets total)

Application Logs:

08:53:46 > Server listening at http://0.0.0.0:8080 as production

Timeline: DB sees 14 connections drop at 08:53:31 -> New server instance starts 08:53:46 (~15 seconds later)

What We've Ruled Out

1. Memory issues: Heap stable at 60-70MB, RSS at 130-155MB (limit is 2GB)

2. Application crashes: No uncaught exceptions or unhandled rejections logged

3. Health check timeouts: Per Railway docs, health checks only run at deployment time

4. Database issues: PostgreSQL was healthy; connection resets are a symptom of our server being killed

5. Code deployments: No deployments were in progress during these terminations

Our Error Handling Code (none of this fired):

```javascript
process.on('uncaughtException', (error) => {
  console.error('[FATAL] Uncaught Exception:', error);
  console.error('[FATAL] Stack:', error.stack);
  setTimeout(() => process.exit(1), 1000);
});

process.on('unhandledRejection', (reason, promise) => {
  console.error('[FATAL] Unhandled Rejection at:', promise);
  console.error('[FATAL] Reason:', reason);
});
```

Questions for Railway

1. Can you see why our instance was terminated at these timestamps?

- 2026-01-17T10:01:34 UTC

- 2026-01-17T08:53:31 UTC

2. Is there idle instance management or resource rebalancing that could cause this?

3. Are there platform-level watchdogs or timeouts we should be aware of?

The PostgreSQL connection resets point to the server being killed externally. We need help understanding why.

We have a training program coming up and are super concerned about stability.

Solved · $20 Bounty

Pinned Solution

apophenist
PROOP

2 months ago

Have found a Prisma/Alpine/OpenSSL segfault issue that matches my symptoms. I am hoping a switch to a different Docker base image will help. I will monitor and report back.

14 Replies

Railway
BOT

2 months ago

This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.

Status changed to Open by Railway about 2 months ago


apophenist
PROOP

2 months ago

UPDATE: Added signal handlers (SIGTERM, SIGINT, SIGHUP) and exit handlers to diagnose termination cause. None fired before crashes, confirming process receives SIGKILL.

New crash: 2026-01-18 ~08:57:20 UTC

- Last log: 08:57:20 [Memory] Heap: 64MB / 71MB, RSS: 142MB

- New instance started: 08:57:39

- No SIGTERM, no exit handler fired - process killed instantly

SIGKILL cannot be caught by application code.

It appears Railway is sending SIGKILL (or triggering the OOM killer) instead of SIGTERM for graceful shutdown -- is anyone else experiencing this?
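
Registering handlers for every catchable termination signal is a few lines; a minimal sketch (not the author's actual code -- `registerSignalHandlers` is a hypothetical helper) looks like this. Note that SIGKILL and SIGSTOP genuinely cannot be trapped; Node.js will throw if you try:

```javascript
// Sketch: trap the catchable termination signals plus process exit.
// SIGKILL cannot be handled by design -- if none of these ever fire,
// the process was killed with SIGKILL (or died outside Node entirely).
function registerSignalHandlers(onSignal) {
  const signals = ['SIGTERM', 'SIGINT', 'SIGHUP'];
  for (const sig of signals) {
    process.on(sig, () => onSignal(sig));
  }
  process.on('exit', (code) => console.log(`[Exit] code=${code}`));
  return signals;
}

registerSignalHandlers((sig) => {
  console.log(`[Signal] Received ${sig}, shutting down gracefully`);
  process.exit(0);
});
```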


douefranck
FREE

2 months ago

what is your Custom Start Command in Railway?


douefranck
FREE

2 months ago

and is Serverless enabled for your service?


douefranck

what is your Custom Start Command in Railway?

apophenist
PROOP

2 months ago

Custom Start Command: pnpm start (which runs node server.cjs)

Railway.toml config:

```toml
[deploy]
startCommand = "pnpm start"
healthcheckPath = "/api/health"
healthcheckTimeout = 300
restartPolicyType = "ON_FAILURE"
restartPolicyMaxRetries = 3
```

Serverless is disabled


apophenist
PROOP

2 months ago

Have just changed to:
startCommand = "node server.cjs"

Will test and observe


douefranck
FREE

2 months ago

good news, switching to node server.cjs was the right move. that's officially documented here: https://docs.railway.com/guides/nodejs-sigterm

pnpm start was definitely swallowing the sigterm signal, which explains why none of your handlers fired. now your app will actually receive the shutdown signal and can close db connections properly

test it for 24-48 hours. if the crashes stop, problem solved, it was the signal handling

if crashes continue even with node server.cjs, then something else is killing your instance

the postgres connection resets you saw were just symptoms of your app being killed without gracefully closing connections. i think direct node startup fixes that
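
A graceful shutdown along the lines douefranck describes might look like the following sketch. The resource handles are placeholders for the app's actual objects (HTTP server, Socket.io, Prisma client); the helper name and timeout are assumptions:

```javascript
// Sketch: on SIGTERM, close all resources with a hard deadline.
// `resources` is a list of objects exposing an async close() or
// disconnect(); in the real app these would be the HTTP server,
// the Socket.io instance, and the Prisma client.
async function gracefulShutdown(resources, timeoutMs = 10000) {
  const closeAll = Promise.all(
    resources.map((r) => (r.close ? r.close() : r.disconnect()))
  );
  const deadline = new Promise((_, reject) =>
    setTimeout(() => reject(new Error('shutdown timed out')), timeoutMs).unref()
  );
  await Promise.race([closeAll, deadline]);
}

process.on('SIGTERM', async () => {
  try {
    await gracefulShutdown([/* server, io, prisma */]);
    process.exit(0);
  } catch (err) {
    console.error('[Shutdown] forced exit:', err);
    process.exit(1);
  }
});
```

With `pnpm start` in front, the signal lands on the pnpm wrapper process rather than node, so handlers like this never get a chance to run.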


douefranck

good news, switching to node server.cjs was the right move. …

apophenist
PROOP

2 months ago

Thanks very much for the valuable input here. I will observe and report back as soon as I have clarity about the state of the crashing.


douefranck
FREE

2 months ago

no problem , good luck with the training program!


apophenist
PROOP

2 months ago

Update: Changed start command to node server.cjs (direct, no pnpm wrapper). Crashes persist with no SIGTERM/SIGINT/exit handlers firing.

[Memory] Heap: 71MB / 74MB, RSS: 150MB

Prisma schema loaded from prisma/schema.prisma

... (new instance starting - no signal logged before this)

Serverless is NOT enabled. Memory stable at 71MB (6GB limit).

We have handlers for SIGTERM, SIGINT, SIGHUP, and exit - none fire. This confirms SIGKILL with no opportunity for graceful shutdown.

Railway team, it appears we've ruled out internal reasons for failure. Can you check your internal logs for why our instance is being terminated? The process is almost certainly being killed externally.


brody

2 months ago

We are not killing the app for any reason, please continue to debug this with the community.


brody

We are not killing the app for any reason, please continue to debug this with the community.

apophenist
PROOP

2 months ago

Thanks for the response. Will continue to troubleshoot for now -- I'm currently looking at known faults with Prisma


apophenist
PROOP

2 months ago

Have found a Prisma/Alpine/OpenSSL segfault issue that matches my symptoms. I am hoping a switch to a different Docker base image will help. I will monitor and report back.


brody

2 months ago

I will say that I have seen Alpine images cause a good deal of instability, mainly in the networking stack, but I wouldn't be surprised if the instabilities extended beyond that. In all cases, switching to a Debian-based image solved the user's issues.
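
For reference, the base-image switch being discussed is typically a one-line Dockerfile change. The image tags and build steps below are illustrative, not the author's actual Dockerfile:

```dockerfile
# Before: Alpine uses musl libc, with which Prisma's query engine
# has had known segfault issues in some OpenSSL/musl combinations.
# FROM node:20-alpine

# After: a Debian-based "slim" image uses glibc and is the commonly
# recommended fix for these crashes.
FROM node:20-slim
WORKDIR /app
COPY package.json pnpm-lock.yaml ./
RUN corepack enable && pnpm install --frozen-lockfile
COPY . .
RUN pnpm build
CMD ["node", "server.cjs"]
```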


apophenist
PROOP

2 months ago

Thank you @Brody and @douefranck for the input on this. After 16 hours of stability, it now appears the root cause was the Alpine base image causing crashes. Appreciate your help.


Status changed to Solved by brody about 2 months ago

