a month ago
Unexplained Instance Terminations
Summary
Our Node.js/Next.js app experiences roughly two unexplained terminations per day. The process is killed externally with no error logged, and each kill correlates with a burst of PostgreSQL connection resets at the exact same timestamp.
Environment
- Service: polymode-server (Next.js custom server with Socket.io)
- Region: Amsterdam EU
- Memory Limit: 2GB (since increased to 6GB)
- Observed Memory Usage: 60-70MB heap, 130-155MB RSS (well under limit)
- Database: Railway PostgreSQL (same project)
Issue Description
Server instances are terminated without any error, warning, or signal logged. We have comprehensive error handling including process.on('uncaughtException') and process.on('unhandledRejection') handlers, plus memory logging every 5 minutes. None fire before termination, suggesting the process is killed externally (SIGKILL).
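For reference, a minimal sketch of this kind of instrumentation (the 5-minute interval matches what we run; the exact helper names here are illustrative, not our production code):

```javascript
// Periodic memory logging plus crash handlers, as described above.
const MB = 1024 * 1024;

function logMemory() {
  const { heapUsed, heapTotal, rss } = process.memoryUsage();
  console.log(
    `[Memory] Heap: ${Math.round(heapUsed / MB)}MB / ${Math.round(heapTotal / MB)}MB, ` +
    `RSS: ${Math.round(rss / MB)}MB`
  );
}

// unref() so the timer never keeps an otherwise-finished process alive
setInterval(logMemory, 5 * 60 * 1000).unref();

process.on('uncaughtException', (err) => {
  console.error('[FATAL] Uncaught Exception:', err);
});
process.on('unhandledRejection', (reason) => {
  console.error('[FATAL] Unhandled Rejection:', reason);
});
```

If the process were dying inside Node (OOM in the heap, an uncaught throw), one of these would log before death; a silent stop with none of them firing is what points at an external SIGKILL.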
Evidence
Crash #1: 2026-01-17 ~10:01 UTC
Application Logs (before crash):
09:24:16 [Socket.io] Cleaned up resources for socket: HWHVqDviEUdhlZgpAAAD
09:28:46 [Memory] Heap: 67MB / 69MB, RSS: 149MB
09:33:46 [Memory] Heap: 67MB / 69MB, RSS: 149MB
09:38:47 [Memory] Heap: 67MB / 70MB, RSS: 150MB
09:43:46 [Memory] Heap: 67MB / 70MB, RSS: 150MB
09:48:46 [Memory] Heap: 67MB / 70MB, RSS: 150MB
09:53:47 [Memory] Heap: 67MB / 70MB, RSS: 150MB
09:58:47 [Memory] Heap: 67MB / 70MB, RSS: 150MB
10:01:47 [Socket Module] Initial io value: false <-- NEW INSTANCE STARTING
10:01:47 Prisma schema loaded from prisma/schema.prisma
Key observations:
- Memory completely stable at 67MB heap, 150MB RSS
- Server idle for ~37 minutes (last socket disconnect at 09:24)
- No error logged between last memory log (09:58) and restart (10:01)
PostgreSQL Logs (correlated timestamp):
10:01:34 [59141] LOG: could not receive data from client: Connection reset by peer
10:01:34 [59143] LOG: could not receive data from client: Connection reset by peer
10:01:34 [59139] LOG: could not accept SSL connection: EOF detected
10:01:34 [59138] LOG: could not receive data from client: Connection reset by peer
10:01:34 [59142] LOG: could not receive data from client: Connection reset by peer
10:01:34 [59140] LOG: could not receive data from client: Connection reset by peer
10:01:34 [59066] LOG: could not receive data from client: Connection reset by peer
10:01:34 [59137] LOG: could not receive data from client: Connection reset by peer
10:01:34 [59136] LOG: could not receive data from client: Connection reset by peer
10:01:34 [59067] LOG: could not receive data from client: Connection reset by peer
Timeline: DB sees 10 connections drop at 10:01:34 -> New server instance starts 10:01:47 (~13 seconds later)
Crash #2: 2026-01-17 ~08:53 UTC
PostgreSQL Logs:
08:53:31 [59048] LOG: could not receive data from client: Connection reset by peer
(14 connection resets total)
Application Logs:
08:53:46 > Server listening at http://0.0.0.0:8080 as production
Timeline: DB sees 14 connections drop at 08:53:31 -> New server instance starts 08:53:46 (~15 seconds later)
What We've Ruled Out
1. Memory issues: Heap stable at 60-70MB, RSS at 130-155MB (limit is 2GB)
2. Application crashes: No uncaught exceptions or unhandled rejections logged
3. Health check timeouts: Per Railway docs, health checks only run at deployment time
4. Database issues: PostgreSQL was healthy; connection resets are a symptom of our server being killed
5. Code deployments: No deployments were in progress during these terminations
Our Error Handling Code (none of this fired):
process.on('uncaughtException', (error) => {
  console.error('[FATAL] Uncaught Exception:', error);
  console.error('[FATAL] Stack:', error.stack);
  // Give the logs a second to flush before exiting
  setTimeout(() => process.exit(1), 1000);
});

process.on('unhandledRejection', (reason, promise) => {
  console.error('[FATAL] Unhandled Rejection at:', promise);
  console.error('[FATAL] Reason:', reason);
});
Questions for Railway
1. Can you see why our instance was terminated at these timestamps?
- 2026-01-17T10:01:34 UTC
- 2026-01-17T08:53:31 UTC
2. Is there idle instance management or resource rebalancing that could cause this?
3. Are there platform-level watchdogs or timeouts we should be aware of?
The PostgreSQL connection resets point to the server being killed externally. We need help understanding why.
We have a training program coming up and are very concerned about stability.
Pinned Solution
a month ago
Have found a Prisma/Alpine/OpenSSL segfault issue which matches my issue. I am hoping a switch to a different docker base image is going to help. I will monitor and report back.
14 Replies
Status changed to Open Railway • 27 days ago
a month ago
UPDATE: Added signal handlers (SIGTERM, SIGINT, SIGHUP) and exit handlers to diagnose termination cause. None fired before crashes, confirming process receives SIGKILL.
New crash: 2026-01-18 ~08:57:20 UTC
- Last log: 08:57:20 [Memory] Heap: 64MB / 71MB, RSS: 142MB
- New instance started: 08:57:39
- No SIGTERM, no exit handler fired - process killed instantly
SIGKILL cannot be caught by application code.
It appears Railway is sending SIGKILL (or triggering the OOM killer) instead of SIGTERM for graceful shutdown -- is anyone else experiencing this?
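For anyone wanting to reproduce this diagnostic, registering handlers for every catchable signal looks like this (a sketch; the handler bodies are illustrative):

```javascript
// Register handlers for every termination signal that CAN be caught.
// If none of these ever log before the process dies, the remaining
// explanation is SIGKILL, which is uncatchable by design.
for (const sig of ['SIGTERM', 'SIGINT', 'SIGHUP']) {
  process.on(sig, () => {
    console.log(`[Signal] Received ${sig}, shutting down gracefully`);
    process.exit(0);
  });
}

process.on('exit', (code) => {
  // Fires on normal exits and caught signals, but never on SIGKILL.
  console.log(`[Exit] Process exiting with code ${code}`);
});
```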
douefranck
what is your Custom Start Command in Railway?
a month ago
Custom Start Command: pnpm start (which runs node server.cjs)
Railway.toml config:
[deploy]
startCommand = "pnpm start"
healthcheckPath = "/api/health"
healthcheckTimeout = 300
restartPolicyType = "ON_FAILURE"
restartPolicyMaxRetries = 3
Serverless is disabled
a month ago
Have just changed to:
startCommand = "node server.cjs"
Will test and observe
a month ago
good news, switching to node server.cjs was the right move. that's officially documented here: https://docs.railway.com/guides/nodejs-sigterm
pnpm start was definitely swallowing the sigterm signal, which explains why none of your handlers fired. now your app will actually receive the shutdown signal and can close db connections properly
test it for 24-48 hours. if the crashes stop, problem solved, it was the signal handling
if crashes continue even with node server.cjs, then something else is killing your instance
the postgres connection resets you saw were just symptoms of your app being killed without gracefully closing connections. i think direct node startup fixes that
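A graceful shutdown path along those lines might look like this once SIGTERM actually reaches the app (a sketch; `server` and `prisma` are placeholders for the app's real http.Server and PrismaClient instances):

```javascript
// Sketch: install a SIGTERM handler that drains HTTP connections and
// closes the Prisma pool before exiting. `server` (http.Server) and
// `prisma` (PrismaClient) are placeholders, not the app's actual code.
function installGracefulShutdown(server, prisma) {
  process.once('SIGTERM', async () => {
    console.log('[Signal] SIGTERM received, draining connections');
    // Stop accepting new connections; in-flight requests finish first.
    await new Promise((resolve) => server.close(resolve));
    // Disconnect cleanly so Postgres logs no "connection reset by peer".
    await prisma.$disconnect();
    process.exit(0);
  });
}
```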
douefranck
good news, switching to node server.cjs was the right move. that's officially documented here: https://docs.railway.com/guides/nodejs-sigterm […]
a month ago
Thanks very much for the valuable input here. I will observe and report back as soon as I have clarity about the state of the crashing.
a month ago
Update: Changed start command to node server.cjs (direct, no pnpm wrapper). Crashes persist with no SIGTERM/SIGINT/exit handlers firing.
[Memory] Heap: 71MB / 74MB, RSS: 150MB
Prisma schema loaded from prisma/schema.prisma
... (new instance starting - no signal logged before this)
Serverless is NOT enabled. Memory stable at 71MB (6GB limit).
We have handlers for SIGTERM, SIGINT, SIGHUP, and exit - none fire. This confirms SIGKILL with no opportunity for graceful shutdown.
Railway team, it appears we've ruled out internal reasons for failure. Can you check your internal logs for why our instance is being terminated? The process is almost certainly being killed externally.
a month ago
We are not killing the app for any reason, please continue to debug this with the community.
brody
We are not killing the app for any reason, please continue to debug this with the community.
a month ago
Thanks for the response. Will continue to troubleshoot for now -- I'm currently looking at known faults with Prisma
a month ago
Have found a Prisma/Alpine/OpenSSL segfault issue which matches my issue. I am hoping a switch to a different docker base image is going to help. I will monitor and report back.
a month ago
I will say that I have seen Alpine images cause a good deal of instability, mainly in the networking stack, but I wouldn't be surprised if the instabilities extended beyond that. In all cases, switching to a Debian-based image solved the user's issues.
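If the app is built from a Dockerfile, that switch is typically a one-line base-image change from a musl-based Alpine tag to a glibc/Debian one. The sketch below assumes a pnpm + Next.js build; the image tags and build steps are illustrative, not the poster's actual Dockerfile:

```dockerfile
# Before: musl-based Alpine image, where Prisma's OpenSSL/engine
# combination has known segfault reports.
# FROM node:20-alpine

# After: glibc/Debian-based image ("slim" keeps the size down).
FROM node:20-bookworm-slim

WORKDIR /app
COPY package.json pnpm-lock.yaml ./
RUN corepack enable && pnpm install --frozen-lockfile
COPY . .
RUN pnpm build
CMD ["node", "server.cjs"]
```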
25 days ago
Thank you @Brody and @douefranck for the input on this. After 16 hours of stability, it now appears the root cause was the Alpine base image causing the crashes. Appreciate your help.
Status changed to Solved brody • 25 days ago