15 days ago
Hi Railway Support,
Our API service running in the US East (Virginia) region is experiencing completely silent crashes. The container is being killed repeatedly, and we've ruled out all application-side causes. We suspect a platform-level SIGKILL or orchestrator issue.
The Symptoms:
The container terminates silently and restarts ("Starting Container" appears in deploy logs).
Example: Feb 26, 17:57–17:58 UTC (4 rapid restarts), and 18:21 UTC.
Postgres logs show 10+ simultaneous `Connection reset by peer` errors at the exact moment of the crash, confirming the container was abruptly destroyed while holding active DB connections.
NONE of our Node.js exit handlers fire (we trap `SIGTERM`, `SIGINT`, `uncaughtException`, `unhandledRejection`, `exit`, and `beforeExit`). This confirms it is a `SIGKILL` from outside the process.
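For reference, a minimal sketch of trapping that full set of exit paths (illustrative, not our exact handler code):

```javascript
// Sketch: register a logger on every catchable exit path in Node.js.
// SIGKILL can never be caught, so if none of these fire before a
// restart, the kill came from outside the process.
function logExit(reason) {
  return (detail) => console.error(`[exit-trap] ${reason}:`, detail ?? '');
}

process.on('SIGTERM', logExit('SIGTERM'));
process.on('SIGINT', logExit('SIGINT'));
process.on('uncaughtException', logExit('uncaughtException'));
process.on('unhandledRejection', logExit('unhandledRejection'));
process.on('beforeExit', logExit('beforeExit'));
process.on('exit', logExit('exit')); // receives the exit code
```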
What We Have Ruled Out:
Not Application OOM: We log `process.memoryUsage()` every 10 seconds. Memory is rock-solid flat at 22–30 MB heap, ~100 MB RSS (our Node V8 heap limit is explicitly capped at 800 MB). There are no memory spikes.
Not Container OOM: We are well under our provisioned RAM limits.
Not Healthchecks: We completely removed healthchecks from railway.json to ensure the Railway proxy wasn't killing the container due to a timeout.
Not DB Load: Our DB is only 19MB total. Connections are strictly pooled (10 max) with 10s timeouts.
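A minimal sketch of the kind of periodic memory sampling described above (illustrative, not our exact logging code):

```javascript
// Sketch: log heap and RSS every 10 s. If RSS is flat right up to the
// restart, a container-level OOM kill is unlikely.
const MB = 1024 * 1024;

function sampleMemory() {
  const { rss, heapUsed, heapTotal } = process.memoryUsage();
  console.log(
    `rss=${(rss / MB).toFixed(1)}MB ` +
    `heapUsed=${(heapUsed / MB).toFixed(1)}MB ` +
    `heapTotal=${(heapTotal / MB).toFixed(1)}MB`
  );
}

sampleMemory();                        // log once at startup
const timer = setInterval(sampleMemory, 10_000);
timer.unref();                         // don't keep the process alive just for sampling
```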
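The pool limits above, sketched with node-postgres-style option names (the exact driver and option names are an assumption):

```javascript
// Sketch of the pool limits described: 10 connections max, 10 s timeouts.
// Option names follow node-postgres (pg) conventions; adapt to your driver.
const poolConfig = {
  max: 10,                          // hard cap on concurrent connections
  connectionTimeoutMillis: 10_000,  // fail fast if no connection is free
  idleTimeoutMillis: 10_000,        // recycle idle connections after 10 s
};

// With node-postgres: const pool = new (require('pg').Pool)(poolConfig);
console.log(`pool: max=${poolConfig.max}`);
```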
Questions for your team:
Can you check your internal orchestrator/hypervisor telemetry for Service ID `4b869ea8-a4db-4946-b9f8-5e555c9824f8` around 2026-02-26 17:57–18:26 UTC?
Specifically, what exact kernel/hypervisor event issued the `SIGKILL` at those timestamps? (Was it a host-level OOM killer, CPU throttling, or a Nomad/K8s pod eviction/rebalance?)
Are there any known noisy-neighbor or hardware stability issues on the host assigned to this workload in US East?
Any visibility you can provide from the host side would be greatly appreciated, as our application logs confirm the process is healthy right up until the SIGKILL.
Thanks
1 Reply
14 days ago
This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.
Status changed to Open Railway • 14 days ago
11 days ago
One thing worth checking is ephemeral disk exhaustion inside the container.
On Railway the container filesystem is limited, and if the disk fills up (often in /tmp, logs, or cached files) the process can terminate abruptly without triggering Node exit handlers. When that happens you often see exactly the pattern you're describing:
• sudden container restart
• database connections reset by peer
• no SIGTERM/SIGINT handlers firing
• memory metrics remaining stable
This is because the kernel aborts operations once the filesystem cannot allocate space.
You can confirm this by logging disk usage periodically:
df -h
or checking whether anything is writing heavily to /tmp or local storage.
Another thing to verify is the file descriptor limit:
ulimit -n
If the service opens many sockets (HTTP + Postgres pool + internal networking) hitting the FD limit can also terminate the process without a graceful shutdown.
Since the crashes happen under active DB connections, FD exhaustion or disk pressure would match the symptoms more closely than a platform SIGKILL.