4 days ago
We are experiencing ongoing service restarts and don't have visibility into the root cause. My Cursor agent has created this summary and our ask.
-----------
Environment
Stack: NestJS backend (@savor/backend) running on Railway
DB: Railway PostgreSQL (postgres.railway.internal:5432)
Other services: Redis on Railway
Deploy time of current version: ~2025‑12‑01 20:03 PT
Symptoms
Backend process restarts periodically with logs like:
---
[Nest] 127 -12/01/2025, 4:21:27 AM LOG [NestApplication] Nest application successfully started +351ms
UsersController: syncing user data from frontend { auth0Id: '...', ... }
ELIFECYCLE Command failed.
Environment variables loaded from .env
Prisma schema loaded from prisma/schema.prisma
Datasource "db": PostgreSQL database "railway", schema "public" at "postgres.railway.internal:5432"
15 migrations found in prisma/migrations
No pending migrations to apply.
> @savor/backend@0.0.1 start:prod /app
> node dist/main
---
The pattern is consistent:
Service starts normally, maps all routes, logs “Nest application successfully started”.
On first or early /users/sync call (user sync endpoint), we see UsersController: syncing user data from frontend ….
Immediately after, ELIFECYCLE Command failed. appears, and Railway restarts the container.
There is no JS stack trace or error logged before the restart, even though we have:
---
process.on('uncaughtException', ...)
process.on('unhandledRejection', ...)
---
wired at the top of main.ts.
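For reference, a minimal sketch of what those handlers look like (illustrative, not our exact main.ts). Notably, these only catch JavaScript-level errors; a native crash terminates the process before either handler runs, which would be consistent with seeing no stack trace:

```typescript
// Minimal sketch of process-level handlers wired at the top of main.ts.
// These catch JS-level errors only: a native crash (e.g. SIGSEGV from a
// native addon or query engine) kills the process before either fires,
// which matches the absence of any stack trace in the logs.
process.on("uncaughtException", (err: Error) => {
  console.error("[fatal] uncaughtException:", err.stack ?? err);
});

process.on("unhandledRejection", (reason: unknown) => {
  console.error("[fatal] unhandledRejection:", reason);
});
```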
Mitigations already applied
To rule out application‑level causes, we have:
Disabled Redis user caching:
USER_CACHE_TTL=0 → guards/interceptors no longer call Redis get/setex/del.
Disabled Bull job queues:
Added BULL_ENABLED flag; when BULL_ENABLED=false we do not call BullModule.forRoot(...).
Disabled major AI workflows.
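The two kill switches above can be plumbed roughly like this (the flag names are from our setup, but the helper itself is a hypothetical sketch, not our actual AppModule code):

```typescript
// Hypothetical sketch of the two kill switches described above:
// BULL_ENABLED=false skips queue-module registration entirely, and
// USER_CACHE_TTL=0 makes guards/interceptors bypass Redis.

interface FeatureFlags {
  bullEnabled: boolean;  // when false, BullModule.forRoot(...) is never called
  userCacheTtl: number;  // when 0, skip Redis get/setex/del in the user guard
}

function readFlags(env: Record<string, string | undefined>): FeatureFlags {
  return {
    bullEnabled: env.BULL_ENABLED !== "false",
    userCacheTtl: Number(env.USER_CACHE_TTL ?? "3600"),
  };
}

// In the NestJS module, the imports array is then built conditionally, e.g.:
// imports: [CoreModule, ...(flags.bullEnabled ? [BullModule.forRoot(redisConfig)] : [])]

const flags = readFlags({ BULL_ENABLED: "false", USER_CACHE_TTL: "0" });
console.log(flags.bullEnabled, flags.userCacheTtl); // false 0
```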
Despite this, restarts continue with the same pattern, even when the app is mostly idle.
What I’m asking Railway for
For the restarts around these timestamps (example): 2025‑12‑01T04:21:27Z and 2025‑12‑01T04:30:58Z
on service backend / deployment IDs:
Please provide the exact exit code and signal for the Node process (e.g. exit 1, exit 137, SIGKILL, etc.).
Check your lower-level logs for that container for any native/runtime errors, such as OOM kills, Node/V8 crashes, Prisma or native-client panics or segmentation faults, or health-check failures that cause you to terminate the process.
Confirm whether any resource limits or health checks are currently configured for this service that could be killing the process despite stable CPU/memory at the app level.
Having the exit code/signal and any native error message around those restart times will let me distinguish between:
A platform/runtime issue (OOM, crash, health‑check kill), vs.
Something inside my app/Prisma stack that isn’t visible through JS error handlers.
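The exit-code/signal distinction can be seen locally with a quick Node experiment (illustrative only; on Railway this information lives in the platform's supervisor, which is why we're asking support for it):

```typescript
// Illustrative: how a supervisor sees a signal death vs. a clean exit.
// A process killed by a signal reports status=null plus the signal name;
// a shell wrapper re-encodes that as 128+signal (e.g. 137=SIGKILL, 139=SIGSEGV).
import { spawnSync } from "node:child_process";

// Spawn a child Node process that delivers SIGSEGV to itself.
const result = spawnSync(process.execPath, [
  "-e",
  "process.kill(process.pid, 'SIGSEGV')",
]);

console.log(`status=${result.status} signal=${result.signal}`);
// status=null signal=SIGSEGV
```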
---
Thank you for helping us solve this confounding issue!
7 Replies
4 days ago
Hey there! We've found the following might help you get unblocked faster:
If you find the answer from one of these, please let us know by solving the thread!
4 days ago
This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.
Status changed to Open ray-chen • 4 days ago
4 days ago
I'm facing this exact issue with a pretty much identical setup: NestJS, Prisma, Redis, and BullMQ.
No error logs. I tried disabling the BullMQ workers, and even extracted some workers into their own separate service in case it was a load issue; it's still crashing intermittently with no trace or error logs.
I was just about to disable user caching, but after stumbling across your post, it doesn't seem like that will do anything.
I just made a post here describing it too: https://station.railway.com/questions/nest-js-server-is-crashing-with-no-logs-a2d99b6d
Hope we find something helpful soon!
4 days ago
Same issue here - https://station.railway.com/questions/exit-code-c6aa8f4a
It seems Railway doesn't surface exit codes - I tried checking my app for resource bursts, but there are no exit codes, nothing.
Is asking support really the only way to get exit codes?
4 days ago
Found a strategy that works for me: wrap node like this inside the Dockerfile, so it emits the exit code even on hard failures:
CMD ["sh", "-c", "node dist/server.js & pid=$!; wait $pid; code=$?; echo \"[WRAPPER] node exited status=$code\" >&2; sleep 1; exit $code"]
With this I got logs like:
Segmentation fault
[WRAPPER] node exited status=139
(Status 139 = 128 + 11, i.e. the process was killed by SIGSEGV.)
4 days ago
Reviewing feedback above. Thanks!
3 days ago
Thanks @rafchik! I now have visibility on the failure.
What I'm seeing in the logs now is periodic restarts, preceded by:
Segmentation fault
[WRAPPER] backend exited with error status=139
Now I need to know what exactly is causing this. My Cursor agent noticed that my Prisma client and engine versions were not aligned. I fixed this by pinning both to 6.19.0 and verified the versions in production by adding logging.
However, that fix did not address the root cause. I'm still seeing periodic seg faults and exits with status 139.
Railway folks: here are the timestamps for some of the exits from our "backend" service. Can you look under the hood and give me any more clues as to the cause of the exits at these times?
---
2025-12-02T01:08:36.254122182Z [err] Segmentation fault
2025-12-02T01:08:36.254126194Z [err] [WRAPPER] backend exited with error status=139
2025-12-02T01:33:39.249690585Z [err] Segmentation fault
2025-12-02T01:33:39.249704127Z [err] [WRAPPER] backend exited with error status=139
2025-12-02T01:55:32.052545561Z [err] Segmentation fault
2025-12-02T01:55:32.052552147Z [err] [WRAPPER] backend exited with error status=139
2025-12-02T06:37:58.196049192Z [err] Segmentation fault
2025-12-02T06:37:58.196055803Z [err] [WRAPPER] backend exited with error status=139
2025-12-02T06:40:59.021172509Z [err] Segmentation fault
2025-12-02T06:40:59.021178704Z [err] [WRAPPER] backend exited with error status=139
2025-12-02T06:45:52.244560766Z [err] Segmentation fault
2025-12-02T06:45:52.244565010Z [err] [WRAPPER] backend exited with error status=139
2025-12-02T06:58:25.038102822Z [err] Segmentation fault
2025-12-02T06:58:25.038110579Z [err] [WRAPPER] backend exited with error status=139
2025-12-02T06:58:27.254112303Z [err] Segmentation fault
2025-12-02T06:58:27.254118294Z [err] [WRAPPER] backend exited with error status=139
2 days ago
I'm marking this as resolved. Thanks to @rafchik-dipstick's tip on the wrapper to catch exit codes, I was able to catch a seg-fault exit code of 139. This led to further examination of Prisma and associated interactions. There was a highly sub-optimal, redundant calling pattern in my web app that led to multiple concurrent calls fetching the same record from the database. This hit a known race-condition/concurrency bug in the Prisma library. The ultimate fix was to eliminate the unneeded concurrent calls. I'm also trying to switch to the binary version of the Prisma engine, which runs in a separate process from the JS app and apparently does not have this concurrency bug.
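The fix described above, collapsing concurrent requests for the same record into one in-flight query, can be sketched like this (a generic pattern, not the actual app code; `loadUser` stands in for the real Prisma call):

```typescript
// Generic in-flight deduplication: concurrent callers asking for the same key
// share one pending promise instead of issuing parallel DB queries.
const inFlight = new Map<string, Promise<unknown>>();

async function getOnce<T>(key: string, loader: () => Promise<T>): Promise<T> {
  const pending = inFlight.get(key);
  if (pending) return pending as Promise<T>; // reuse the in-flight query
  const p = loader().finally(() => inFlight.delete(key)); // clear when settled
  inFlight.set(key, p);
  return p;
}

// Example: three concurrent calls result in one underlying query.
let queries = 0;
const loadUser = async (id: string) => {
  queries++; // counts how many times the "database" is actually hit
  return { id };
};

async function demo() {
  const results = await Promise.all([
    getOnce("user:42", () => loadUser("42")),
    getOnce("user:42", () => loadUser("42")),
    getOnce("user:42", () => loadUser("42")),
  ]);
  console.log(queries, results.length); // 1 3
}
demo();
```

Separately, if you go the engine-swap route, Prisma's out-of-process engine is selected with `engineType = "binary"` in the `generator client` block of schema.prisma.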
Thank you @rafchik-dipstick for getting me unblocked and on a path to solving the problem!
Status changed to Solved ray-chen • 2 days ago
Status changed to Open ray-chen • 2 days ago
Status changed to Solved ray-chen • 2 days ago