MongoDB service being hard-killed by platform — no SIGTERM, no resource pressure
kolbertistvan2
PRO · OP

22 days ago

Project: kolbertai-librechat-new

Project ID: 82d941c6-d0e9-4f6d-944b-b149fa174684

Service: MongoDB (ID: 463ce54c-654e-4b03-96a0-02e8aa94a6f9)

Environment: production

Region: europe-west4-drams3a

Hi Railway team,

My MongoDB service is being hard-killed by the platform every 30-90 seconds with no graceful shutdown. The mongod process logs NO shutdown event: no SIGTERM, no SIGINT, no "received signal". The next thing in the logs is a brand-new container starting up with a new hostname, mounting the same volume.

Example from today (UTC):

17:08:13 mongod startup complete (container: previous host)
17:08:27 connection accepted from LibreChat (Mongoose)
17:08:37 another connection accepted
         ↓ ~35 seconds, no shutdown log AT ALL
17:09:12 "Mounting volume on: /var/lib/containers/railwayapp/bind-mounts/..."
         New MongoDB starting, host: f26a59bc81ec (NEW container ID)

Each restart logs "Detected unclean shutdown - Lock file is not empty" because the previous mongod was SIGKILL-ed without cleanup.

Resource utilization (Pro plan):

  • CPU: 0 vCPU avg, max 0.01 vCPU (limit 32 vCPU)
  • Memory: 187 MB avg, 369 MB max (limit 32 GB)
  • Disk: 2.08 GB used / 48.8 GB volume
  • Network: 0 MB public traffic

So this is NOT OOM, NOT CPU starvation, NOT disk full.

This service was stable for ~6 weeks (from 2026-04-07). The problem started around 2026-05-20 and has gotten progressively worse: initially hours between crashes, now under a minute. Restarting via `railway redeploy` only buys 1-5 minutes before the next platform-level kill.

Mongo version: 8.2.9 (mongo:latest)

Start command: docker-entrypoint.sh mongod --ipv6 --bind_ip ::,0.0.0.0

Healthcheck: none configured

Restart policy: ON_FAILURE, maxRetries: 10

Single instance, single region (europe-west4-drams3a)

Could you investigate what's killing the container at the platform level? Is there a host rebalancing operation, volume migration, or health probe causing this? The LibreChat app depends on this database and is currently unstable as a result.

Volume bind-mount path I see in logs:

/var/lib/containers/railwayapp/bind-mounts/7087d3d1-350f-49ac-9b93-f034df2154ca/vol_4v9qoj7y0ncoix7r

Thanks!

$20 Bounty

2 Replies

Status changed to Awaiting Railway Response Railway 22 days ago


Railway
BOT

21 days ago

This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.

Status changed to Open Railway 21 days ago


sheeki03
FREE

21 days ago

I would separate two things: what Railway can confirm internally, and what you can rule out from the service side.

Because this service has no healthcheck configured and the restart policy is ON_FAILURE, this does not look like a normal app healthcheck loop. Railway should be restarting it only because the container process exits, or because something below mongod kills the container. The missing MongoDB shutdown line makes a SIGKILL-style stop plausible, but the runtime event should still have an exit code.

The first thing I would check is the exit code for the container that died between 17:08 and 17:09.

If it is 137, treat it as SIGKILL or a cgroup/platform-level kill, even if the service metrics do not show normal memory pressure.

If it is 143, it received SIGTERM somewhere.

If it is 1, 2, or 100, then mongod is exiting itself and the useful log line is probably earlier in the MongoDB output.
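
A minimal sketch of that mapping in Python, in case it helps when reading the runtime event. It assumes the common container convention that an exit code above 128 means "terminated by signal (code - 128)", and that it runs on a Linux/POSIX host; the function name is only illustrative.

```python
import signal

# Exit codes this thread treats as "mongod exited on its own".
MONGOD_SELF_EXIT = {1, 2, 100}

def classify_exit_code(code: int) -> str:
    """Map a container exit code to a rough cause (illustrative helper)."""
    if code == 0:
        return "clean exit"
    if code in MONGOD_SELF_EXIT:
        return "mongod exited on its own; check earlier MongoDB log lines"
    if code > 128:
        # Convention: 128 + N means the process was terminated by signal N.
        try:
            name = signal.Signals(code - 128).name
        except ValueError:
            return f"unknown signal {code - 128}"
        return f"killed by {name}"
    return f"unclassified exit code {code}"

for code in (137, 143, 100):
    print(code, "->", classify_exit_code(code))
```

On Linux, 137 decodes to SIGKILL (128 + 9) and 143 to SIGTERM (128 + 15), which matches the cases above.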

Two service-side changes are worth making while Railway checks the host/container event:

  1. Stop running the database on mongo:latest. Pin the image to the exact MongoDB version you want, ideally the version that was running during the stable period. Docker Hub shows library/mongo:latest was updated recently, and your timeline starts after that, so pinning removes one big variable for a database service.

  2. Take a volume backup or snapshot before repeated unclean shutdowns continue. A "Detected unclean shutdown" restart is recoverable until it is not, so I would protect the data first and debug second.

I would also run one controlled redeploy with the simplest MongoDB command, without the custom IPv6 bind flags, using the official image's normal bind behavior or --bind_ip_all. If that stabilizes it, the custom command or IPv6 path is involved. If it still dies with the same exit code and no MongoDB shutdown line, then the cause is probably below mongod and Railway needs to inspect the host/container runtime event for that exact service and time window.


kolbertistvan2
PRO · OP

10 days ago

Hi team,

Following up on my earlier ticket about the MongoDB service in kolbertai-librechat-new being hard-killed without SIGTERM.

I've applied all the recommendations from your previous response:

  1. ✅ Pinned image to specific version (mongo:8.2.9 by tag, was mongo:latest)

  2. ✅ Simplified the start command: removed --ipv6 --bind_ip ::,0.0.0.0 and now using docker-entrypoint.sh mongod --bind_ip_all per your suggestion

  3. ✅ Bumped restart policy to ALWAYS / max 1000 retries

The image pin helped reduce the crash frequency (previously every 30-60 seconds, now intermittent over hours/days), but the issue is NOT resolved. The container is STILL being hard-killed.

Latest evidence — TODAY 2026-06-02:

  • 09:09 UTC: Mongo crashed (deployment 5db22f80) → manual redeploy
  • 10:54 UTC: I redeployed (deployment 7df638ea, current)
  • 10:55:01 UTC: mongod startup complete
  • 10:56:00 UTC: ANOTHER unclean shutdown + restart, only 59 seconds later
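
For reference, the uptime figure is simply the gap between the startup and restart timestamps. A small Python helper for computing it across other restarts, assuming both times are same-day UTC in HH:MM:SS form (the function name is hypothetical):

```python
from datetime import datetime

def seconds_between(start: str, end: str) -> int:
    """Seconds from start to end, both 'HH:MM:SS' on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())

# Startup complete at 10:55:01, next unclean restart at 10:56:00.
print(seconds_between("10:55:01", "10:56:00"))  # 59
```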

The MongoDB process still logs "Detected unclean shutdown - Lock file is not empty" on every startup, meaning the previous mongod was SIGKILL'd without graceful shutdown. No "received signal" or "shutting down" lines appear in the MongoDB logs before each kill; this is strong evidence that the kill comes from the platform, not from mongod itself.
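
This is roughly how I'm checking the logs, as a sketch rather than a definitive tool. It does plain substring matching on the phrases quoted above; the exact wording and format of real mongod log lines may differ, and the sample lines are made up for illustration.

```python
# Phrases quoted in this thread; real mongod log wording may differ.
UNCLEAN_MARKER = "detected unclean shutdown"
SHUTDOWN_MARKERS = ("received signal", "shutting down")

def count_shutdown_evidence(lines):
    """Return (unclean_startups, graceful_shutdown_lines) for the given log lines."""
    unclean = 0
    graceful = 0
    for line in lines:
        lower = line.lower()
        if UNCLEAN_MARKER in lower:
            unclean += 1
        elif any(marker in lower for marker in SHUTDOWN_MARKERS):
            graceful += 1
    return unclean, graceful

# Hypothetical sample: two startups, the second after an unclean stop.
sample = [
    "mongod startup complete",
    "connection accepted",
    "Detected unclean shutdown - Lock file is not empty",
    "mongod startup complete",
]
print(count_shutdown_evidence(sample))  # (1, 0)
```

Unclean startups with zero graceful-shutdown lines is the pattern I see on every restart.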

Project / service details:

  • Project: kolbertai-librechat-new (id: 82d941c6-d0e9-4f6d-944b-b149fa174684)

  • Service: MongoDB (id: 463ce54c-654e-4b03-96a0-02e8aa94a6f9)

  • Environment: production (id: dc28c69b-5f26-48c5-abff-b97d80ecb21b)

  • Region: europe-west4-drams3a

  • Current deployment: 7df638ea-4606-4f0d-bdef-f04163380c64

  • Plan: Pro

  • Resources: CPU 0%, RAM 200-369 MB (limit 32 GB), Disk 2 GB / 48 GB; NO resource pressure

Please look up the container/host events for deployment IDs:

  • 7df638ea (today 10:54 UTC, the current one)
  • 5db22f80 (today 09:09 UTC)
  • db8705e5 (2026-05-25 12:10 UTC)
  • de13b2d7 (2026-05-23 21:40 UTC — was stable ~34h then crashed)

The exit code of those killed containers would tell us:

  • 137 = SIGKILL from cgroup / platform
  • 143 = SIGTERM
  • 1/2/100 = mongod self-exit

If it's 137 with no resource pressure on our side, the kill is coming from your platform's host management (rebalancing, migrations, or similar). We need a stable host placement or at least an explanation of what's triggering these kills so we can adapt.

LibreChat (the app depending on this DB) is in production use; every Mongo crash takes the chat down for several minutes until I manually redeploy.

Thanks!

