3 months ago
Hey team,
We’re running a busy MongoDB cluster on Railway, and the kernel OOM killer is terminating mongod when memory spikes hit the hard cap. Lowering the WiredTiger cache hasn’t helped, since the page cache and working memory fill the gap.
Key Findings:
Memory: Hard cap ~32 GB (memory.max=32000000000), no soft limit (memory.high=max), swap disabled. Memory events show >500k hard-cap hits.
THP: Enabled (always) - MongoDB recommends disabling it due to latency and memory overhead.
Disk Readahead: Likely above MongoDB’s recommended 8–32 sectors, which can waste memory.
Impact: No room for kernel reclamation, causing immediate OOM kills. SSH often fails or drops due to memory pressure.
This didn’t happen on Atlas since they tune memory.high, provide swap, and disable THP by default.
Without these settings, the container memory fills completely regardless of our MongoDB tuning. With no soft cap or swap to give the kernel room to reclaim memory, any spike pushes us over the limit and the OOM killer immediately terminates mongod.
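To make the numbers above concrete, here's a minimal diagnostic sketch (Python) we use to dump the relevant settings from inside the container. It assumes cgroup v2 is mounted at /sys/fs/cgroup and uses sda as a placeholder device name, so adjust for the actual data volume:

```python
#!/usr/bin/env python3
"""Minimal diagnostic sketch for the values listed above.

Assumes cgroup v2 mounted at /sys/fs/cgroup (the container's own cgroup)
and a data disk at /dev/sda - the device name is a placeholder.
"""
from pathlib import Path


def read(path: str) -> str:
    try:
        return Path(path).read_text().strip()
    except OSError as exc:
        return f"<unreadable: {exc}>"


# Hard cap, soft cap, and swap limit for this container's cgroup
print("memory.max      :", read("/sys/fs/cgroup/memory.max"))
print("memory.high     :", read("/sys/fs/cgroup/memory.high"))
print("memory.swap.max :", read("/sys/fs/cgroup/memory.swap.max"))

# memory.events includes the 'max' counter (hard-cap hits) and 'oom_kill'
print("memory.events   :\n" + read("/sys/fs/cgroup/memory.events"))

# Transparent Huge Pages mode - MongoDB recommends 'never'
print("THP             :", read("/sys/kernel/mm/transparent_hugepage/enabled"))

# Readahead for the data disk, in KB (8-32 sectors = 4-16 KB)
print("read_ahead_kb   :", read("/sys/block/sda/queue/read_ahead_kb"))
```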
Questions:
Can we set memory.high (e.g. 24–26 GB) ourselves, or is this host-level?
Can we enable small swap (2–4 GB), or does Railway need to configure this?
Can THP be disabled (never) via our start script, or does this require host-level changes?
What’s the current disk readahead value, and can we set it to 8–32 sectors ourselves?
Getting these tuned should stabilize our cluster and prevent OOM terminations. Curious if we're able to do this.
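For context on the THP and readahead questions, this is roughly what we'd try from our start script if the paths were writable. It's only a sketch: in an unprivileged container we expect these writes to fail with a permission error, which is exactly what we're asking about, and the sda device name is again a placeholder:

```python
#!/usr/bin/env python3
"""Sketch of the start-script tuning we'd attempt before launching mongod.

In an unprivileged container these writes will likely fail with a
permission or read-only error; the device name 'sda' is a placeholder.
"""
from pathlib import Path


def try_write(path: str, value: str) -> None:
    try:
        Path(path).write_text(value)
        print(f"set {path} = {value}")
    except OSError as exc:
        print(f"could not set {path}: {exc}")


# MongoDB recommends Transparent Huge Pages set to 'never'
try_write("/sys/kernel/mm/transparent_hugepage/enabled", "never")
try_write("/sys/kernel/mm/transparent_hugepage/defrag", "never")

# 16 KB readahead = 32 x 512-byte sectors, the top of MongoDB's recommended range
try_write("/sys/block/sda/queue/read_ahead_kb", "16")
```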
21 Replies
3 months ago
!t
3 months ago
This thread has been escalated to the Railway team.
Status changed to Awaiting Railway Response noahd • 3 months ago
3 months ago
If I remember correctly, your Mongo clusters were the ones with ~500 GB of data, correct?
3 months ago
Sounds good. I don't have the best answer for this, so I escalated it to the team.
They'll be able to provide much better and more accurate help!
3 months ago
Of course!
3 months ago
Since we are currently using a container-based runtime with a shared kernel, everything you mentioned is going to be a host-level setting that we cannot change, as that would affect all containers on that particular host.
These issues will be solved by our runtime v3, which is VM-based, so you'll be able to tune these settings yourself. Until then, unfortunately, the most we can do for you is offer you 48 GB of RAM on Enterprise.
And shoot, that's basically doubling our bill overnight for Enterprise just to experiment with more memory, which may not even solve this.
3 months ago
IIRC, we are aiming for the end of the next quarter, but that's just an estimate.
3 months ago
Yeah, it's not ideal, and it might be odd to hear from an employee, but for this specific workload it might make sense to run it offsite.
We unfortunately just did a long, expensive migration here.
Will have to consider what to do.
3 months ago
It's set, set to max
3 months ago
No problem, sorry we couldn't do more for you here.
Status changed to Solved brody • 3 months ago
3 months ago
!s