2 months ago
Hello Railway Team,
We are running a production service on Railway with a 32 GB RAM allocation. Under normal conditions, memory remains stable.
However, during specific workload bursts (notably when multiple clients connect in a short time window), memory usage spikes sharply (8–12 GB, sometimes exceeding 20–25 GB). In several cases, the service crashes with an Out of Memory error before reaching the full 32 GB limit.
We want to understand:
Could you clarify if there are any platform-level memory management policies that might terminate a container before it reaches the full 32 GB allocation?
Are there recommended best practices for handling sudden memory spikes to prevent unexpected OOM crashes?
Can you provide guidance on monitoring or configuring containers to ensure they can safely utilize the full allocated memory under bursty workloads?
We’ve attached a screenshot showing the spike pattern we’re observing in production.
This is impacting a live system, and we need to know whether the platform itself can be contributing to these early OOM events, or if the process should always be allowed to reach the full allocated memory.
Happy to provide logs, metrics, or container details if needed.
Best regards,
Huzefa Nalkheda Wala
2 months ago
Hello,
Since your app crashed due to OOM errors, it did indeed use the full 32 GB of memory.
You aren't seeing that in the memory metrics because of the memory polling frequency we use, compounded by the summing and averaging we do.
So, because the memory increase is near-instantaneous, the graph missed the true maximum value; in this case, though, since there was an OOM, we know the maximum was 32 GB.
All in all, I can assure you there are no early OOM events here; it's just uncharted data.
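To illustrate the sampling effect with made-up numbers (a toy sketch, not our real metrics pipeline; the window size and values are invented for the example):

```python
# Toy illustration: a 2-second memory burst, sampled once per second but
# charted as 15-second averages, never shows its true peak on the graph.
window = 15  # assumed averaging window, in samples (illustrative only)

# 60 seconds of memory readings in GB: flat at 6 GB with a 2-second burst to 30 GB.
series = [6.0] * 60
series[30] = series[31] = 30.0

# Downsample by averaging each window, as a metrics pipeline might.
averaged = [sum(series[i:i + window]) / window
            for i in range(0, len(series), window)]

print(max(series))    # true peak: 30.0
print(max(averaged))  # what the chart would show: 9.2
```

The burst is fully visible in the raw samples but almost disappears after averaging, which is why an OOM can be the only reliable evidence of the true peak.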
Best,
Brody
Status changed to Awaiting User Response Railway • about 2 months ago
2 months ago
Hello Brody,
Thanks for the clarification. We understand that the graph may miss instantaneous peaks and that, in OOM cases, the process does indeed reach the full 32 GB.
However, the core problem for us is not just where the OOM happens, but how the memory grows in practice.
In real production traffic, we repeatedly observe patterns like:
Long periods of stable memory usage
Sudden jumps of 8–10 GB within seconds
Sometimes a return to baseline
Other times, continued growth until the full 32 GB is exhausted and OOM occurs
These are not gradual ramps—they are abrupt, burst-driven allocations tied to real user events (notably when multiple clients connect or link in a short window). From an operational standpoint, this is extremely risky: there is no opportunity for mitigation, autoscaling, or graceful handling. A single burst can take down the service.
So while it’s reassuring that there are no “early” OOMs at the platform level, the behavior is still problematic in production:
A single workload spike can consume tens of GB in seconds
The system becomes fragile under real-world burst patterns
Even with 32 GB allocated, stability cannot be guaranteed
We are actively working with the application vendor to reduce and bound this behavior, but we wanted to highlight that the issue is not just observability—it’s the severity and speed of these memory surges.
If Railway has any mechanisms, tooling, or best practices for handling extremely bursty memory profiles (for example, finer-grained alerting, hooks before OOM, or runtime-level guardrails), guidance would be very valuable for us.
Best regards,
Huzefa
Status changed to Awaiting Railway Response Railway • about 2 months ago
2 months ago
Hello,
I'm sorry to say we wouldn't be able to help here unless you are interested in the Enterprise plan, which would let you allocate 48 GB of memory to that service.
Otherwise, the platform cannot do anything to prevent your app from using the available memory. Alerting wouldn't work, as the increase is instant, and hooks before OOM aren't a thing on any platform, let alone ours, unfortunately.
The only practical solution here would be to audit your application and figure out what it is doing to cause such a spike in memory consumption, and then fix it. That isn't something we can provide help for, as our platform has no observability into your application, nor are we able to provide application-level support in general.
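If your service happens to be Python-based (an assumption on my part), the standard library's tracemalloc can show which call sites own the memory during a burst, for example:

```python
import tracemalloc

# Keep 25 stack frames per allocation so the tracebacks are actually useful.
tracemalloc.start(25)

# ... trigger the bursty code path here; the list below is just a
# stand-in workload for the example ...
blob = [bytes(1024) for _ in range(10_000)]

# Rank allocation sites by total size to find the culprit.
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:5]:
    print(stat)
```

Comparable tools exist for other runtimes (heap profilers, allocation samplers); the point is that this kind of audit has to happen inside the application, where the platform has no visibility.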
I wish you all the best as you debug your application!
Best,
Brody
Status changed to Awaiting User Response Railway • about 2 months ago
Status changed to Solved Anonymous • about 2 months ago
a month ago
Hey, we are facing the same issues. Can we have a meeting with you?
Status changed to Awaiting Railway Response Railway • about 1 month ago
a month ago
Our Enterprise offering includes more memory; it starts at a commitment of $2,000 a month for a year.
If you are interested in that, I can get you in touch with sales. If you aren't, that's okay, but we would not be able to offer raised memory limits on the Pro plan.
Status changed to Awaiting User Response Railway • about 1 month ago
a month ago
This thread has been marked as solved automatically due to a lack of recent activity. Please re-open this thread or create a new one if you require further assistance. Thank you!
Status changed to Solved Railway • about 1 month ago