Container Memory Spikes & OOM Before 32GB – Production Service
Anonymous
FREE · OP

2 months ago

Hello Railway Team,

We are running a production service on Railway with a 32 GB RAM allocation. Under normal conditions, memory remains stable.

However, during specific workload bursts (notably when multiple clients connect in a short time window), memory usage spikes sharply (8–12 GB, sometimes exceeding 20–25 GB). In several cases, the service crashes with an Out of Memory error before reaching the full 32 GB limit.

We want to understand:

  • Could you clarify if there are any platform-level memory management policies that might terminate a container before it reaches the full 32 GB allocation?

  • Are there recommended best practices for handling sudden memory spikes to prevent unexpected OOM crashes?

  • Can you provide guidance on monitoring or configuring containers to ensure they can safely utilize the full allocated memory under bursty workloads?

We’ve attached a screenshot showing the spike pattern we’re observing in production.

This is impacting a live system, and we need to know whether the platform itself can be contributing to these early OOM events, or if the process should always be allowed to reach the full allocated memory.

Happy to provide logs, metrics, or container details if needed.

Best regards,
Huzefa Nalkheda Wala


2 months ago

Hello,

Since your app crashed due to OOM errors, it did indeed use the full 32 GB of memory.

You aren't seeing that in the memory metrics because of the memory polling frequency we use, compounded by the summing and averaging we do.

Because your memory increase is near-instantaneous, the graph missed the true maximum value. In this case, though, since there was an OOM, we know that maximum was 32 GB.
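To make the sampling effect concrete, here is a minimal sketch of how a spike can fall between polls. All numbers here are illustrative assumptions, not Railway's actual polling intervals:

```python
# Illustrative numbers only: one "true" memory reading per second, in GB,
# with a two-second burst to 32 GB that sits between polls.
true_memory = [4.0] * 120
true_memory[50] = 32.0
true_memory[51] = 32.0

POLL_INTERVAL = 15  # assumed polling period in seconds, not Railway's real one

samples = true_memory[::POLL_INTERVAL]  # readings the metrics poller actually sees
graphed_peak = max(samples)             # peak as shown on the graph
true_peak = max(true_memory)            # peak the kernel actually saw

print(f"graphed peak: {graphed_peak} GB")  # 4.0 -- the spike fell between polls
print(f"true peak:    {true_peak} GB")     # 32.0 -- what triggered the OOM kill
```

Averaging the samples into a chart point hides the burst even further, which is why the graph can show a calm line right up to the moment of the kill.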

All in all, I can assure you there are no early OOM events here; it's just uncharted data.

Best,
Brody


Status changed to Awaiting User Response Railway about 2 months ago


Anonymous
FREEOP

2 months ago

Hello Brody,

Thanks for the clarification. We understand that the graph may miss instantaneous peaks and that, in OOM cases, the process does indeed reach the full 32 GB.

However, the core problem for us is not just where the OOM happens, but how the memory grows in practice.

In real production traffic, we repeatedly observe patterns like:

  • Long periods of stable memory usage

  • Sudden jumps of 8–10 GB within seconds

  • Sometimes a return to baseline

  • Other times, continued growth until the full 32 GB is exhausted and OOM occurs

These are not gradual ramps—they are abrupt, burst-driven allocations tied to real user events (notably when multiple clients connect or link in a short window). From an operational standpoint, this is extremely risky: there is no opportunity for mitigation, autoscaling, or graceful handling. A single burst can take down the service.
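One generic application-level mitigation for this pattern, assuming the spikes really are driven by many clients connecting at once, is to bound how many expensive connection setups run concurrently so their allocations cannot all land in the same second. This is only a sketch; the handler, the placeholder allocation, and the limit of 8 are assumptions, not details from the service above:

```python
import asyncio

MAX_CONCURRENT_SETUPS = 8  # assumed limit; tune so peak memory stays bounded

async def handle_client_connect(gate: asyncio.Semaphore, client_id: str) -> str:
    # Only MAX_CONCURRENT_SETUPS clients run their expensive setup at once;
    # the rest queue here instead of all allocating simultaneously.
    async with gate:
        buffer = bytearray(1024)   # stand-in for the real per-client allocation
        await asyncio.sleep(0)     # stand-in for real setup I/O
        return f"{client_id}: ready ({len(buffer)} bytes reserved)"

async def serve_burst(n_clients: int) -> list[str]:
    gate = asyncio.Semaphore(MAX_CONCURRENT_SETUPS)
    return await asyncio.gather(
        *(handle_client_connect(gate, f"client-{i}") for i in range(n_clients))
    )

print(f"handled {len(asyncio.run(serve_burst(20)))} clients with bounded concurrency")
```

The trade-off is latency for the queued clients, but a burst that takes seconds longer to serve is usually preferable to one that takes the service down.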

So while it’s reassuring that there are no “early” OOMs at the platform level, the behavior is still problematic in production:

  • A single workload spike can consume tens of GB in seconds

  • The system becomes fragile under real-world burst patterns

  • Even with 32 GB allocated, stability cannot be guaranteed

We are actively working with the application vendor to reduce and bound this behavior, but we wanted to highlight that the issue is not just observability—it’s the severity and speed of these memory surges.

If Railway has any mechanisms, tooling, or best practices for handling extremely bursty memory profiles (for example, finer-grained alerting, hooks before OOM, or runtime-level guardrails), guidance would be very valuable for us.

Best regards,
Huzefa


Status changed to Awaiting Railway Response Railway about 2 months ago


2 months ago

Hello,

I'm sorry to say we wouldn't be able to help here unless you're interested in the Enterprise plan, which would let you allocate 48 GB of memory to that service.

Otherwise, the platform cannot do anything to prevent your app from using the available memory. Alerting wouldn't work, as the increase is instant, and hooks before OOM aren't a thing on any platform, let alone ours, unfortunately.

The only practical solution here would be to audit your application and figure out what it is doing to cause such a spike in memory consumption, and then fix it. That isn't something we can provide help for, as our platform has no observability into your application, nor are we able to provide application-level support in general.
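For that audit, Python's standard-library `tracemalloc` is one possible starting point for finding which code paths allocate the most during a burst. The suspect function below is a placeholder standing in for whatever the real application does on client connect:

```python
import tracemalloc

def suspect_burst_path() -> list[bytes]:
    # Placeholder: pretend each connecting client allocates a 1 MB buffer.
    return [bytes(1_000_000) for _ in range(20)]

tracemalloc.start()
before = tracemalloc.take_snapshot()
data = suspect_burst_path()
after = tracemalloc.take_snapshot()

# Diff the snapshots to see which source lines allocated the most new memory.
for stat in after.compare_to(before, "lineno")[:3]:
    print(stat)

current, peak = tracemalloc.get_traced_memory()
print(f"peak traced memory: {peak / 1e6:.1f} MB")
tracemalloc.stop()
```

Running a probe like this in a staging environment while replaying a connect burst should point at the lines responsible for the 8-10 GB jumps.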

I wish you all the best as you debug your application!

Best,
Brody


Status changed to Awaiting User Response Railway about 2 months ago


Status changed to Solved Anonymous about 2 months ago


Anonymous
FREEOP

a month ago

Hey, we are facing the same issues. Can we have a meeting with you guys?


Status changed to Awaiting Railway Response Railway about 1 month ago


a month ago

Our Enterprise offering would get you more memory; it starts at a commitment of $2,000 a month for a year.

If you are interested in that, I can get you in touch with sales. If you aren't, that's okay, but we would not be able to offer raised memory limits on the Pro plan.


Status changed to Awaiting User Response Railway about 1 month ago


Railway
BOT

a month ago

This thread has been marked as solved automatically due to a lack of recent activity. Please re-open this thread or create a new one if you require further assistance. Thank you!

Status changed to Solved Railway about 1 month ago

