a month ago
Hi Railway team -
Reaching out in need of some real help and clarity. We have our full stack deployed on Railway - 100% bought in, multiple environments, development environments, the works. We're really looking to stay on Railway and love the product, but are having production outage issues that are forcing us to reconsider.
My main production server (docker build, bun TS runtime, hono framework, straightforward HTTP api server talking to a react client and a Neon PostgreSQL db, nothing novel) has been seeing MAJOR performance issues and crashes for the past week. Screenshots of metrics and config attached.
We know that we have some queries to optimize and some places in our code that are likely triggers, and are working on it. However, we're paying for 16 GB RAM / 4x CPU instances (3 of them in production alone), and seeing absolute crashes at tiny fractions of those limits.
Code optimization aside, I'm having real trouble seeing any reason why the level of machines our code is meant to be running on would have issues with the volume (or even the spikes) we're sending their way. We have developers that can run the same applications on MUCH less powerful personal machines with no problem at all, even when intentionally triggering the worst case spikes in volume, query size, throughput, etc.
I can't see why reaching 3 GB of RAM on a supposedly 16 GB machine would cause fatal errors that don't recover.
I'm admittedly not a pro DevOps eng. I'm coming asking for your help and insight, because I really hope there's something I'm overlooking here that's causing this gap between our expectation based on config and the reality of these crashes. I'm leaving room for unexpected thresholds set by Docker, Bun, Hono, etc. limiting our processes in some way I'm not seeing. But because this all runs without a hitch on the same Docker images locally (with significantly fewer resources), I don't see that as a likely conclusion.
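For anyone checking the same question, one way to see what the container is actually allowed to use is to read the cgroup memory cap from inside it, rather than trusting the plan's advertised size. A minimal sketch, assuming cgroup v2 and a Node/Bun runtime (the path and helper name are illustrative, not anything Railway ships):

```typescript
import { readFileSync, existsSync } from "node:fs";

// Parse the raw contents of /sys/fs/cgroup/memory.max: either a byte
// count, or the literal string "max" meaning no limit is enforced.
function parseMemoryMax(raw: string): number | null {
  const trimmed = raw.trim();
  if (trimmed === "max") return null; // unlimited (or unparsable)
  const bytes = Number(trimmed);
  return Number.isFinite(bytes) ? bytes : null;
}

// cgroup v2 exposes the memory cap here; cgroup v1 uses a different layout.
const CGROUP_V2_PATH = "/sys/fs/cgroup/memory.max";

if (existsSync(CGROUP_V2_PATH)) {
  const limit = parseMemoryMax(readFileSync(CGROUP_V2_PATH, "utf8"));
  console.log(
    limit === null
      ? "no cgroup memory limit set"
      : `cgroup memory limit: ${(limit / 1024 ** 3).toFixed(1)} GiB`,
  );
} else {
  console.log("memory.max not found; container may be on cgroup v1");
}
```

If the value printed here is far below 16 GiB, the kernel will OOM-kill the process long before the dashboard's advertised ceiling, which would explain crashes at ~3 GB.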
To the best of my ability to tell, this looks a lot like our instances just don't have the resources we're being told they do. Maybe that's Railway attempting to vertically scale only when necessary, and not responding quickly enough when switching between containers; I don't know. Similar scalable services we use (Neon, for example) offer the clarity of low and high CPU thresholds, which gives us better confidence about exactly what resources are available in the best and worst case. But I could really use your help in figuring out what's going on.
Thanks in advance for your help, would be more than happy to provide any extra details I can, or even schedule a call if that could provide more insight.
4 Replies
a month ago
This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.
Status changed to Open Railway • 27 days ago
a month ago
Hey, I can absolutely work through this with you if you'd like. Sometimes an outside perspective helps. I'd love to hear back from ya!
a month ago
Okay, I can help with this; I've had a similar issue before. Just hop on a Google Meet with me and I'll solve it for you: omarabdelghany56@gmail.com. Send me an email there and I'll send you the meeting link.
a month ago
One thing I can recommend to improve the performance of your infrastructure is to migrate the database from Neon to Railway. This way you'll have lower latency if you connect your backend to the database over the private network, and you'll cut down on egress costs, saving you money, since your backend won't have to go over the public network to reach your db.
This migration is recommended if your traffic is consistent. If your traffic is not consistent (idle for some time, then hit with requests), Neon might still be architecturally better despite the latency, because Neon is serverless and scales down to zero when idle.
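To make the connection-string change concrete: moving to the private network is usually just a different hostname in the Postgres URL. A minimal sketch; the hostname `postgres.railway.internal` and the credentials are illustrative (Railway's private domains take the form `<service>.railway.internal`, check your own service's name):

```typescript
// Swap the public host in a Postgres connection URL for a private-network
// host, keeping credentials, port, and database name intact.
function toPrivateUrl(publicUrl: string, privateHost: string): string {
  const url = new URL(publicUrl);
  url.hostname = privateHost;
  return url.toString();
}

// Example: same credentials and database, routed over the private network.
const publicUrl = "postgres://app:secret@db.example.com:5432/app";
console.log(toPrivateUrl(publicUrl, "postgres.railway.internal"));
// → postgres://app:secret@postgres.railway.internal:5432/app
```

The private hostname only resolves from inside the Railway environment, so keep the public URL around for local development.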
a month ago
As for the crashes you're having, can you share the specific cause? Any error logs?