High memory and CPU usage from Dec 23-30

artivilla

HOBBY • OP

6 months ago

It's unclear what caused this spike in CPU and memory usage on these dates. The Next.js app has had no traffic, since I haven't been working on it. How can I tell on which dates the 'Metal Build Environment' was enabled? I've turned it off for now to see whether it was causing the issue.

I also noticed very high error rates during that period. How do I debug what's causing them?

Solved • $10 Bounty

8 Replies

peacocksir

HOBBY

6 months ago

Railway introduced metal environments at the start of 2025, and by May they had started rolling them out to most services.

Railway is migrating all services to metal by default, so you can't completely opt out of it.

You can't really check when it was applied to your specific service, but if you didn't turn it off (in the service settings, under the builder option), it was most likely used to build your service.

The metal build environment shouldn't increase usage much, if at all (from what I know), so it's probably something in the app.

If your app was running into a lot of issues, or running many background processes during that time, that could contribute to the increase.

The errors are probably the main factor: if your service is restarting repeatedly or even redeploying, that would increase usage.

If you can share more info, such as error logs, screenshots of the usage graphs, or other settings, that would help.

artivilla

HOBBY • OP

6 months ago

Looking at the error rate, memory and CPU don't seem to be in sync with it. The deployment happened on Jan 2nd, and before that I hadn't made any changes in over a month. What should I be searching for in my logs?

peacocksir

HOBBY

6 months ago

OK, thanks for the info. Since you hadn't changed anything, it's probably one of the following:

Crashed on startup

Exceeded usage limits before being killed

Failed to bind to port

Exited due to an exception

Search the logs for these terms to narrow down the cause:

SIG

exit

killed

PORT

Unhandled

Anything else that looks unusual
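If you can export or copy the logs as plain text, a quick scan for the keywords above can narrow things down. This is a minimal sketch, not a Railway tool: it assumes one log entry per line, and it uses "smart case" matching (as in ripgrep), so all-lowercase keywords such as exit match any case while SIG or PORT must match exactly, which avoids noise from words like "report" or "signal". The function names are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Keywords suggested in the thread.
KEYWORDS = ["SIG", "exit", "killed", "PORT", "Unhandled"]


def compile_keyword(keyword: str) -> re.Pattern:
    # "Smart case" (as in ripgrep): all-lowercase keywords match any case,
    # keywords containing uppercase letters must match exactly.
    flags = re.IGNORECASE if keyword.islower() else 0
    return re.compile(re.escape(keyword), flags)


def scan_log_lines(lines: Iterable[str], keywords: list[str] = KEYWORDS) -> tuple[Counter, list[str]]:
    """Count keyword hits and collect every line that matched at least one keyword."""
    patterns = {kw: compile_keyword(kw) for kw in keywords}
    counts: Counter = Counter()
    hits: list[str] = []
    for line in lines:
        matched = [kw for kw, pattern in patterns.items() if pattern.search(line)]
        counts.update(matched)  # one count per keyword found on this line
        if matched:
            hits.append(line.rstrip("\n"))
    return counts, hits
```

Passing an open file (for example, a hypothetical deploy-logs.txt export) works too, since a file object iterates line by line.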

Also, check the HTTP logs to see if anything shows up there.

One thing I can think of is that you might be running next dev instead of next start, so make sure the service is being started correctly. Redeploying restarts the service and resets its state, so it would temporarily mask the problem, but the issues would eventually come back if you were running in dev mode.
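One quick way to rule this out is to look at the start script in package.json. The sketch below assumes the service launches via the "start" script; if a custom start command is configured in the service settings, check that command instead. The function name is hypothetical.

```python
import json


def start_script_uses_next_dev(package_json_text: str) -> bool:
    """Return True if the package.json "start" script runs the Next.js dev server."""
    scripts = json.loads(package_json_text).get("scripts", {})
    # `next dev` is the development server; in production the app should be
    # built with `next build` and served with `next start`.
    return "next dev" in scripts.get("start", "")
```

For example, a manifest whose scripts are {"build": "next build", "start": "next start"} passes the check, while {"start": "next dev"} is flagged.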

If you can send the logs from around December 29, or anything notable from that period, that would help.

yusufmo1

PRO

6 months ago

High CPU/memory plus a high error rate that all cleared up after a redeploy is classic crash-loop behavior. When your app crashes and Railway restarts it, it burns resources during startup, fails again, restarts again, and so on. This would explain why the metrics don't line up perfectly with the error rate: the resource usage comes from repeated startup attempts, not from handling requests.
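To check whether restarts actually cluster the way a crash loop would, you can count restart events within a sliding time window. This is a minimal sketch assuming you have already pulled restart timestamps out of the deploy logs (for example, the times of container start lines); the 10-minute window and 3-restart threshold are arbitrary, hypothetical defaults.

```python
from datetime import datetime, timedelta


def is_crash_looping(
    restarts: list[datetime],
    window: timedelta = timedelta(minutes=10),
    threshold: int = 3,
) -> bool:
    """True if at least `threshold` restarts fall within some span of length `window`."""
    times = sorted(restarts)
    start = 0
    for end, t in enumerate(times):
        # Advance the left edge until the span [times[start], t] fits inside `window`.
        while t - times[start] > window:
            start += 1
        if end - start + 1 >= threshold:
            return True
    return False
```

A two-pointer window keeps this linear after sorting, so it stays cheap even over a week of log timestamps.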

A few things to check. First, what version of Next.js are you running? There were known memory leaks in 15.1.x that were patched in 15.1.7. If you're on an affected version, your app could have slowly accumulated memory until it hit the limit and started crash-looping. Second, search your deploy logs from that period for "exit status 137" or "Killed"; that's the OOM killer's signature. You can also search for "SIGTERM" to find restart events. Railway's log explorer lets you filter by date range, which should help.
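The 137 comes from a shell convention: a process terminated by signal N reports exit status 128 + N, so 137 is signal 9 (SIGKILL, which the Linux OOM killer sends) and 143 is signal 15 (SIGTERM). A small decoder, assuming a POSIX platform where Python's signal module defines these signals; the function name is hypothetical.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Explain a shell-style exit status, e.g. 137 or 143."""
    # Shell convention: a process terminated by signal N reports 128 + N.
    if 128 < status < 256:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number this platform knows about.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with code {status}"
```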

brody

EMPLOYEE

6 months ago

Please note that the metal build environment has absolutely nothing to do with the runtime environment, they are completely separate mechanisms.

artivilla

HOBBY • OP

6 months ago

Got it. It's gone back down, so I'm just going to watch whether this happens again.

crisog

PRO

6 months ago

What is your Next.js version? If it's a vulnerable version, it's important to upgrade, since the vulnerability might be related to your high memory and CPU usage. You can read more here.
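For ordinary release versions, checking whether an installed version predates the first patched release is a plain numeric comparison of MAJOR.MINOR.PATCH. This is a hypothetical sketch: the minimum patched version is an input you would take from the relevant advisory (the thread doesn't name one), pre-release versions such as canaries are rejected rather than handled, and you should compare the version actually installed, not the range written in package.json.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Parse a plain release version like '15.2.4' (optional leading 'v').

    Raises ValueError for anything else, including pre-release versions.
    """
    major, minor, patch = (int(part) for part in version.lstrip("v").split("."))
    return major, minor, patch


def needs_upgrade(installed: str, minimum_patched: str) -> bool:
    """True if `installed` sorts before the first patched release."""
    # Tuples compare element by element: major, then minor, then patch.
    return parse_version(installed) < parse_version(minimum_patched)
```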

artivilla

HOBBY • OP

6 months ago

That makes sense: five days ago I upgraded next from ^15.2.4 to ^15.5.9, so the spike was likely related to the vulnerability.
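Note that ^15.2.4 is a caret range, not a pinned version: for major versions of 1 or higher, npm treats it as >=15.2.4 <16.0.0, so the version actually installed depends on the lockfile. A minimal sketch of that rule, covering only major >= 1 and plain release versions (the helper names are hypothetical):

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Parse a plain release version like '15.2.4'."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def satisfies_caret(version: str, caret_range: str) -> bool:
    """Check a plain release version against an npm caret range such as '^15.2.4'."""
    if not caret_range.startswith("^"):
        raise ValueError("expected a caret range such as '^15.2.4'")
    base = parse_version(caret_range[1:])
    if base[0] == 0:
        # ^0.x.y ranges are narrower and are not modelled in this sketch.
        raise NotImplementedError("caret ranges with major version 0 are not handled")
    # For major >= 1, ^X.Y.Z means >= X.Y.Z and < (X+1).0.0.
    upper = (base[0] + 1, 0, 0)
    return base <= parse_version(version) < upper
```

So a fresh install without a lockfile could already have resolved ^15.2.4 to a newer 15.x release; the lockfile is what records the version that was really deployed.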

Status changed to Solved crisog • 6 months ago
