Memory leak specific to Metal.

milestonetech
PRO

a month ago

We've been kicking the tires for a while and have started looking at bringing some production workloads in. Just as we began to do so, we started getting container crashes. We noticed we had been auto-converted to Metal (Legacy had been working fine for a while) and, sure enough, if I swap between Legacy and Metal, Legacy performs as expected while Metal never garbage-collects the Java heap and crashes almost immediately.

Specifically, the image is customized for OpenJDK 17 + nginx; we migrated to Java 17 on advice from Claude 4 Thinking. I've been all over the place on this issue at this point and keep hitting dead ends.

We've set the Java heap to fixed amounts, to percentages of available memory, and so on (roughly along the lines of the sketch below). I haven't dug deep enough to say, 1000%, that there isn't some auxiliary service causing the leak, but the Java build is the only thing running inside the container/Docker build, apart from a lightweight Node.js utility that I can't begin to imagine would have an impact.
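For reference, the heap configurations we cycled through looked roughly like this (illustrative values, not our exact production settings):

# fixed heap size
export JAVA_OPTS="-Xms512m -Xmx4096m"

# heap sized as a percentage of the container's memory limit
export JAVA_OPTS="-XX:InitialRAMPercentage=25.0 -XX:MaxRAMPercentage=75.0"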

Nonetheless, it works perfectly on Legacy. Is there any advice or anything I should try that's specific to Metal? I have very little understanding of what's happening behind the curtain on the Metal instances.

Attaching an observability screenshot for your reference. Notice it performing correctly at 7:20 PM: it does a garbage collection and drops back below 1 GB, all on Legacy. I redeploy to Metal at 7:40; note the immediate climb, with no reduction, until it crashes at 7:50. Redeploy to Legacy, and it's back to normal. Same behavior every time in between.

Thanks for any thoughts!

Attachments

Solved · $50 Bounty

5 Replies

a month ago

Heya, it's possible we have different JVM defaults on Metal - I'll flag it with the team. What are your -Xms and -Xmx flags set to?
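If it's easier, something like this inside the container would show what the JVM actually resolves on each environment (just a sketch; any stock JDK 17 image should have it):

java -XX:+PrintFlagsFinal -version | grep -Ei 'MaxHeapSize|InitialHeapSize|MaxRAMPercentage'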


Status changed to Awaiting User Response Railway about 1 month ago


milestonetech
PRO

a month ago

Hey Rems, just FYI, I'm still monitoring the situation. I got a notification that Railway completed the move to all-Metal, my test containers were upgraded from Legacy again, and since that happened the memory may be behaving normally. Give me another 3 days to respond and, fingers crossed, we can close the ticket if the volatility doesn't return. Otherwise I'll provide the -Xms/-Xmx values. Thanks!!


Status changed to Awaiting Railway Response Railway about 1 month ago


milestonetech
PRO

a month ago

Hi Rems, unfortunately I don't think it's resolved. You'll note in this screenshot that the shelf right after the June 30th drop-off is when Railway transitioned us back to Metal. Memory shot up pretty much immediately from there and hasn't come back down with a garbage collection since.

Here are my heap settings:

Jun 30 08:05:19  export CATALINA_OPTS="-Xms512m -Xmx4096m -XX:MaxPermSize=256m"
Jun 30 08:05:19  export JAVA_OPTS="-Xms512m -Xmx4096m -XX:MaxPermSize=256m"
Jun 30 08:05:19  export LUCEE_JAVA_OPTS="-Xms512m -Xmx4096m -XX:MaxPermSize=256m"

2025-06-30 14:05:19: INFO: Starting Tomcat...

As you can see, the 4 GB max heap has been exceeded (the non-heap overhead on this container appears to run around 100-200 MB).
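If it helps, I can pull numbers from inside the container; a rough way to compare the JVM's own view of the heap against the container total (a sketch, assuming the standard JDK tools are in the image and Java is running as PID 1) would be:

jcmd 1 VM.flags        # the heap flags the JVM actually picked up
jcmd 1 GC.heap_info    # current heap capacity and usage as the JVM sees it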

I would wait to see whether this instance crashes, but for environment-balancing reasons I can't force it to do so right now. I have to move production workloads by the end of July if we're going to switch from AWS, so I'm trying to get a handle on this. Legacy worked without issue. Thanks!


milestonetech
PRO

a month ago

(removed...canned response deleted from a user)


Status changed to Solved milestonetech 25 days ago


milestonetech
PRO

25 days ago

Had to rearchitect the whole solution. For folks who run into this issue in the future: it's not worth the rabbit hole. Rearchitect with a different Java variant that actually shows heap and GC release.
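For anyone who wants to confirm whether the heap is actually being collected before going that far, one rough check on JDK 9+ (a sketch, not what we ended up shipping) is to turn on unified GC logging and watch for collection events in the container logs:

export JAVA_OPTS="$JAVA_OPTS -Xlog:gc*:stdout:time,uptime,tags"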


Status changed to Awaiting Railway Response Railway 25 days ago


Status changed to Solved milestonetech 25 days ago

