Scary / Interesting Railway bug?
-ok-
PROOP

2 years ago

I have two Node.js services deployed from the same GitHub repo (but with different build and start commands). One of them runs continuously in the background (let's call it "server"), and the other runs on a cron schedule (let's call it "cron").

I don't have a dev / stage / prod workflow; I normally just push changes to GitHub after testing them locally, and then I verify that the services have deployed to Railway and have a green check mark next to them.

Yesterday, I pushed some changes to GitHub.
The services redeployed on Railway.app, and had the green check marks.
I verified that things were working for one of the services (the "server") by interacting with it - and it seemed fine, so I continued with my day.

This morning I realised that my cron jobs weren't running…
When I looked into it, the service showed a history of jobs that were all "Skipped"… strange. Why would my crons be "skipped"?
I tried running it manually but got an error (unfortunately didn't capture it).

Upon some digging, I realised that the "deployment" was empty - there was no image. So somehow pushing to GitHub caused an empty deployment with no image at all - and that overwrote my previous successful deployment.
There weren't any build logs, there was nothing to see - if I click on a deployment, the details tab is empty… The raw configuration file contains only the following:
{
  "$schema": "https://railway.app/railway.schema.json"
}

Lesson learnt: double-check both my services with every change. But this is scary to me. The green checkmark indicated that things were fine, and Railway clearly thought that things were fine, because it somehow deleted my previous deployment and replaced it with an empty one.

Also, because there was no deployment, I couldn't just re-deploy, or restart, or do any of the usual things. I had to make a tiny change to the code (incremented the version info) and push the change to GitHub to fix this issue.

I'm relatively new to Railway, so not sure if I'm doing something wrong - but in the past month or so of using it, this is the first time I've seen this behaviour, and for now I'm guessing this is a bug.
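
For the "double-check both services" lesson above, here is a minimal sketch of a staleness check that does not rely on the green check mark. It assumes the cron job records the time of its last successful run somewhere the checker can read it; `checkCronHealth`, `CronHealth`, and the one-minute grace period are hypothetical names and values, not part of Railway's API.

```typescript
// Hypothetical helper: nothing here comes from Railway's API.
interface CronHealth {
  stale: boolean;     // true if a scheduled run is overdue
  missedRuns: number; // approximate count of whole intervals elapsed since the last success
}

function checkCronHealth(
  lastSuccess: Date,
  now: Date,
  intervalMs: number,
  graceMs: number = 60_000, // assumed tolerance for slow starts
): CronHealth {
  const elapsed = now.getTime() - lastSuccess.getTime();
  // The job is overdue once more than one interval (plus grace) has passed.
  const stale = elapsed > intervalMs + graceMs;
  const missedRuns = stale ? Math.floor(elapsed / intervalMs) : 0;
  return { stale, missedRuns };
}

// Example: a job scheduled every 6 hours that last succeeded 30 hours ago.
const sixHours = 6 * 60 * 60 * 1000;
const health = checkCronHealth(
  new Date("2024-05-01T00:00:00Z"),
  new Date("2024-05-02T06:05:00Z"),
  sixHours,
);
console.log(health);
```

Something like this could run inside the always-on "server" service and raise an alert when `stale` is true.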

12 Replies

brody
EMPLOYEE

2 years ago

That doesn't sound like anything you could have ever done wrong, even on purpose. But who knows; if it ever happens again, report back and I'll raise it to the team.

And FWIW you can open a service up and do ctrl / cmd + k -> Deploy latest commit and Railway will redeploy the latest commit for you.


-ok-
PROOP

2 years ago

@brody - agreed, I don't think I could trigger this behaviour intentionally if I tried.
Thanks for the tip on the CMD+K => Deploy latest commit, that's helpful.
Do you think it makes sense to wait for me/someone else to face the same issue before the Railway team looks into it?
Given that this is fresh and logs might still exist - wouldn't it make sense for them to at least look into what caused this glitch, even if they decide not to fix the underlying issue if it's a rare edge case?


brody
EMPLOYEE

2 years ago

Would you be able to provide some exact time stamps of when you pushed that commit that ended up in an empty deployment?


-ok-
PROOP

2 years ago

Surprisingly tricky to figure that out -

  • git log shows the time when I commit the changes, not when I sync with origin…

  • And GitHub shows an approximate time (21 hours ago)…

Luckily, since the change was pushed to both my services - and worked successfully for my "server", I could see the first timestamp on the build log there:

UTC Time: 29 Apr 2024 7:40:17 am

So the exact time the change was pushed to GitHub would probably be within a few seconds of that.
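
On the commit-time versus push-time point above: the clone that did the push usually keeps a reflog for its remote-tracking branch, and a push is recorded there as "update by push" with the local time of the update. A rough sketch of pulling those timestamps out, assuming the branch is `origin/main` and the output comes from `git reflog show --date=iso-strict origin/main`; `parsePushTimes` and `PushEvent` are hypothetical names, and the sample lines below are made up for illustration.

```typescript
// Feed this the text printed by:
//   git reflog show --date=iso-strict origin/main
// ("origin/main" is an assumed remote-tracking branch name.)
interface PushEvent {
  pushedAt: Date;
  message: string;
}

function parsePushTimes(reflogOutput: string): PushEvent[] {
  // Each entry looks like "<sha> <ref>@{<timestamp>}: <message>".
  const entry = /@\{([^}]+)\}:\s*(.*)$/;
  const events: PushEvent[] = [];
  for (const line of reflogOutput.split("\n")) {
    const m = entry.exec(line);
    if (!m) continue;
    const stamp = m[1] ?? "";
    const message = m[2] ?? "";
    // Keep only updates caused by a push, skipping fetches.
    if (!message.includes("push")) continue;
    const pushedAt = new Date(stamp);
    if (!Number.isNaN(pushedAt.getTime())) events.push({ pushedAt, message });
  }
  return events;
}

// Made-up sample output, for illustration only.
const sample = [
  "abc1234 refs/remotes/origin/main@{2024-01-01T12:00:00+00:00}: update by push",
  "def5678 refs/remotes/origin/main@{2023-12-31T09:30:00+00:00}: fetch origin: fast-forward",
].join("\n");

for (const e of parsePushTimes(sample)) {
  console.log(e.pushedAt.toISOString(), e.message);
}
```

This only works on the machine that pushed, and reflog entries expire after a while, so it is a fallback rather than a replacement for the build-log timestamp.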


-ok-
PROOP

2 years ago

The same thing happened again.
No new code changes had been pushed, the cron jobs were running fine, until they suddenly weren't.

I was traveling, came back last night - discovered that 5 jobs were skipped - for no apparent reason.
Config details for the skipped jobs are the same as earlier - only: { "$schema": "https://railway.app/railway.schema.json" }

Also, my trial plan had $1.5 left on it, and going by previous history, this should have lasted me a couple of weeks at least.
But surprisingly, during my 4-day trip I burnt through over $1.2, leaving only $0.3, so somehow resources were being consumed at a much higher rate.
This cron job service runs for just 10 seconds at a time, 4 times a day (let's say under 15 minutes total every day, including build / deployment times), yet the metrics show it consuming 1 vCPU continuously throughout the day.
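
To put numbers on that gap, here is a quick back-of-the-envelope calculation using only the figures above. It assumes each run keeps one vCPU fully busy for its whole duration, which overstates the expected usage if anything; the variable names are just for illustration.

```typescript
const SECONDS_PER_DAY = 24 * 60 * 60; // 86,400

// CPU-seconds per day for a job that runs a fixed number of times.
function dailyCpuSeconds(runsPerDay: number, secondsPerRun: number): number {
  return runsPerDay * secondsPerRun;
}

const jobOnly = dailyCpuSeconds(4, 10); // 4 runs x 10 s = 40 s/day
const generousBound = 15 * 60;          // "under 15 minutes" = 900 s/day
const continuous = SECONDS_PER_DAY;     // 1 vCPU busy all day

// How many times more CPU time continuous usage is than expected.
const overBound = continuous / generousBound; // 96
const overJobOnly = continuous / jobOnly;     // 2160

console.log(`vs. 15-minute bound: ${overBound}x`);
console.log(`vs. job time only: ${overJobOnly}x`);
```

So even against the generous 15-minute estimate, continuous usage is roughly 96 times more CPU time than the job should need.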

The logs show it exited at the end of the last successful cron job, so I have no idea why the CPU usage is high, and why it was skipping the last several runs.

Thanks to @brody I was able to quickly fix this with a "Deploy latest commit".

Could it be because my build output is a 2MB+ javascript file? Is that too large for Railway to reliably handle?
To be honest, I'm not sure what else it could be… Or how to debug it because the logs don't indicate any issue.


brody
EMPLOYEE

2 years ago

Provide a project and service ID and I will escalate this thread.


-ok-
PROOP

2 years ago

I could get my service ID from the Observability tab... @service:bb4f0296-4647-4f3f-a0d2-e99cc6a68b12 But I'm not sure how to get my Project ID. If the URL is anything to go by, it would be https://railway.app/project/48b43858-db82-43e1-b9fb-96051789dbba

Is this safe to share publicly on here?


brody
EMPLOYEE

2 years ago

Yep, none of the project / environment / service / deployment / etc IDs are sensitive information.


brody
EMPLOYEE

2 years ago

This thread has been escalated to the Railway team.

Status changed to Awaiting Railway Response (brody, over 1 year ago)


-ok-
PROOP

2 years ago

How long does it typically take for Railway to respond?


brody
EMPLOYEE

2 years ago

Railway does not guarantee a response time for Hobby users. It's currently the weekend, so they will get to this thread on a weekday.


ray-chen
EMPLOYEE

2 years ago

Hey, could you try re-creating the cron service to see if it happens again?


-ok-
PROOP

2 years ago

At the moment, things seem fine - so I'm not sure how it would help to delete the service and create it again.

The metrics have also come back down to normal (I didn't do anything other than redeploy the latest commit).

This seems to be an erratic issue that's not easy to replicate, but it's the second time it has happened (see the thread above).

I can certainly re-create the cron service, but could you please explain your thinking here - how would that help?
