a month ago
I have a (Bun) service running every 5 minutes that checks values in a table and then pushes certain jobs to a BullMQ queue.
Starting around three days ago (possibly earlier), the service went from finishing in 5-10 seconds to running for the full 5 minutes between scheduled runs.
The logs would only show a repeated "Starting Container" message with no other output (including logs from within the code). It appears that no application code runs during this time.
Redeploying the service seems to "fix" this for 30 minutes to an hour, but then it will start messing up again.
As far as I can tell, there haven't been any notable changes to Bun, Railpack, etc that exactly line up with when this started.
12 Replies
a month ago
Service:
https://railway.com/project/b4c0db64-2177-406d-8559-65238726654b/service/759fee7f-b630-43b4-b003-86fead5c0646/
a month ago
Hey there,
So I did a deep-dive investigation on the cron, and I didn't see anything wrong with the cron itself not starting. It could be that we're not forwarding logs. Can you deploy one more time (I know, sorry)? That way I can monitor the behavior over that period to see what might have gone wrong. That said, if it's critical, I would have your code run 24/7 and then use the native Bun cron to check against the service so you can keep fidelity there.
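For reference, the always-on approach could look something like the minimal sketch below. This is only an illustration, not your actual code: `runOnce()` is a hypothetical stand-in for the existing "check table values, enqueue BullMQ jobs" logic, and `RUN_SCHEDULER` is an assumed environment flag.

```typescript
// Minimal always-on scheduler sketch for a Bun/Node service.
const INTERVAL_MS = 5 * 60 * 1000;

// Milliseconds until the next aligned interval boundary (:00, :05, :10, ...).
function msUntilNextTick(intervalMs: number, now: number): number {
  return intervalMs - (now % intervalMs);
}

async function runOnce(): Promise<void> {
  // ...check table values and push jobs to the BullMQ queue here...
}

async function main(): Promise<void> {
  for (;;) {
    // Sleep until the next 5-minute boundary, then do one run.
    await new Promise((r) => setTimeout(r, msUntilNextTick(INTERVAL_MS, Date.now())));
    try {
      await runOnce();
    } catch (err) {
      // Log and keep looping; one failed run shouldn't kill the scheduler.
      console.error("scheduled run failed:", err);
    }
  }
}

// Start the loop only when explicitly enabled (e.g. in the deployed service).
if (process.env.RUN_SCHEDULER) void main();
```

Aligning sleeps to the interval boundary keeps runs on the same :00/:05 cadence the external cron had, and catching per-run errors keeps a single bad run from taking the whole process down.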
Status changed to Awaiting User Response Railway • about 1 month ago
a month ago
I've redeployed the service!
Just for a bit of clarity, when the service starts to "misbehave" on runs, it stops pushing jobs to the BullMQ queue, so I am relatively certain that this issue goes beyond the logs not appearing.
Also, I will be working on moving this from the current approach to running in a different, always-running process. I will likely be getting this pushed out early this coming week, but in the meantime if there's any solution you can find it would be appreciated.
Thank you for taking a look at this!
Status changed to Awaiting Railway Response Railway • about 1 month ago
Status changed to Awaiting User Response Railway • about 1 month ago
angelo-railway
Noted! Looking into it now and keeping a tab open.
a month ago
Have you been able to identify the source of the problem? I've noticed that the usage has gone way up due to this service acting up, so I'd like to stop it before incurring more charges.
I have a change ready to be deployed that will convert it to a 24/7 script with an internal CRON scheduler.
Status changed to Awaiting Railway Response Railway • about 1 month ago
a month ago
I did see that a much earlier version of the script was deployed ~18 hours ago and appears to be working fine.
I'll hold off deploying my change until I hear back.
a month ago
Just checked again, the deployment has started to misbehave again. 
a month ago
Had to push the change. This should no longer be an issue in my case, but I would still like to know what happened, if possible.
a month ago
Just to clarify are you using Railway's cron feature? The service you had linked is not a cron any longer. I assume the fix is to just switch it to a scheduled loop.
I don't have insight on the exact logs but we are working on a project which should fix not only cron reliability but a bunch of the "push code" -> "get it built" -> "get it live" flow.
Status changed to Awaiting User Response Railway • about 1 month ago
a month ago
I was previously using Railway's CRON and was running into an issue where it would start the container but fail to run any code. It also seemed to use a fair amount of resources considering it wasn't doing anything.
I've moved away from this, but if I need to run anything using Railway's CRON later I'd like to know why this happened so it won't happen again.
Status changed to Awaiting Railway Response Railway • about 1 month ago
a month ago
The logs revealed PostgreSQL shared memory exhaustion errors during the problematic period, specifically "could not resize shared memory segment to 8388608 bytes: No space left on device". This explains the behavior you observed. Your service was likely attempting to establish a database connection at startup, and when PostgreSQL couldn't allocate the required shared memory, the connection attempt would hang indefinitely.
The container appeared "running" because the process started, but your application code never executed because it was blocked waiting for a database connection that couldn't be established. Since Railway's CRON skips the next scheduled execution when a previous one is still active, this created the cascading effect where the service would run for the full 5 minutes doing nothing, then get skipped repeatedly. Redeploying temporarily fixed it because it killed the hung container and started fresh, but the underlying database resource issue would eventually cause the same hang. If you use Railway's CRON in the future, watch for this specific error pattern in your logs and monitor your PostgreSQL memory metrics and connection counts.
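One defensive pattern against this failure mode is to race the startup connection attempt against a timer, so a stuck connection fails the run loudly instead of hanging for the full window. The sketch below is a generic illustration, not part of any Railway or BullMQ API; `withTimeout` is a hypothetical helper, and the never-settling promise simulates the hang described above.

```typescript
// Hypothetical startup guard: race any connect() promise against a timer so
// a stuck database connection fails fast instead of hanging indefinitely.
function withTimeout<T>(p: Promise<T>, ms: number, label = "operation"): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms,
    );
    p.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Usage sketch: a connection attempt that never settles, simulating the hang.
const neverConnects = new Promise<void>(() => {});
withTimeout(neverConnects, 200, "db connect").catch((err) => {
  // Surface the failure in logs so a stuck run is visible instead of silent.
  console.error(String(err));
});
```

With a guard like this, a run that can't get a database connection would exit with a clear timeout error within seconds, which also keeps the cron schedule from silently skipping subsequent executions.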
Status changed to Awaiting User Response Railway • about 1 month ago
Status changed to Solved tonkotsu • about 1 month ago
a month ago
Thank you!
Status changed to Awaiting Railway Response Railway • about 1 month ago
Status changed to Solved tonkotsu • about 1 month ago