a month ago
I have a (Bun) service running every 5 minutes that checks values in a table and then pushes certain jobs to a BullMQ queue.
Starting around three days ago (possibly earlier), the service went from finishing in 5-10 seconds to running for the full 5 minutes between scheduled runs.
The logs would only show a repeated "Starting Container" message with no other output (including logs from within the code). It appears that no application code runs during this time.
Redeploying the service seems to "fix" this for 30 minutes to an hour, but then it will start messing up again.
As far as I can tell, there haven't been any notable changes to Bun, Railpack, etc that exactly line up with when this started.
12 Replies
a month ago
Service:
https://railway.com/project/b4c0db64-2177-406d-8559-65238726654b/service/759fee7f-b630-43b4-b003-86fead5c0646/
a month ago
Hey there,
So I did a deep-dive investigation on the cron, and I didn't see anything wrong with the cron itself not starting. It could be that we're not forwarding logs. Can you deploy one more time (I know, sorry)? That way I can monitor the behavior over that period to see what might have gone wrong. That said, if it's critical, I would have your code run 24/7 and then use the native Bun cron to check against the service so you can keep fidelity there.
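For reference, the always-on approach could look something like the minimal sketch below. This is only an illustration, not your actual code: `runOnce()` is a hypothetical stand-in for the existing "check table values, enqueue BullMQ jobs" logic, and `RUN_SCHEDULER` is an assumed environment flag.

```typescript
// Minimal always-on scheduler sketch for a Bun/Node service.
const INTERVAL_MS = 5 * 60 * 1000;

// Milliseconds until the next aligned interval boundary (:00, :05, :10, ...).
function msUntilNextTick(intervalMs: number, now: number): number {
  return intervalMs - (now % intervalMs);
}

async function runOnce(): Promise<void> {
  // ...check table values and push jobs to the BullMQ queue here...
}

async function main(): Promise<void> {
  for (;;) {
    // Sleep until the next 5-minute boundary, then do one run.
    await new Promise((r) => setTimeout(r, msUntilNextTick(INTERVAL_MS, Date.now())));
    try {
      await runOnce();
    } catch (err) {
      // Log and keep looping; one failed run shouldn't kill the scheduler.
      console.error("scheduled run failed:", err);
    }
  }
}

// Start the loop only when explicitly enabled (e.g. in the deployed service).
if (process.env.RUN_SCHEDULER) void main();
```

Aligning sleeps to the interval boundary keeps runs on the same :00/:05 cadence the external cron had, and catching per-run errors keeps a single bad run from taking the whole process down.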
Status changed to Awaiting User Response Railway • about 1 month ago
a month ago
I've redeployed the service!
Just for a bit of clarity, when the service starts to "misbehave" on runs, it stops pushing jobs to the BullMQ queue, so I am relatively certain that this issue goes beyond the logs not appearing.
Also, I will be working on moving this from the current approach to running in a different, always-running process. I will likely be getting this pushed out early this coming week, but in the meantime if there's any solution you can find it would be appreciated.
Thank you for taking a look at this!
Status changed to Awaiting Railway Response Railway • about 1 month ago
Status changed to Awaiting User Response Railway • about 1 month ago
angelo-railway
Noted! Looking into it now and keeping a tab open.
a month ago
Have you been able to identify the source of the problem? I've noticed that the usage has gone way up due to this service acting up, so I'd like to stop it before incurring more charges.
I have a change ready to be deployed that will convert it to a 24/7 script with an internal CRON scheduler.
Status changed to Awaiting Railway Response Railway • about 1 month ago
a month ago
I did see that a much earlier version of the script was deployed ~18 hours ago and appears to be working fine.
I'll hold off deploying my change until I hear back.
a month ago
Just checked again, the deployment has started to misbehave again. 
a month ago
Had to push the change. This should no longer be an issue in my case, but I would still like to know what happened, if possible.
a month ago
Just to clarify are you using Railway's cron feature? The service you had linked is not a cron any longer. I assume the fix is to just switch it to a scheduled loop.
I don't have insight on the exact logs but we are working on a project which should fix not only cron reliability but a bunch of the "push code" -> "get it built" -> "get it live" flow.
Status changed to Awaiting User Response Railway • about 1 month ago
a month ago
I was previously using Railway's CRON and was running into an issue where it would start the container but fail to run any code. It also seemed to use a fair amount of resources considering it wasn't doing anything.
I've moved away from this, but if I need to run anything using Railway's CRON later I'd like to know why this happened so it won't happen again.
Status changed to Awaiting Railway Response Railway • about 1 month ago
a month ago
The logs revealed PostgreSQL shared memory exhaustion errors during the problematic period, specifically "could not resize shared memory segment to 8388608 bytes: No space left on device". This explains the behavior you observed. Your service was likely attempting to establish a database connection at startup, and when PostgreSQL couldn't allocate the required shared memory, the connection attempt would hang indefinitely.
The container appeared "running" because the process started, but your application code never executed because it was blocked waiting for a database connection that couldn't be established. Since Railway's CRON skips the next scheduled execution when a previous one is still active, this created the cascading effect where the service would run for the full 5 minutes doing nothing, then get skipped repeatedly. Redeploying temporarily fixed it because it killed the hung container and started fresh, but the underlying database resource issue would eventually cause the same hang. If you use Railway's CRON in the future, watch for this specific error pattern in your logs and monitor your PostgreSQL memory metrics and connection counts.
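One defensive pattern against this failure mode is to race the startup connection attempt against a timer, so a stuck connection fails the run loudly instead of hanging for the full window. The sketch below is a generic illustration, not part of any Railway or BullMQ API; `withTimeout` is a hypothetical helper, and the never-settling promise simulates the hang described above.

```typescript
// Hypothetical startup guard: race any connect() promise against a timer so
// a stuck database connection fails fast instead of hanging indefinitely.
function withTimeout<T>(p: Promise<T>, ms: number, label = "operation"): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms,
    );
    p.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Usage sketch: a connection attempt that never settles, simulating the hang.
const neverConnects = new Promise<void>(() => {});
withTimeout(neverConnects, 200, "db connect").catch((err) => {
  // Surface the failure in logs so a stuck run is visible instead of silent.
  console.error(String(err));
});
```

With a guard like this, a run that can't get a database connection would exit with a clear timeout error within seconds, which also keeps the cron schedule from silently skipping subsequent executions.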
Status changed to Awaiting User Response Railway • about 1 month ago
Status changed to Solved tonkotsu • about 1 month ago
a month ago
Thank you!
Status changed to Awaiting Railway Response Railway • about 1 month ago
Status changed to Solved tonkotsu • about 1 month ago