Unresponsive deployment after some hours

a year ago

Hello, I've been using Railway to host my Telegram bot for more than a year and I never experienced this sort of issue until I enabled app sleeping a few days ago.
With that option enabled, the bot would just go to sleep without ever being able to wake up (but that's fine, given I hadn't taken into account that a Telegram bot only polls for updates).

So, when I noticed that, I disabled app sleeping and let the bot redeploy with the new configuration, but over the following days the bot kept becoming unresponsive after a few hours. I tried to fix it by redeploying, but that was always a temporary fix, and now it seems to be stuck again (with no issues logged).

Can you please check if there is anything wrong with the configuration or the deployment of my project?

I'd be happy to share any other information needed.
Also, I'd like to have the project working again as soon as possible, but I can keep it broken until the issue is triaged, if needed.

The project is f6d25e17-9bb2-457f-8ce2-e55b5ce1dcd8
The service is 8c4e4e87-7cf3-4ab2-9ec2-b6cc41db7b5b
The deployment is 81cc519b-a7a3-41db-8266-53e08952e935

EDIT: the deploy was triggered yesterday at 2:16 PM CET (GMT+2) and it was working properly at least until today at 1:15 AM CET (GMT+2)


a year ago

So it sounds like it's still going to sleep? does that sound right to you?


a year ago

the immediate solution would be to deploy your bot into another service and leave the bugged service alone for now


a year ago

side note, I'd be curious to see how a telegram bot that uses webhooks instead of polling would work with app sleeping
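
For comparison with polling, here is a minimal sketch of the receiving side of a webhook bot, using only the Python standard library. With webhooks, Telegram sends each update as an HTTP POST with a JSON body to a URL registered through the Bot API's `setWebhook` method, so the traffic is inbound rather than the bot asking for updates. The names `WebhookHandler`, `received_updates` and `start_webhook_server` are made up for illustration, the language of the bot in this thread is not known, and a real deployment would need a public HTTPS URL and would bind to the port the platform provides instead of localhost.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory store; a real bot would dispatch to its handlers.
received_updates = []


class WebhookHandler(BaseHTTPRequestHandler):
    """Accepts updates POSTed as JSON and records them."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            update = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            # Reject bodies that are not valid JSON.
            self.send_response(400)
            self.end_headers()
            return
        received_updates.append(update)
        # Acknowledge quickly so the sender does not retry.
        self.send_response(200)
        self.end_headers()

    def log_message(self, fmt, *args):
        # Silence the default per-request log line.
        pass


def start_webhook_server(port: int = 0) -> HTTPServer:
    """Start the server in a background thread (port 0 picks a free port)."""
    server = HTTPServer(("127.0.0.1", port), WebhookHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Whether an inbound request like this is enough to wake a slept service is exactly the open question in the message above; the sketch only shows what the bot side would look like.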


a year ago

So it sounds like it's still going to sleep?
It feels like it, but I'm having a hard time figuring out what could be causing it (especially given there were little to no recent changes to the bot's logic)

the immediate solution would be to deploy your bot into another service and leave the bugged service alone for now
Thanks, I hadn't thought about that! I've now deployed the service on my test environment while leaving production alone. As I suspected, I got no errors from Telegram (which should complain when 2 instances of the same bot are running concurrently), so it really feels dead.

Thanks as always @Brody, should I ping again in a few days to see if the issue can be looked into thoroughly?


a year ago

if you don't hear back from me by Tuesday please ping, as i plan on bringing this up to the team, and in that case it would be helpful to leave the suspected bugged service untouched if possible


a year ago

hey rob, the applicable person would be off until tmr


a year ago

@Brody I've got bad news (totally on me): I left automatic deploys enabled and a pull request was automerged overnight, so the faulty deployment is now gone.

For the time being you can ignore this issue, I will ping you if it happens again


a year ago

okay sounds good!


a year ago

Hi @Brody, sorry to bother you once more but it finally happened again. The stuck deployment is e90da47 (service 8c4e4e87-7cf3-4ab2-9ec2-b6cc41db7b5b).

Unfortunately I can't disable the connection to my main branch because that would require a redeploy, but I will do my best to stop merging code until the issue is looked at (the only risk would be an automerge by Renovate during the night)


a year ago

so the bot is unresponsive?


a year ago

It was unresponsive until I deployed it on my test environment (with prod's token). The prod deployment has been stuck for about three and a half hours now


a year ago

what makes you think your deployment is being put to sleep instead of soft locking or something similar?


a year ago

Speaking for this deployment only, as I can't really recall the older ones: it was a freshly deployed instance (5 hours old), it wasn't consuming many resources, and I recently limited the concurrency to at most 10 requests at a time.
The bot itself has never had issues that made it stop working without any sign. I would expect at least a stack trace, but there was none.

You can see in the image the point in time when it stopped working, while processing 10 JPEG -> PNG conversions

[image attachment]


a year ago

you suspect it got slept at 3:12 pm?


a year ago

Somewhere after that, I can only say that's the last log I have proving the bot was online


a year ago

and is it currently "sleeping" or have you since done a redeploy


a year ago

It is still sleeping right now. I left the prod deployment in place and enabled the test one with the same token, and that doesn't trigger Telegram's error saying that only one instance of a bot can run at a time.

Also, until now I had never seen its memory decrease along such a "slow" curve as it did after 3:12 PM (CET).


a year ago

so here's the thing, if it was sleeping the memory reporting would have frozen


a year ago

yet memory reporting continued and did differ


a year ago

now i know i said i planned on bringing this up to the team, but im going to hold off on that for now since this is not looking like a platform issue


a year ago

That's fine, and I get your point of view. I need to find a way to figure it out, because as of now I wouldn't know how to triage it


a year ago

is that service on the v2 runtime at least?


a year ago

New builder and runtime v2


a year ago

then i think the next course of action would be to add some very verbose logging so you can try to determine when and where your app is softlocking, and then hopefully why
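
One possible way to get that kind of logging is a small watchdog: the bot reports each step it reaches, and a background thread logs a warning and dumps every thread's stack trace when no step has been reported for a while. This is a minimal sketch in Python using only the standard library; the language of the bot in this thread is not known, and `Watchdog`, `beat`, `stall_after` and `check_every` are hypothetical names chosen for illustration.

```python
import faulthandler
import logging
import sys
import threading
import time

log = logging.getLogger("bot.watchdog")


class Watchdog:
    """Warns and dumps all thread stacks when no progress is reported in time."""

    def __init__(self, stall_after: float, check_every: float = 1.0):
        self.stall_after = stall_after    # seconds without progress = stall
        self.check_every = check_every    # how often the watchdog checks
        self.last_beat = time.monotonic()
        self.last_step = "startup"
        self.stalls = 0                   # number of stall checks that fired
        self._stop = threading.Event()

    def beat(self, step: str) -> None:
        """Call from the bot at every interesting step (hypothetical hook)."""
        self.last_step = step
        self.last_beat = time.monotonic()
        log.debug("progress: %s", step)

    def _run(self) -> None:
        # wait() returns False on timeout, True once stop() has been called.
        while not self._stop.wait(self.check_every):
            idle = time.monotonic() - self.last_beat
            if idle > self.stall_after:
                self.stalls += 1
                log.warning("no progress for %.1fs, last step: %s",
                            idle, self.last_step)
                try:
                    # Shows where every thread is currently blocked.
                    faulthandler.dump_traceback(file=sys.stderr, all_threads=True)
                except Exception:
                    log.exception("could not dump thread stacks")

    def start(self) -> None:
        threading.Thread(target=self._run, daemon=True, name="watchdog").start()

    def stop(self) -> None:
        self._stop.set()
```

A call such as `watchdog.beat("converting image 3 of 10")` before and after each conversion would narrow down which step was the last one reached. Because the check runs in its own thread, it can keep logging while the main thread is stuck in a blocking call that still lets other threads run.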


a year ago

I will see what I can come up with, thanks as always 😄
May I delete the stuck instance? I'm in no rush in that regard


a year ago

yep!


a year ago

and if you still think this is railway sleeping your service, catch and log sigterm, as that's the signal sent when your container is stopped by railway for any reason
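
A minimal sketch of catching and logging SIGTERM, written in Python with the standard library only. The bot's actual language is not stated in this thread, and `handle_sigterm`, `install_signal_logging` and `received_signals` are hypothetical names. The handler here only logs; a real bot would normally also start a clean shutdown after logging.

```python
import logging
import signal
import sys

logging.basicConfig(
    level=logging.INFO,
    stream=sys.stdout,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("bot")

# Kept only so the behaviour can be inspected; not needed in a real bot.
received_signals = []


def handle_sigterm(signum, frame):
    """Record and log the signal so a stop shows up in the deploy logs."""
    name = signal.Signals(signum).name
    received_signals.append(name)
    log.warning("received %s, the container is being stopped", name)


def install_signal_logging() -> None:
    # Must be called from the main thread, before the bot starts its loop.
    signal.signal(signal.SIGTERM, handle_sigterm)
```

If the bot goes quiet and no such line ever appears in the logs, that would be one more hint that the process was not stopped from outside and is stuck internally instead.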


a year ago

How confident are you that the memory metrics of a sleeping service stay frozen until it is woken up?


a year ago

unless they have since fixed that (doubt it) then that would still be the case


a year ago

yep, metrics are not seeded with zeros during sleep so they would appear as frozen if there were metrics to begin with, but this service has been slept for a very long time so there are not enough awake metrics to show either

[image attachment]


a year ago

How long would it take for those services to go to sleep? I don't think that's my case, because I guess I would have noticed it in the UI


a year ago

10 - 15 minutes maybe


a year ago

but yeah i think its very safe to say that your deployment was not slept


a year ago

Perfect, that doesn't match how long my deployment stayed alive