Railway seems not to shut down previous deployments even after a new build & deployment has occurred
dalechyn
PRO · OP

a year ago

We have experienced a very weird issue after pushing a new commit to our service that uses Twitter's Filtered Streams.
This might sound out of context, but Twitter allows only one consumer to listen to a filtered stream at a time.

After the deployment, we noticed the service was down starting around 04:00 GMT+2 and for roughly 14 hours after.
I have tried to fix the issue and thought it might be code related, although I did not change any integration logic.

And just now, like 5 minutes ago, I have pushed some commits to another service, our website's backend.
It had an issue that made the deployment fail. I fixed it.
However, I see logs both from the previous deployment and from the current deployment.

Solved

59 Replies

dalechyn
PRO · OP

a year ago

One important note is that we have the "Always" restart policy set
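For reference, the restart policy can also be pinned in Railway's config-as-code; a minimal sketch of a `railway.json`, assuming the field names from Railway's config schema (verify against the current docs):

```json
{
  "deploy": {
    "restartPolicyType": "ALWAYS",
    "restartPolicyMaxRetries": 10
  }
}
```

The restart policy governs how a crashed container is restarted within one deployment; it does not control how an old deployment is replaced by a new one.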


dalechyn
PRO · OP

a year ago

And when the new deployment kicks in, it seems not to shut down the previous one immediately – it feels like it hangs on for a random amount of time


dalechyn
PRO · OP

a year ago

[image attachment]


dalechyn
PRO · OP

a year ago

here's the twitter issue i mentioned at the start

[image attachment]


dalechyn
PRO · OP

a year ago

although the deployment was like several hours prior to the start of that long red-blue candle dance, it seems to me that the previous failing deployment decided to boot up at 4 am


dalechyn
PRO · OP

a year ago

or was never off


dalechyn
PRO · OP

a year ago

service 0b9228e6-0528-4f63-bc97-157a8909cc2b


brody
EMPLOYEE

a year ago

project id, service id, and environment please


dalechyn
PRO · OP

a year ago

project id 3e6a2b9c-4e34-41f8-979a-b83b27f3198d


dalechyn
PRO · OP

a year ago

environment mainnet


dalechyn
PRO · OP

a year ago

it's for this one


dalechyn
PRO · OP

a year ago

for this one,
project id 3e6a2b9c-4e34-41f8-979a-b83b27f3198d
service 0e389c2b-a588-46d2-a58e-b4b76fce4613
environment testnet


brody
EMPLOYEE

a year ago

im sorry but the issue is not clear, you opened this thread and reported issues with two services?


dalechyn
PRO · OP

a year ago

since it has occurred twice with different services and different environments (exactly where i pushed new code and caused new deployments to roll out), it made me think something is wrong with railway


dalechyn
PRO · OP

a year ago

yes, you can check any


brody
EMPLOYEE

a year ago

same issue for both services?


dalechyn
PRO · OP

a year ago

most likely – the previous deployment kept running alongside the new (current) one for too long before stopping


dalechyn
PRO · OP

a year ago

for this case this was literally hours


dalechyn
PRO · OP

a year ago

for this case this was 7 minutes


dalechyn
PRO · OP

a year ago

the twitter service literally worked without a flaw for the last two weeks, and no twitter integration code was changed


dalechyn
PRO · OP

a year ago

i pushed a new commit to tune the ai prompt and here we are


brody
EMPLOYEE

a year ago

you are giving a lot of unorganized information all at once here


dalechyn
PRO · OP

a year ago

ok, guide me, what do you need exactly to investigate the issue further?


brody
EMPLOYEE

a year ago

lets focus on one service at a time, what service would you like me to look into first?


dalechyn
PRO · OP

a year ago

let's look at this one


brody
EMPLOYEE

a year ago

okay, can you provide a full UTC timestamp of when you made a new deployment, and the old deployment didnt get killed?


drmarshall
PRO

a year ago

Confirming that old services are not being removed properly. It seems like new deployments are not picking up network properly


brody
EMPLOYEE

a year ago

DrMarshall,

Please open your own thread.


dalechyn
PRO · OP

a year ago

On Dec 16 at 21:56 UTC I pushed the commit


dalechyn
PRO · OP

a year ago

1734386160


dalechyn
PRO · OP

a year ago

On Dec 17 at 02:00 UTC our service started failing – this is quite common when our twitter service is under load and has to fight rate limits, but there was no load.
the specific details I provided above about twitter are important, since the twitter api allows only one stream consumer at a time.
it was throwing a "Too Many Connections" error, flagging that someone else was consuming the stream – supposedly another replica
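For context, Twitter's v2 filtered stream allows a single connection per app, and a second consumer is rejected with HTTP 429 "Too Many Connections" – which is exactly what an overlapping old deployment would cause. A minimal consumer sketch, assuming a bearer token and the standard v2 endpoint (the reconnect policy here is illustrative, not Twitter's official guidance):

```python
import http.client
import json
import time

STREAM_HOST = "api.twitter.com"
STREAM_PATH = "/2/tweets/search/stream"

def next_backoff(current: float, cap: float = 300.0) -> float:
    """Exponential backoff for 429 responses: double the wait, capped."""
    return min(current * 2, cap)

def consume_stream(bearer_token: str) -> None:
    """Consume the filtered stream; on HTTP 429 another consumer (e.g. an
    old deployment that was never shut down) holds the single allowed
    connection, so back off and retry instead of hammering the endpoint."""
    backoff = 1.0
    while True:
        conn = http.client.HTTPSConnection(STREAM_HOST)
        conn.request(
            "GET", STREAM_PATH,
            headers={"Authorization": f"Bearer {bearer_token}"},
        )
        resp = conn.getresponse()
        if resp.status == 429:  # Too Many Connections
            conn.close()
            time.sleep(backoff)
            backoff = next_backoff(backoff)
            continue
        backoff = 1.0
        for raw in resp:  # tweets arrive as newline-delimited JSON
            line = raw.strip()
            if line:
                print(json.loads(line))
```

With this shape, a lingering replica doesn't crash the new one outright – the new consumer just keeps backing off until the stale connection is finally released.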


brody
EMPLOYEE

a year ago

i can see very spotty metrics during that time, indicating it was crash looping


brody
EMPLOYEE

a year ago

but i also see only a single deployment running during that time, at least for the given service and environment id


dalechyn
PRO · OP

a year ago

yes – I architected it that way on purpose, so it occasionally fails and restarts to ride out rate limits


dalechyn
PRO · OP

a year ago

well then I'd suggest checking the second case


brody
EMPLOYEE

a year ago

will do


dalechyn
PRO · OP

a year ago

with these details


brody
EMPLOYEE

a year ago

same timestamp?


dalechyn
PRO · OP

a year ago

Dec 17 8:48PM - pushed first commit to fix the issue
Dec 17 8:52PM - pushed the "last fix" 🙂

the previous container kept running for at least another 4 minutes (if 8:55PM is the actual time of the "last fix" deployment container start) (i'm in gmt+2)

[image attachment]


dalechyn
PRO · OP

a year ago

so the previous deployment kept running for 4 minutes until stopping, although a new one appeared


dalechyn
PRO · OP

a year ago

why did it take so long to stop? – we also have "Always" as the restart policy here


brody
EMPLOYEE

a year ago

utc timestamps please


dalechyn
PRO · OP

a year ago

1734468925 - new container started


dalechyn
PRO · OP

a year ago

1734469175 - old container stopped
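These unix timestamps convert to the human-readable UTC times given later in the thread; a quick sketch:

```python
from datetime import datetime, timezone

def to_utc(ts: int) -> str:
    """Render a unix timestamp as a human-readable UTC string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%b %d %H:%M:%S UTC")

print(to_utc(1734468925))  # Dec 17 20:55:25 UTC – new container started
print(to_utc(1734469175))  # Dec 17 20:59:35 UTC – old container stopped
```

The 250-second gap between the two is the roughly 4-minute overlap being described.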


dalechyn
PRO · OP

a year ago

it just looks as if there's a race condition between restarting a container and removing it


brody
EMPLOYEE

a year ago

sorry that i have to say this, im not a robot, please give me human readable timestamps


dalechyn
PRO · OP

a year ago

okay, please clarify which kind of timestamps you're talking about – UNIX ones?


dalechyn
PRO · OP

a year ago

no worries i'm not a robot too beep-bop


brody
EMPLOYEE

a year ago

human readable, like the ones in the screenshot logs, but UTC please


dalechyn
PRO · OP

a year ago

those are UTC, i subtracted the local time difference, which is two hours, as you can see in the screenshot


dalechyn
PRO · OP

a year ago

8:55:25 PM - started new container


dalechyn
PRO · OP

a year ago

8:59:35 PM - stopped previous container


brody
EMPLOYEE

a year ago

what service is this, and in what environment?


dalechyn
PRO · OP

a year ago

this one


brody
EMPLOYEE

a year ago

okay i think i see the issue here


brody
EMPLOYEE

a year ago

you had [this deploy]() working and online

and then you pushed bad code multiple times; the new pushes never passed their health checks, so the working deployment was never taken offline.

this is by design, so the system is working properly here.
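The behavior described here can be made explicit with a health check in config-as-code; a sketch of a `railway.json`, assuming the field names from Railway's config schema (verify against the current docs):

```json
{
  "deploy": {
    "healthcheckPath": "/health",
    "healthcheckTimeout": 300
  }
}
```

With a health check configured, a new deployment only replaces the old one after it responds successfully on that path, which is why a broken push leaves the previous deployment running.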


dalechyn
PRO · OP

a year ago

hmm gotchu ty


brody
EMPLOYEE

a year ago

and yeah for the first service, I see the metrics for one deployment end, and the metrics for a new deployment start, but they do not overlap


brody
EMPLOYEE

a year ago

!s


Status changed to Solved by brody about 1 year ago

