a year ago
We have experienced a very weird issue after pushing a new commit to our service that uses Twitter's Filtered Streams.
This might sound out of context, but: Twitter allows only one consumer to listen to a filtered stream at a time.
After the deployment, we noticed the service shut down at 04:00 AM GMT+2, roughly 14 hours after the deploy.
I have tried to fix the issue and thought it might be code related, although I did not change any integration logic.
And just now, like 5 minutes ago, I have pushed some commits to another service, our website's backend.
It had an issue that made the deployment fail. I fixed it.
However, I see logs both from the previous deployment and from the current deployment.
And when the new deployment kicks in, it seems not to shut down the previous one immediately – it feels like it hangs on for a random amount of time
although the deployment was several hours prior to the start of that long red-blue candle dance, it seems to me that the previous failing deployment decided to boot up at 4 AM
a year ago
project id, service id, and environment please
for this one,
project id 3e6a2b9c-4e34-41f8-979a-b83b27f3198d
service 0e389c2b-a588-46d2-a58e-b4b76fce4613
environment testnet
a year ago
im sorry but the issue is not clear, you opened this thread and reported issues with two services?
since it has occurred twice, with different services in different environments (exactly where i pushed new code and caused new deployments to roll out), it made me think something is wrong with railway
a year ago
same issue for both services?
most likely – the previous deployment kept running alongside the new (current) one for too long before stopping
the twitter service worked without any flaws for the last two weeks, and no twitter integration code was changed
a year ago
you are giving a lot of unorganized information all at once here
a year ago
lets focus on one service at a time, what service would you like me to look into first?
a year ago
okay, can you provide a full UTC timestamp of when you made a new deployment, and the old deployment didnt get killed?
a year ago
Confirming that old services are not being removed properly. It seems like new deployments are not picking up network properly
At Dec 17 02:00 UTC our service started failing – this is quite common when our twitter service is under load and has to fight rate limits. there was no load this time.
the specific details I provided above about twitter are important since twitter api allows only one stream consumer at a time.
it was throwing a "Too Many Connections" error, flagging that someone else was consuming the stream – supposedly another replica
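for context on how the client reacts to that error – a minimal sketch (not our actual code, names and thresholds are illustrative) of a consumer backing off on HTTP 429 from the filtered-stream endpoint, since a second connection is rejected while another consumer holds the stream:

```python
# Twitter's filtered stream allows one concurrent consumer; a duplicate
# connection is rejected with HTTP 429 "Too Many Connections".
TOO_MANY_CONNECTIONS = 429

def backoff_seconds(attempt: int, base: float = 2.0, cap: float = 300.0) -> float:
    """Exponential backoff with a cap, so a duplicate replica does not
    hammer the endpoint while the old container is still draining."""
    return min(cap, base * (2 ** attempt))

def should_reconnect(status: int) -> bool:
    """Retry on 429 (stream held elsewhere) and on 5xx; give up on other 4xx."""
    return status == TOO_MANY_CONNECTIONS or status >= 500
```

the point being: a 429 here is a liveness signal about *another* process, not about load on ours.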
a year ago
i can see very spotty metrics during that time, indicating it was crash looping
a year ago
but i also see only a single deployment running during that time, at least for the given service and environment id
yes – this is something I architected on purpose: it occasionally fails deliberately to ride out rate limits
a year ago
will do
a year ago
same time stamp?
Dec 17 8:48PM - pushed first commit to fix the issue
Dec 17 8:52PM - pushed the "last fix" 🙂
the previous container kept running for at least another 4 minutes (if 8:55PM is the actual start time of the "last fix" deployment container) (i'm in GMT+2)

so the previous deployment kept running for 4 minutes until stopping, although a new one appeared
why did it take so long to stop? – we also have "Always" as the restart policy here
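for reference – i believe the restart policy lives in Railway's config-as-code (`railway.json` in the repo root); assuming that schema, ours is roughly:

```json
{
  "deploy": {
    "restartPolicyType": "ALWAYS"
  }
}
```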
a year ago
utc timestamps please
it just looks as if there's a race condition between restarting a container and removing it
a year ago
sorry that i have to say this, im not a robot, please give me human-readable timestamps
okay, which kind of timestamps are you talking about? UNIX ones?
a year ago
human readable, like the ones in the screenshot logs, but UTC please
those are UTC, i subtracted the local time difference, which is two hours, as you can see in the screenshot
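to spell out the conversion i'm doing (a sketch; the log screenshots are in local GMT+2):

```python
from datetime import datetime, timedelta, timezone

LOCAL_TZ = timezone(timedelta(hours=2))  # GMT+2, as shown in the screenshots

def to_utc(local: datetime) -> datetime:
    """Attach the GMT+2 offset to a naive local time and convert to UTC."""
    return local.replace(tzinfo=LOCAL_TZ).astimezone(timezone.utc)

# "Dec 17 8:52 PM" local time becomes 18:52 UTC
print(to_utc(datetime(2023, 12, 17, 20, 52)).strftime("%b %d %H:%M UTC"))
```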
a year ago
what service is this in what environment
a year ago
okay i think i see the issue here
a year ago
you had [this deploy]() working and online
and then you pushed bad code multiple times; those new pushes never passed their health checks, so the working deploy was never taken offline.
this is by design, so the system is working properly here.
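assuming your services use Railway's config-as-code, the health check that gates the rollover is set roughly like this (the path is illustrative):

```json
{
  "deploy": {
    "healthcheckPath": "/health"
  }
}
```

once a new deploy passes that check, the old one is taken down; until then, the old one keeps serving.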
a year ago
and yeah for the first service, I see the metrics for one deployment end, and the metrics for a new deployment start, but they do not overlap
a year ago
!s
Status changed to Solved brody • about 1 year ago

