a month ago
As of yesterday, we started seeing a high volume of 502 errors within a service of ours that hadn't had any updates pushed to it since the day prior.
Although it's a Flask web app with a relatively high write volume to a Postgres instance that I'm sure we could tune a little better, the recent 502s don't match the errors we've seen in the past (i.e., we used to only see issues at hundreds to thousands of req/sec).
However, starting around 10:30 AM EST today, there was a very high rate of 502 errors in our deployment at a much lower request rate (tens of requests per minute).
Scanning through the logs, we saw that all the failing requests had an upstreamErrors field: "upstreamErrors": "[{\"deploymentInstanceID\":\"4c305be8-708d-420b-ad27-27f513fd40d3\",\"duration\":5000,\"error\":\"connection dial timeout\"},{\"deploymentInstanceID\":\"bb638076-d025-4d35-a0d9-1c9bc612f301\",\"duration\":0,\"error\":\"body read after close\"}]".
For reference, this instance has 2 replicas.
I would appreciate any support in interpreting this error. I'm confused as to why deployment 4c305... is always the one with the 'connection dial timeout' error.
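For readability, that escaped upstreamErrors value decodes into one entry per replica. A quick Python sketch of the decode, using the exact values from the log line above (the duration unit is our assumption, presumably milliseconds):

import json

# Value of the "upstreamErrors" field copied verbatim from one of the 502 log lines above.
# It is itself a JSON-encoded string, so it needs its own json.loads() pass.
raw = '[{"deploymentInstanceID":"4c305be8-708d-420b-ad27-27f513fd40d3","duration":5000,"error":"connection dial timeout"},{"deploymentInstanceID":"bb638076-d025-4d35-a0d9-1c9bc612f301","duration":0,"error":"body read after close"}]'

for upstream in json.loads(raw):
    # duration looks like milliseconds: 5000 reads as a 5s dial timeout against the first replica.
    print(upstream["deploymentInstanceID"], upstream["duration"], upstream["error"])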
I'd also like to flag that this deployment was one done automatically by Railway, and that yesterday morning we experienced a similar issue where a BullMQ worker of ours redeployed and was then unable to process any jobs until we redeployed it.
This is somewhat anecdotal, but we redeployed this web service at around 10:31-10:32 and have seen no 502s in the 20 minutes since. In contrast, there were probably 50-100 502s in the 20 minutes leading up to the redeploy.
9 Replies
a month ago
As an update, we haven't observed any 502s since redeploying.
a month ago
Thanks for the update.
We had a networking issue where a host that wasn't ready to serve traffic got into rotation. This likely also caused the BullMQ worker issue you saw the day before. We've since removed the host and things should be good to go. Let us know if you see any further issues!
Status changed to Awaiting User Response Railway • 30 days ago
a month ago
Is there something we should do going forward to mitigate the risk of this impacting production users? Also, can Railway not redeploy (or at least alert us) when you rotate in a host that isn't functional? This caused intermittent outages for production users across multiple services, and it was only fixed once we woke up to 100+ support requests and redeployed.
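On our end, the only mitigation we've come up with so far is exposing a healthcheck endpoint that a deploy could gate on, so a replica only takes traffic once it can actually reach Postgres. A rough Flask sketch of what we have in mind (the /health route name and the SELECT 1 probe are just our own assumptions, not anything Railway prescribes):

import os

from flask import Flask
from sqlalchemy import create_engine, text

app = Flask(__name__)
# pool_pre_ping discards dead connections before handing them to a request.
engine = create_engine(os.environ["DATABASE_URL"], pool_pre_ping=True)

@app.route("/health")
def health():
    try:
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))
        return {"status": "ok"}, 200
    except Exception:
        # If Postgres is unreachable, report this replica as not ready to serve.
        return {"status": "unhealthy"}, 503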
Status changed to Awaiting Railway Response Railway • 30 days ago
a month ago
Completely understand, and I agree on notifying. We have a project on our board to improve how we notify affected users about sudden events like this.
In this instance it was unexpected and should not happen again.
Status changed to Awaiting User Response Railway • 30 days ago
a month ago
Hey, just wanted to flag that this happened again to us on a Railway auto-deploy. A non-essential (thankfully) API of ours experienced something very similar: it was working fine and then suddenly began seeing a 100% error rate on the morning of Feb 8. Redeploying immediately restored traffic to normal.
Attachments
Status changed to Awaiting Railway Response Railway • 27 days ago
a month ago
Hello!
We're acknowledging your issue and attaching a ticket to this thread.
We don't have an ETA for it, but our engineering team will take a look and you will be updated as we update the ticket.
Please reply to this thread if you have any questions!
a month ago
Thanks for flagging the Feb 8 recurrence. We've opened a ticket with our platform team to investigate why this happened again after the initial fix. We'll follow up once we have more information.
Status changed to Awaiting User Response Railway • 26 days ago
18 days ago
Flagging another instance of this, except this time there was no "Autodeployed by Railway" notice. The error pattern is exactly the same: random mass 502s starting in the middle of the night (EST). Again, a simple redeploy fixed it.
Please investigate this. We love Railway for its DX, but it's becoming less and less tenable to host services with high-availability requirements.
Attachments
Status changed to Awaiting Railway Response Railway • 18 days ago
17 days ago
Hello,
An incident we reported earlier caused this issue, and we have since resolved it.
https://status.railway.com/cmls3jnha062369cv7ridunsf
We are deeply sorry about the inconvenience this caused, and we are already working on improvements that would have mitigated this issue entirely.
Best,
Brody
Status changed to Awaiting User Response Railway • 17 days ago
