19 days ago
Our health checks started failing during off hours (no deploys from our team). Digging deeper, we discovered that three of our services could no longer reach one of our other services. Investigating further, we saw that Railway had initiated a redeploy of the service that was no longer reachable. This was the same issue we encountered last week during the major incident/outage.
To resolve it, we redeployed the unreachable service and all three of the other services that were trying to talk to it (restarting them did not work).
Some additional debugging information:
- Generating an external domain for the unreachable service worked: we were able to use the external domain to talk to our service.
- Internal network communications were timing out, not returning "connection refused".
This makes me think there is some DNS action that isn't running when Railway initiates a redeploy. Is the internal domain still pointing to the old/spun down container?
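For anyone who wants to check the same distinction on their own services, here is a minimal sketch (my own illustration, not anything Railway provides) that probes a host and port and reports whether the failure was a DNS error, a timeout, or a refused connection. The function names `probe` and `classify_connect_error` are hypothetical, and the host and port you pass in are whatever your own internal service uses.

```python
import socket


def classify_connect_error(exc):
    """Map an exception from a TCP connect attempt to a short label."""
    if isinstance(exc, socket.gaierror):
        return "dns_failure"   # the name did not resolve at all
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"       # packets went somewhere that never answered
    if isinstance(exc, ConnectionRefusedError):
        return "refused"       # a host answered, but nothing is listening
    return "other"


def probe(host, port, timeout=3.0):
    """Resolve host, attempt one TCP connection, and report the outcome.

    Returns (label, addresses), where addresses are the IPs the name
    resolved to, so they can be compared before and after a redeploy.
    """
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return classify_connect_error(exc), []
    addresses = sorted({info[4][0] for info in infos})
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok", addresses
    except OSError as exc:
        return classify_connect_error(exc), addresses
```

Run from one of the calling services, a result of `timeout` together with a resolved address would be consistent with the name resolving to something that no longer answers, while `dns_failure` would mean the name did not resolve at all. This only narrows things down; it does not prove where the stale entry lives.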
Either way, for other folks who might be experiencing this, the TL;DR resolution is:
- Redeploy the services that are trying to talk to the service that Railway initiated the redeploy on.
5 Replies
19 days ago
Pretty much right when we solved the incident with the build issues, we experienced an internal network routing issue with some new hosts we have brought online. We have since fixed that and have seen no more 502s or other private network routing issues.
Incredibly sorry for the impact here.
Status changed to Awaiting User Response Railway • 19 days ago
brody
Pretty much right when we solved the incident with the build issues, we experienced an internal network routing issue with some new hosts we have brought online. We have since fixed that and have seen no more 502s or other private network routing issues. Incredibly sorry for the impact here.
19 days ago
Thanks for responding. This report is for an incident that happened tonight around 7:37 PM Pacific. It matched the same pattern we saw last week.
Status changed to Awaiting Railway Response Railway • 19 days ago
19 days ago
We are deeply sorry for the inconvenience caused tonight, and we already have additional monitoring for this put in place.
Status changed to Awaiting User Response Railway • 19 days ago
brody
We are deeply sorry for the inconvenience caused tonight, and we already have additional monitoring for this put in place.
19 days ago
Got it. Sorry, I thought you were referring to last week's incident, not the one that happened today.
Can you share why Railway is initiating redeploys on our services? What is the expected cadence of this so we can be aware?
Status changed to Awaiting Railway Response Railway • 19 days ago
19 days ago
That would be underlying host maintenance or, to be specific about the instances right now, host upgrades.
Status changed to Awaiting User Response Railway • 19 days ago
Status changed to Solved jasonfma • 19 days ago