a year ago
Hi Railway, our deployments are experiencing intermittent network issues when connecting to other services that are also deployed within Railway. It doesn't matter whether we call the other services through their public hostname or their private hostname. For the same request, we sometimes get a 200 and sometimes a 503. This is making our deployments highly unstable, and our monitoring has reported numerous downtimes because of it. I took screenshots of our resource usage, and it's very low. Here are our projects:
https://railway.app/project/56943544-81c5-486d-8741-d8cca3f88ed1
https://railway.app/project/2f705f5a-06d6-45ac-871e-f5b0a7690fa7
https://railway.app/project/2119123b-28c6-4ba9-97e5-75be1b52dcc1
https://railway.app/project/c9986c4d-72c9-46ad-9186-160d3e9c1d44
Could this be related to the new TCP proxy upgrade you guys did recently? Any help is appreciated, thank you.
6 Replies
a year ago
i too got random 503s calling other railway services from within railway earlier today
a year ago
getting the team involved
a year ago
Hey there @Tista - the infra team is conducting an investigation, can you provide some timestamps for us to narrow the problem down?
a year ago
Hi, I have a frontend on Vercel, and my backend hosted on Railway is also throwing a 503. It just says "application failed to respond" and I do not see anything in the logs!
a year ago
Can you also provide additional information like timestamps and project-ids?
a year ago
@angelo all my requests are 503 so the server is completely down.
Last successful request I see is from 10:38.
I do not have a health endpoint configured so I am not sure when it actually started giving the error after that.
Is it possible to dm the project id?
Hey Angelo, thank you for replying. I'm on UTC+4. We've been experiencing this since 4:54AM this morning (March 8) and it's still ongoing, we're still receiving random 503s. We'd appreciate hearing the root cause once your investigation is done, thank you.
a year ago
Yep, an update: we have found the source of the affected resources. Can you trigger a redeploy for your services? This will land your workload on a different resource.
a year ago
Checking in.
a year ago
@angelo
Redeploy failed for me
Container failed to start
=========================
We failed to create a container for this image.
a year ago
@angelo Second redeploy fixed it. Do we know what the issue was?
a year ago
Going to leave the final investigation to the Infra team as they address and fix the issue. Glad it's resolved for you for now.
We've restarted deployments in all of our projects, still monitoring for 503s
a year ago
have you restarted the deployments that your services are making requests to?
I've got a question related to this issue - when was the underlying issue introduced?
We experienced the same issue yesterday at 7PM UTC, and after restarting the service that was unavailable, it worked fine for a couple of hours. Today we're experiencing similar issues, as described above, and again a redeployment worked for the time being.
a year ago
according to my logs the first error appeared at 2024-03-07T07:50:01.818978082Z
aka March 7th 7:50AM UTC
a year ago
hey @Tista @andrzej | t2 the incident has now been resolved, you can read about the reasoning here https://discord.com/channels/713503345364697088/846875565357006878/1215585864286081034
if you are still experiencing this issue please do another set of redeploys.
Yeah, I’ve read it, thanks for the quick turnaround. I can confirm we’re no longer getting 503s.
a year ago
happy to hear that!