Service not waking from sleep + HTTP 502 Bad Gateway

lawrencejob

PROOP

2 years ago

Hi. I appreciate Railway a lot, as it has helped me grow my next startup to a proof of concept very quickly. I am now preparing it to scale for launch and have run into a few issues, one of which is critical.

My facade service slept a few weeks ago. Since then, any requests to it received 502 Bad Gateway, which typically happens on cold starts. However, this time it did not wake on request. Requests to the server were all met with 502s.

requestId: "ualV3KotSeKLKSeBISODuQ_3165824431"
timestamp: "2024-10-04 19:43:29.023194452"
method: "GET"
path: "/"
host: "uk.api.departures.app"
httpStatus: 502
upstreamProto: "HTTP/1.1"
downstreamProto: "HTTP/2.0"
responseDetails: "failed to forward request to upstream: connection dial timeout"
totalDuration: 5000
upstreamAddress: "http://[fd12:8321:87ae::52:a0c8:ffc6]:8080"
clientUa: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36"
upstreamRqDuration: 5000
txBytes: 4690
rxBytes: 698
srcIp: "66.232.62.79"
edgeRegion: "us-east4"

However, not all of them are being logged to the HTTP logs. I probably made ~50 requests over the last 6 hours, but none of them yielded a response and only three of them were stored in the HTTP logs.

Prior requests (before today) were met with 499 and also failed to start, although that was alpha traffic so I have no device-end diagnostics for those errors. The service was sleeping during the 499 errors.

499: "client has closed the request before the server could send a response"

I originally tried not to do anything, to preserve the reproduction of the problem, but as part of diagnosing I eventually was forced to redeploy the service, which fixed the issue.

Incidentally, at the exact time I was seeing this issue, a separate but related issue was posted in the forum.

For detail, the app is very basic, written in Go, and is a very undemanding facade API.

All of this leads me, naively, to believe there was an issue with the routing/ingress configuration. Either that, or the dashboard was showing 'Sleeping' (in the yellow box) when in actual fact the service was in a different state. Is there anything further I can do to help report this bug? Is it a common issue?

Thanks in advance.

9 Replies

brody

EMPLOYEE

2 years ago

Hello,

Can you go ahead and redeploy the service?

lawrencejob

PROOP

2 years ago

Hi - I did redeploy the service, which did fix it for now. I'm not sure how it happened or how to prevent it from happening again. Is it linked to an infrastructure change?

brody

EMPLOYEE

2 years ago

Perhaps it wasn't awoken for 2 weeks and thus the image was removed as Railway only keeps the image for 2 weeks.

brody

EMPLOYEE

2 years ago

As for the 499s, that means the client closed the connection.

lawrencejob

PROOP

2 years ago

Hi - you are absolutely right - it was sleeping for 2 weeks. I didn't intuit that the image retention would apply to sleeping-but-deployed services. I will have to turn off sleeping for my services. Is the two weeks from the time it went into sleep or from the time it was last deployed?

As for the 499, that is correct. I wonder why it would show as the client closing the connection if the service was sleeping?

Edit: thank you for your quick help, by the way. I really appreciate it.

brody

EMPLOYEE

2 years ago

Time last deployed.

I'm not sure why it would show that; possibly something odd with the client.

lawrencejob

PROOP

2 years ago

Thanks so much. One last question if you don't mind so that I am clear: if I deploy my app to prod and it runs for, say, a month, but it sleeps for 5 minutes, it will never restart?

(You're right - I think 499 is what happened when an iOS device accessed the sleeping API. I will definitely investigate that independently...)

If we can't use the sleep feature in prod, at least I don't have to solve the intermittent 502-while-waking bug any more.

brody

EMPLOYEE

2 years ago

I'm actually not sure about that question, sorry, but you absolutely should not use sleeping in prod even if you didn't have to contend with the 2 week issue or the 502s on cold boot.

P.S. 502s on cold boot will be fixed.

brody

EMPLOYEE

2 years ago

Hello again!

We've resolved an issue where apps with longer startup times were showing 502 errors. Apps now have up to 10 seconds to start accepting traffic, thus preventing these error pages from appearing.

You will need to trigger a deployment so that the changes we have made take effect.

Welcome!