
Flask app hanging, timing out due to PostgreSQL connection

vlatanHOBBY

5 months ago

I had this problem where, on deployment, the app had to run some preliminary queries against the database, but it was crashing 2 out of 10 times on startup because it could not establish a connection to the DB at all, due to an operational error (device not known or something). The app starts but there's no DB on the designated host and port. Internal host, by the way.

But not only that, even if the app started it would randomly stop working (not crashing) and start timing out all of the requests. In the DB logs I would see:

"unexpected EOF on client connection"
"could not receive data from client: Connection reset by peer"
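These two log lines usually mean the client side of the TCP connection went away without a clean shutdown, for example because an idle connection was dropped somewhere between the app and the DB. One thing worth trying is TCP keepalives plus a connect timeout, which libpq accepts as parameters in the connection URL. A minimal sketch, assuming the app connects through a libpq-based driver and a `postgresql://` URL; the helper name, the host `db.internal` and the specific values are only illustrations:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# libpq connection parameters; the values are guesses to tune, not recommendations.
LIBPQ_DEFAULTS = {
    "connect_timeout": "5",       # give up on a connection attempt after 5 s
    "keepalives": "1",            # enable TCP keepalives on the client socket
    "keepalives_idle": "30",      # seconds of idleness before the first probe
    "keepalives_interval": "10",  # seconds between probes
    "keepalives_count": "3",      # failed probes before the socket is closed
}


def with_libpq_params(url, **overrides):
    """Return `url` with keepalive/timeout parameters added to its query string.

    Parameters already present in the URL win over the defaults above;
    keyword overrides win over both.
    """
    parts = urlsplit(url)
    params = dict(LIBPQ_DEFAULTS)
    params.update(parse_qsl(parts.query))
    params.update({key: str(value) for key, value in overrides.items()})
    return urlunsplit(parts._replace(query=urlencode(params)))


if __name__ == "__main__":
    # Hypothetical URL, shown only to illustrate the output shape.
    print(with_libpq_params("postgresql://user:pw@db.internal:5432/app"))
```

With these set, a dead connection should surface as an exception in the app after a bounded time instead of a request that hangs with nothing in the logs.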

I read somewhere in the Railway docs that I should sleep the deployment for 3 seconds (in the deployment command) so that the Railway network has time to sort out the connection to the DB (PostgreSQL in this case). I did that, but the problem persisted intermittently. So I removed the preliminary DB querying to avoid the app crashing on startup.
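An alternative to a fixed sleep is to retry the first connection with a growing delay, so the startup queries can stay in place. A rough sketch, not tied to any driver: `connect` is any zero-argument callable that opens the connection (hypothetical here), and the attempt count and delays are arbitrary.

```python
import logging
import time

log = logging.getLogger("startup")


def connect_with_retry(connect, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call connect() until it succeeds, doubling the wait after each failure.

    Re-raises the last error once `attempts` tries have failed. Catching
    Exception is deliberately broad for the sketch; in real code, narrow it
    to the driver's connection error.
    """
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except Exception as exc:
            if attempt == attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            log.warning(
                "DB connect attempt %d/%d failed: %r; retrying in %.1fs",
                attempt, attempts, exc, delay,
            )
            sleep(delay)
```

Each failed attempt is logged, so the deploy logs would show how often the DB is unreachable at startup and for how long.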

However, I am now in this weird situation where debugging is impossible. The app deploys fine but then randomly (after days or even weeks) starts timing out on all paths, and in the DB logs I would see:

"unexpected EOF on client connection"
"could not receive data from client: Connection reset by peer"

The app does not produce any errors in the deployment logs and does not crash, but just returns bad HTTP responses (499 - client has closed the request before the server could send a response, or 502 - failed to forward request to upstream: connection dial timeout).

I would have expected to see some warnings, errors or exceptions in the deployment logs, but there's nothing. The deployment logs just stop with the last known 200 response from the app, while the HTTP logs continue with bad responses. Also the timing out happens on ALL URL paths, which of course ALL require a trip to the DB.

The debugging seems impossible because the problem is not reproducible and it's intermittent. I could be wrong, but I am pretty sure there's something with the DB connectivity.

Closed

3 Replies

brodyEMPLOYEE

5 months ago

This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.

Status changed to Open brody • 5 months ago

brodyEMPLOYEE

5 months ago

> I read somewhere in Railway docs that I should sleep the deployment for 3 seconds

That is outdated info for a Legacy system that you are not using.

> The app does not produce any errors in the deployment logs and does not crash
> I would have expected to see some warnings, errors or exceptions in the deployment logs, but there's nothing

If there is nothing in your logs, it means you're not logging any errors. This is something you are responsible for, not Railway.

> I could be wrong, but I am pretty sure there's something with the DB connectivity.

This is an application level issue, not anything to do with the platform.

Going forward I highly recommend you add tracing to your code so you can figure out exactly what piece of your code is causing the slowdown.
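As a starting point for that kind of tracing, something as small as a timing context manager around each DB call may be enough. This is only a sketch using the standard library; the logger name `trace` and the step labels are made up for illustration.

```python
import logging
import time
from contextlib import contextmanager

log = logging.getLogger("trace")


@contextmanager
def traced(step):
    """Log when `step` starts and when it finishes or fails, with its duration."""
    start = time.monotonic()
    log.info("start %s", step)
    try:
        yield
    except Exception:
        # log.exception records the traceback along with the message.
        log.exception("failed %s after %.3fs", step, time.monotonic() - start)
        raise
    else:
        log.info("done %s in %.3fs", step, time.monotonic() - start)
```

Wrapping a query as `with traced("load user"): ...` means a hang shows up as a `start` line with no matching `done` line, which narrows down where the request stopped. The logger has to be at INFO level for these lines to appear.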

Additionally, please connect to redis over the private network.

vlatanHOBBY

5 months ago

> That is outdated info for a Legacy system that you are not using.

Yes, I know that, but even after upgrading to the V2 or whatever it was called I still had the problem, so I left the sleep time in, just in case.

> If there is nothing in your logs, it means you're not logging any errors, this is something you are responsible for, not Railway.

You don't say? I certainly log the errors, but for some reason this particular error doesn't show up in the deploy logs. The request just hangs for a very long time and that's it.

> This is an application level issue, not anything to do with the platform.

We'll see about that, if I don't migrate somewhere else in the meantime. It could be a bug in the code, it could be the network.

> Going forward I highly recommend you add tracing to your code so you can figure out exactly what piece of your code is causing the slowdown.

Yeah, that will be painful since the problem is intermittent, it might not show up for weeks. Plus there's no hint to where to trace. But I'll figure something out to pinpoint this behaviour.
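One way to pinpoint it without knowing where to look in advance might be a watchdog that dumps the Python stack of any request that has been running too long. The sketch below is a plain WSGI middleware using only the standard library. It assumes thread-based workers (it would not help under gevent-style workers), and the class name, threshold and interval are made up for illustration.

```python
import logging
import sys
import threading
import time
import traceback

log = logging.getLogger("watchdog")


class SlowRequestWatchdog:
    """WSGI middleware that logs the stack of requests running past `threshold` seconds."""

    def __init__(self, app, threshold=10.0, interval=1.0):
        self.app = app
        self.threshold = threshold
        self.interval = interval
        self._inflight = {}  # thread id -> (start time, request path)
        self._lock = threading.Lock()
        threading.Thread(target=self._watch, daemon=True).start()

    def __call__(self, environ, start_response):
        tid = threading.get_ident()
        with self._lock:
            self._inflight[tid] = (time.monotonic(), environ.get("PATH_INFO", "?"))
        try:
            return self.app(environ, start_response)
        finally:
            with self._lock:
                self._inflight.pop(tid, None)

    def _watch(self):
        reported = set()
        while True:
            time.sleep(self.interval)
            self._report_stuck(reported)

    def _report_stuck(self, reported):
        now = time.monotonic()
        frames = sys._current_frames()  # thread id -> current frame
        with self._lock:
            inflight = dict(self._inflight)
        # Forget requests that have finished so the set does not grow forever.
        reported.intersection_update(
            (tid, started) for tid, (started, _) in inflight.items()
        )
        for tid, (started, path) in inflight.items():
            key = (tid, started)
            if now - started < self.threshold or key in reported:
                continue
            reported.add(key)  # log each stuck request only once
            frame = frames.get(tid)
            stack = "".join(traceback.format_stack(frame)) if frame else "<no frame>"
            log.error("request %s stuck for %.1fs\n%s", path, now - started, stack)
```

In a Flask app this would be attached with `app.wsgi_app = SlowRequestWatchdog(app.wsgi_app)`. When the hang eventually happens, the deploy logs should then contain the exact line each stuck request is blocked on, even if that is weeks from now.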

brodyEMPLOYEE

5 months ago

I'm really sorry but we are unable to provide support for this issue as this would be an application level issue and not a platform level issue.

Status changed to Closed brody • 5 months ago