Database service locked up

6 months ago

I'm having an issue where my database is simply unreachable. I've tried redeploying and restarting it, and nothing happens. I can see the logs and it's still up.

this is causing downtime for us

link: https://railway.com/project/357a7a66-a372-47f0-b2ae-2d8e2b6f1f32/service/728a6fbb-44b4-4b44-8bdb-c2566eb0d9d0?environmentId=86853589-64b5-48b3-9eda-2174a7ce26b2&id=6f58bd95-b6f3-4ef3-95fe-bc33626b1f9e#deploy

Solved

66 Replies

6 months ago

Unreachable over the private or public network?


6 months ago

over private networking, it seems, but weirdly enough I can reach it over Tailscale


6 months ago

does this help?

[attachment]


6 months ago

What is the source service that is trying to access the database?



6 months ago

Can the data tab access it if you add a TCP proxy?


6 months ago

Yeah, I'm able to


6 months ago

moving it over to us-west also didn't make any difference, health checks don't even go through


6 months ago

I'm not seeing other reports, and the database and backend aren't using the beta IPv4 networking, so I'm not sure of the issue.

I'm also not seeing any errors in the logs besides the failing health check?


6 months ago

let me try railway ssh


6 months ago

psql is able to connect, yeah might be our fault


6 months ago

will investigate more


6 months ago

psql over the private network?


6 months ago

did ssh into the service container, installed psql there and a connection was made
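
For anyone wanting to repeat this kind of check from inside a service container without installing psql, here is a minimal sketch using only the Python standard library. It only tests whether a TCP connection to the database port opens, so it says nothing about authentication or query health. The function name `can_connect` and the hostname in the usage comment are hypothetical, not something from this thread.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example usage (hypothetical private hostname, substitute your own):
# print(can_connect("postgres.railway.internal", 5432))
```

If this returns False from the app container while the same check succeeds through a TCP proxy or another network path, that points at the private network path rather than at Postgres itself.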


6 months ago

just weird that we're getting these errors from the database

[attachment]


6 months ago

even though no deploy was made and the Postgres metrics are normal


6 months ago

What is your timeout set to?


6 months ago

whatever typeorm uses by default


6 months ago

i'll try increasing it but doubt it's that


6 months ago

even satellite services, with totally different source code than ours, are also unable to connect to our database


ghaithzamrik
PRO

6 months ago

I don't know if it's related, but I'm seeing something similar. One of my services stopped working, and when I restart it the deploy fails on the health check. It looks like it might be unable to connect to the PG DB that I have running, but I can connect to it over the public network (maybe a private network issue?). Nothing has changed in the service in the last few days: no new deployments, no changes. Any support would be appreciated.


6 months ago

same here, still unable to debug


6 months ago

seems like some connections go through


6 months ago

now the problem is also affecting our other project, completely unrelated


ghaithzamrik
PRO

6 months ago

One strange thing I noticed is that the "Architecture" UI for a PG DB usually shows how much of the DB storage is used. It does for the project that I still have running fine, but no longer does for the one that is having the problem.
See the difference in the screenshots.

[two screenshots attached]


6 months ago

ohh same here


6 months ago

wish I could detach the volume and re-attach it to another service


6 months ago

tried to do a backup and restore from it, still having issues


6 months ago

all of our major providers are still up with no issues whatsoever


godiexk
PRO

6 months ago

I have the same problem. I can access it internally from my Node app, but it's inaccessible from an external app. It's not possible to access it from DBeaver or a Java connection.


6 months ago

can anyone from the Railway team confirm that they're looking into it? would keep me calm


godiexk
PRO

6 months ago

HELP!! Railway team, connection not found


6 months ago

dumping the database and restoring it into another service solved my issue for one of my projects
volume size appears ok without any problems


6 months ago

When did you all first see errors?


6 months ago

14:30-14:50 Brazilian time


6 months ago

my only issue now is with this database:


6 months ago

I gotta start asking for timestamps in UTC


6 months ago

in your timezone:


6 months ago

Please provide a direct link to your database.



6 months ago

I'm sorry but that's not quite what I asked for, please provide the URL of your browser's omni bar while opened to the database.


Railway
BOT

6 months ago

Hello!

We're acknowledging your issue and attaching a ticket to this thread.

We don't have an ETA for it, but, our engineering team will take a look and you will be updated as we update the ticket.

Please reply to this thread if you have any questions!


6 months ago

I've raised this to the infra team.



godiexk
PRO

6 months ago

Please, my job depends on this, I have clients working who can't use the service.


6 months ago

anyone who depends heavily on their service: do a pg_dump and pg_restore into another service, I'm in the middle of doing it for another project of ours.


6 months ago

also, use an ubuntu container and railway ssh for a faster dump
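
A rough sketch of that dump-and-restore workaround, assuming the PostgreSQL client tools are installed in the container you are working from. It only builds the command lines so they can be reviewed before anything runs. The function names and the connection URLs in the usage comment are hypothetical placeholders, and the flags shown (`--format=custom`, `--no-owner`) are one reasonable choice rather than what anyone in this thread actually ran.

```python
import shlex


def build_dump_command(source_url: str, dump_path: str) -> list[str]:
    """Build a pg_dump command that writes a custom-format archive to dump_path."""
    return [
        "pg_dump",
        "--format=custom",      # archive format that pg_restore can read
        "--no-owner",           # skip ownership so it restores under a different role
        f"--file={dump_path}",
        f"--dbname={source_url}",
    ]


def build_restore_command(target_url: str, dump_path: str) -> list[str]:
    """Build a pg_restore command that loads dump_path into the target database."""
    return [
        "pg_restore",
        "--no-owner",
        f"--dbname={target_url}",
        dump_path,
    ]


# Example usage (placeholder URLs, substitute your own):
# dump = build_dump_command("postgresql://user:pass@old-host:5432/app", "/tmp/app.dump")
# print(shlex.join(dump))
# To actually run it: subprocess.run(dump, check=True)
```

Running the dump from a container inside the same project keeps the transfer on the internal network, which is why it tends to be faster than dumping from a laptop.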


godiexk
PRO

6 months ago

How do you connect? I can't connect.


6 months ago

just did a pg_dump and pg_restore for both of our databases and they're back up again, feel free to do anything to those services (well, as long as you don't delete them)


6 months ago

make sure to increase your connections count to a really high value and then try to connect


6 months ago

our connections were piling up, so we were getting "too many clients" errors


6 months ago

We are actively looking into the cause.


6 months ago

and obviously, run a railway backup just to be sure


godiexk
PRO

6 months ago

it already works!! thanks


6 months ago

Hi, can I know what happened?


6 months ago

A host's networking locked up.


Railway
BOT

6 months ago

✅ The ticket Database performance issue has been marked as completed.


6 months ago

great to know, would a high availability pg cluster prevent that from happening in the future or was that happening on the service itself? looking for ways to prevent that from happening again.


6 months ago

Unlikely, since something could go wrong with the pooler service, there's still a single point of failure.


6 months ago

there's probably some way to replicate that. for the service, would replicas do the trick? i don't know if they're deployed to the same host


6 months ago

They are not deployed on the same host, but then your own code would have to handle fallback to another pooler if one isn't available, since we don't handle that on the private network
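
As a minimal sketch of what "handle fallback in your own code" could look like: try each pooler endpoint in order and use the first one that accepts a connection. Everything here is an assumption for illustration. `connect_with_fallback` and the endpoint names are hypothetical, and the `connect` callable stands in for whatever your database driver provides. Real drivers often raise their own exception types rather than `OSError`, so the `except` clause would need adjusting.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def connect_with_fallback(endpoints: Sequence[str], connect: Callable[[str], T]) -> T:
    """Try each endpoint in order and return the first successful connection."""
    errors: list[str] = []
    for endpoint in endpoints:
        try:
            return connect(endpoint)
        except OSError as exc:
            # Record the failure and move on to the next endpoint.
            errors.append(f"{endpoint}: {exc}")
    raise ConnectionError("all endpoints failed: " + "; ".join(errors))


# Example usage (hypothetical endpoint names and driver function):
# conn = connect_with_fallback(["pooler-a.railway.internal", "pooler-b.railway.internal"], my_driver_connect)
```

This only covers the initial connection. A connection that dies mid-request would still need retry handling on top.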


6 months ago

probably i would also need an API gateway to automatically fail over in case a service replica goes down, damn HA is hard 💀


6 months ago

fair enough, will look into ways, thanks brody


6 months ago

thread can be closed


6 months ago

No problem!


6 months ago

!s


Status changed to Solved brody 6 months ago

