Database service locked up
passos
MODERATOROP

3 months ago

My database is simply unreachable. I've tried a redeploy and a restart, and nothing changes. I can see the logs, and the database is still up.

This is causing downtime for us.

link: https://railway.com/project/357a7a66-a372-47f0-b2ae-2d8e2b6f1f32/service/728a6fbb-44b4-4b44-8bdb-c2566eb0d9d0?environmentId=86853589-64b5-48b3-9eda-2174a7ce26b2&id=6f58bd95-b6f3-4ef3-95fe-bc33626b1f9e#deploy

Solved

9 Replies

brody
EMPLOYEE

3 months ago

Unreachable over the private or public network?


passos
MODERATOROP

3 months ago

Over private networking, it seems, but weirdly enough I can reach it over Tailscale.


passos
MODERATOROP

3 months ago

does this help?

[attachment]


brody
EMPLOYEE

3 months ago

What is the source service that is trying to access the database?



brody
EMPLOYEE

3 months ago

Can the data tab access it if you add a TCP proxy?


passos
MODERATOROP

3 months ago

Yeah, I'm able to.


passos
MODERATOROP

3 months ago

moving it over to us-west also didn't make any difference, healthchecks don't even go through


brody
EMPLOYEE

3 months ago

I'm not seeing other reports, and the database and backend aren't using the beta IPv4 networking, so I'm not sure of the issue.

I'm also not seeing any errors in the logs besides the failing health check?


passos
MODERATOROP

3 months ago

let me try railway ssh


passos
MODERATOROP

3 months ago

psql is able to connect, yeah might be our fault


passos
MODERATOROP

3 months ago

will investigate more


brody
EMPLOYEE

3 months ago

psql over the private network?


passos
MODERATOROP

3 months ago

did ssh into the service container, installed psql there and a connection was made


passos
MODERATOROP

3 months ago

just weird that we're getting these errors from the database

[attachment]


passos
MODERATOROP

3 months ago

even though no deploy was made and the postgres metrics are normal


brody
EMPLOYEE

3 months ago

What is your timeout set to?


passos
MODERATOROP

3 months ago

whatever typeorm uses by default


passos
MODERATOROP

3 months ago

I'll try increasing it, but I doubt it's that.
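
For reference, a minimal sketch of where those timeouts could be set explicitly instead of relying on defaults. It assumes TypeORM's Postgres data source options (`connectTimeoutMS`, plus `extra` being passed through to the node-postgres pool). `buildDbOptions` and the specific values are hypothetical, so check the key names against your installed version.

```typescript
// Hypothetical helper: returns a plain options object intended to be passed
// to TypeORM's `new DataSource(...)`. Key names are assumptions based on
// TypeORM's Postgres options and node-postgres pool settings.
function buildDbOptions(url: string) {
  return {
    type: "postgres" as const,
    url,
    // Give up on the initial connection attempt after 10 seconds.
    connectTimeoutMS: 10_000,
    extra: {
      // Forwarded to the node-postgres pool (assumption).
      max: 10, // cap on pooled connections per process
      idleTimeoutMillis: 30_000, // close clients that sit idle for 30 seconds
      connectionTimeoutMillis: 10_000, // wait at most 10 seconds for a client
    },
  };
}
```

Setting the values explicitly makes it easier to tell a slow connection apart from one that never completes.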


passos
MODERATOROP

3 months ago

even satellite services, with totally different source code than ours, are also unable to connect to our database


ghaithzamrik
HOBBY

3 months ago

I don't know if it's related, but I'm seeing something similar. One of my services stopped working, and when I restart it the deploy fails on the health check. It looks like it might be unable to connect to the PG database I have running, but I can connect to it over the public network (maybe a private network issue?). Nothing has changed in the service in the last few days: no new deployments, no changes. Any support would be appreciated.


passos
MODERATOROP

3 months ago

same here, still unable to debug


passos
MODERATOROP

3 months ago

seems like some connections go through


passos
MODERATOROP

3 months ago

now the problem is also affecting our other project, completely unrelated


ghaithzamrik
HOBBY

3 months ago

One strange thing I noticed: the "Architecture" UI for a PG DB usually shows how much of the DB storage is used. It still does for the project that is running fine, but no longer does for the one that is having the problem.
See the difference in the screenshots.

[attachment: screenshot]
[attachment: screenshot]


passos
MODERATOROP

3 months ago

ohh same here


passos
MODERATOROP

3 months ago

wish I could detach the volume and re-attach it to another service


passos
MODERATOROP

3 months ago

tried to do a backup and restore from it, still having issues


passos
MODERATOROP

3 months ago

all of our major providers are still up, with no issues whatsoever


godiexk
PRO

3 months ago

I have the same problem. I can access it internally from my Node app, but it's inaccessible from an external app. It's not possible to access it from DBeaver or a Java connection.


passos
MODERATOROP

3 months ago

can anyone from the Railway team confirm that they're looking into it? would keep me calm


godiexk
PRO

3 months ago

HELP!! Railway team, connection not found


passos
MODERATOROP

3 months ago

Dumping the database and restoring it into another service solved the issue for one of my projects.
The volume size appears OK, without any problems.


brody
EMPLOYEE

3 months ago

When did you all first see errors?


passos
MODERATOROP

3 months ago

14:30-14:50 Brazilian time


passos
MODERATOROP

3 months ago

my only issue now is with this database:


brody
EMPLOYEE

3 months ago

I gotta start asking for timestamps in UTC


passos
MODERATOROP

3 months ago

in your timezone:


brody
EMPLOYEE

3 months ago

Please provide a direct link to your database.



brody
EMPLOYEE

3 months ago

I'm sorry but that's not quite what I asked for, please provide the URL of your browser's omni bar while opened to the database.


Railway
BOT

3 months ago

Hello!

We're acknowledging your issue and attaching a ticket to this thread.

We don't have an ETA for it, but, our engineering team will take a look and you will be updated as we update the ticket.

Please reply to this thread if you have any questions!


brody
EMPLOYEE

3 months ago

I've raised this to the infra team.



godiexk
PRO

3 months ago

Please, my job depends on this, I have clients working who can't use the service.


passos
MODERATOROP

3 months ago

For anyone who depends heavily on their service: do a pg_dump and pg_restore into another service. I'm in the middle of doing it for another project of ours.


passos
MODERATOROP

3 months ago

also, use an ubuntu container and railway ssh for a faster dump
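
A rough sketch of that dump-and-restore workaround, assuming `pg_dump` and `pg_restore` are installed in the container you SSH into. The function names, the dump file path, and the injected `run` callback are hypothetical. The flags themselves are standard PostgreSQL options.

```typescript
// Runs a command; injected so the sketch does not depend on a specific runtime API.
type Run = (command: string, args: string[]) => void;

// Arguments for a custom-format dump of the source database.
function pgDumpArgs(sourceUrl: string, file: string): string[] {
  return ["--format=custom", `--file=${file}`, `--dbname=${sourceUrl}`];
}

// Arguments for restoring that dump into the new (empty) target database.
function pgRestoreArgs(targetUrl: string, file: string): string[] {
  return ["--no-owner", `--dbname=${targetUrl}`, file];
}

// Dump first, then restore. If `run` throws during the dump,
// the restore is never started.
function migrate(sourceUrl: string, targetUrl: string, file: string, run: Run): void {
  run("pg_dump", pgDumpArgs(sourceUrl, file));
  run("pg_restore", pgRestoreArgs(targetUrl, file));
}
```

In Node, `run` could wrap `execFileSync(command, args, { stdio: "inherit" })` from `node:child_process`, which throws when the command exits with a non-zero code.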


godiexk
PRO

3 months ago

How do you connect? I can't connect.


passos
MODERATOROP

3 months ago

just did a pg_dump and pg_restore for both of our databases and they're back up again, feel free to do anything to those services (well, as long as you don't delete them)


passos
MODERATOROP

3 months ago

make sure to increase your connections count to a really high value and then try to connect


passos
MODERATOROP

3 months ago

our connections were piling up, and so we were getting "too many clients" errors


brody
EMPLOYEE

3 months ago

We are actively looking into the cause.


passos
MODERATOROP

3 months ago

and obviously, run a railway backup just to be sure


godiexk
PRO

3 months ago

it already works!! thanks


passos
MODERATOROP

2 months ago

Hi, can I know what happened?


brody
EMPLOYEE

2 months ago

A host's networking locked up.


Railway
BOT

2 months ago

✅ The ticket Database performance issue has been marked as completed.


passos
MODERATOROP

2 months ago

great to know, would a high availability pg cluster prevent that from happening in the future or was that happening on the service itself? looking for ways to prevent that from happening again.


brody
EMPLOYEE

2 months ago

Unlikely. Something could still go wrong with the pooler service, so there's still a single point of failure.


passos
MODERATOROP

2 months ago

There's probably some way to replicate that. For the service, would replicas do the trick? I don't know if they're deployed to the same host.


brody
EMPLOYEE

2 months ago

They are not deployed on the same host, but then your own code would have to handle fallback to another pooler if one isn't available, since we don't handle that on the private network
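
A minimal sketch of what that client-side fallback could look like, assuming the application knows the list of pooler endpoints in advance. `connectWithFallback` and the injected `connect` function are hypothetical names, not something Railway or a database driver provides.

```typescript
// Any function that opens a connection to one endpoint, e.g. a driver call.
type Connector<T> = (host: string) => Promise<T>;

// Try each endpoint in order and return the first connection that succeeds.
// If every endpoint fails, throw one error that lists each failure.
async function connectWithFallback<T>(hosts: string[], connect: Connector<T>): Promise<T> {
  const failures: string[] = [];
  for (const host of hosts) {
    try {
      return await connect(host);
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      failures.push(`${host}: ${reason}`);
    }
  }
  throw new Error(`all endpoints failed (${failures.join("; ")})`);
}
```

The `connect` function should apply its own connection timeout. Otherwise an endpoint that hangs instead of refusing the connection would block the fallback indefinitely.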


passos
MODERATOROP

2 months ago

probably i would also need an API gateway to automatically fail over in case a service replica goes down, damn HA is hard 💀


passos
MODERATOROP

2 months ago

fair enough, will look into ways, thanks brody


passos
MODERATOROP

2 months ago

thread can be closed


brody
EMPLOYEE

2 months ago

No problem!


brody
EMPLOYEE

2 months ago

!s


Status changed to Solved brody 3 months ago

