Sudden ECONNREFUSED on private networking

efstajas

PRO · OP

5 months ago

Three of our services have suddenly lost connectivity to their Redis and Postgres (PG) instances and gone down. There were no deploys or configuration changes on our end.

Service links (of some services that are failing due to ECONNREFUSED):

https://railway.com/project/56cafcfa-394c-46c9-a811-dc3207bad3dc/service/cb0fa421-8d68-4eee-bc51-1733799bea70/database?environmentId=ec032274-ccb3-4372-907d-acc4f1ca967f

https://railway.com/project/56cafcfa-394c-46c9-a811-dc3207bad3dc/service/41cfacc5-0155-4fce-8515-4bc1693a77f2?groupId=be2427f6-4308-48fd-b5a2-991a2eb8cf18&environmentId=ec032274-ccb3-4372-907d-acc4f1ca967f&id=4c838df4-abf2-42ec-9c45-488d1cf7494c#deploy

Both are failing to connect to two different Redis instances:

https://railway.com/project/56cafcfa-394c-46c9-a811-dc3207bad3dc/service/03969aa1-961a-420d-b153-3c55ff3880b1/database?groupId=be2427f6-4308-48fd-b5a2-991a2eb8cf18&environmentId=ec032274-ccb3-4372-907d-acc4f1ca967f

https://railway.com/project/56cafcfa-394c-46c9-a811-dc3207bad3dc/service/743565a6-fd71-4a42-9eeb-1e04a690d799?environmentId=ec032274-ccb3-4372-907d-acc4f1ca967f

22 Replies

efstajas

PRO · OP

5 months ago

Our Uptime Kuma monitor is also failing to ping several services over private networking with ECONNREFUSED, though most are fine.
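For anyone triaging a similar failure, a single TCP connection attempt can help tell a refused connection apart from a timeout or a DNS failure. The following is a minimal sketch, not something used in this thread: the `probe` helper and the host name in the usage note are hypothetical.

```python
import errno
import socket


def probe(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify one TCP connection attempt to host:port."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except ConnectionRefusedError:
        # The ECONNREFUSED case: the host answered, but nothing accepted
        # the connection on that port.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except socket.gaierror:
        # The host name could not be resolved.
        return "dns_failure"
    except OSError as exc:
        # Anything else, reported by its symbolic errno name when known.
        return f"error:{errno.errorcode.get(exc.errno, exc.errno)}"
```

Calling something like `probe("redis.railway.internal", 6379)` from inside an affected service (the host name here is an assumed example) would show whether the target is refusing connections or is simply unreachable.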

passos

MODERATOR

5 months ago

Hey, can you try redeploying the service in question?

passos

MODERATOR

5 months ago

For other people having issues: please create your own help thread and we'll help you there.

kenchoong

PRO

5 months ago

yes me too.

rabbitmq all down, unable to connect

SIGTERM received - shutting down

getting this

efstajas

PRO · OP

5 months ago

Did that. Redeploys get stuck on migrations in the pre-deploy step; it seems they cannot connect to the DB (PG) instance either.

efstajas

PRO · OP

5 months ago

more failures now -

https://railway.com/project/56cafcfa-394c-46c9-a811-dc3207bad3dc/service/8c317411-97a8-4123-a568-83113fff997f?groupId=be2427f6-4308-48fd-b5a2-991a2eb8cf18&environmentId=ec032274-ccb3-4372-907d-acc4f1ca967f&id=a517c91e-7466-426e-9222-300a34186754#deploy unable to reach PG

ashakibp

PRO

5 months ago

Yes, me too. The network is borked atm.

passos

MODERATOR

5 months ago

Maybe the PG service is the problem here? Any chance you could try a redeploy on your PG instance? That will cause downtime, so no problem if you can't. It would help us debug the problem further.

efstajas

PRO · OP

5 months ago

Definitely not. It's 3 different PG instances and 2 Redis instances that cannot be reached by different services atm, plus our Uptime Kuma cannot ping a number of servers via private networking.

passos

MODERATOR

5 months ago

Team is aware and looking into it

efstajas

PRO · OP

5 months ago

it seems like now all our services are affected, completely down.

efstajas

PRO · OP

5 months ago

not true - almost all.

passos

MODERATOR

5 months ago

An incident has been called: https://status.railway.com/cmli5y9xt056zsdts5ngslbmp.

efstajas

PRO · OP

5 months ago

After the outage almost everything recovered, most after a manual restart.

Unfortunately now we need to urgently deploy a script to replay missed webhooks on a critical service and the deployment has been stuck on "Running pre-deploy command" for 21+ minutes after logging success.

https://railway.com/project/56cafcfa-394c-46c9-a811-dc3207bad3dc/service/8c317411-97a8-4123-a568-83113fff997f?groupId=be2427f6-4308-48fd-b5a2-991a2eb8cf18&environmentId=ec032274-ccb3-4372-907d-acc4f1ca967f

passos

MODERATOR

5 months ago

Are you able to abort your deployment and then create a new one?

efstajas

PRO · OP

5 months ago

Already tried that; the second one has been stuck for 6 minutes as well. Applying migrations on this service usually takes less than 15 seconds.
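When a step that normally finishes in seconds appears to hang, one way to separate a hung command from a platform that has not registered the exit is to run the command under a hard deadline. This is a minimal sketch under stated assumptions: `run_with_deadline` is a hypothetical helper, and the exit code 124 simply mirrors the convention of GNU `timeout`. It would not help when the command has already exited and only the platform status is stale.

```python
import subprocess
import sys


def run_with_deadline(cmd: list[str], deadline_s: float) -> int:
    """Run cmd; return its exit code, or 124 if it exceeds deadline_s."""
    try:
        # subprocess.run kills the child process when the timeout expires
        # and then raises TimeoutExpired.
        return subprocess.run(cmd, timeout=deadline_s).returncode
    except subprocess.TimeoutExpired:
        print(f"{cmd[0]} exceeded {deadline_s}s, giving up", file=sys.stderr)
        return 124
```

A pre-deploy entry point could then call `sys.exit(run_with_deadline([...], 60))` with the project's own migration command in place of the placeholder, so a migration that really hangs fails fast instead of blocking the deployment.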

passos

MODERATOR

5 months ago

cc @Noah can you take a look into this?

noahd

EMPLOYEE

5 months ago

This is most likely due to the ongoing incident. We have it tracked and I will get back to you as soon as I have any info on this.

efstajas

PRO · OP

5 months ago

Going to take the desperate measure of temporarily removing the pre-deploy command to push this through, since the change does not contain a migration, and hope it doesn't make things worse.

efstajas

PRO · OP

5 months ago

By the way, and this is not a problem for us, but just in case it somehow helps diagnose on your end: we have a cron job that has been shown as "running" for over 20 minutes although it has actually finished: https://railway.com/project/56cafcfa-394c-46c9-a811-dc3207bad3dc/service/29f1876c-1230-4779-8d8b-265f7b71aaff?groupId=be2427f6-4308-48fd-b5a2-991a2eb8cf18&environmentId=ec032274-ccb3-4372-907d-acc4f1ca967f&id=cc16a24c-deee-40d4-b9f6-ad1818eb1346&start=2026-02-11T16%3A00%3A24.970Z&returnTo=cron-schedule

Seems like it may be the same root cause: process exits somehow being missed 🤷

efstajas

PRO · OP

5 months ago

This fixed our acute problem. We're back in sync 🙌 Thank you all for the assistance and good luck with the fallout 🦾

noahd

EMPLOYEE

5 months ago

So sorry you hit this and glad you're back!
