3 months ago
Three of our services have suddenly lost connectivity to their Redis and PG instances and gone down. There were no deploys or configuration changes on our end.
Service links (of some services that are failing due to ECONNREFUSED):
both are failing to connect to two different redis instances:
22 Replies
our uptimekuma monitor is also failing to ping several services via private networking due to ECONNREFUSED - most are fine though
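For anyone wanting to run the same kind of check from inside an affected service, here is a minimal sketch of a TCP probe that separates a refused connection (ECONNREFUSED) from a timeout or a DNS failure. The `probe` helper and its return labels are made up for illustration, not part of any Railway tooling.

```python
import errno
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt a TCP connect and classify the outcome as a short label."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered but nothing accepted the connection (ECONNREFUSED).
        return "refused"
    except socket.timeout:
        # No answer within `timeout` seconds.
        return "timeout"
    except socket.gaierror:
        # The hostname could not be resolved.
        return "dns_failure"
    except OSError as exc:
        # Any other socket-level failure, reported by its errno name if known.
        return f"error:{errno.errorcode.get(exc.errno, exc.errno)}"
```

Calling something like `probe("redis.railway.internal", 6379)` from a service on the private network (the hostname here is a placeholder, substitute your own) would return `"refused"` for the failure described above, whereas `"timeout"` or `"dns_failure"` would point to a different kind of problem.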
3 months ago
Hey, can you try redeploying the service in question?
3 months ago
for people having issues, please create your own help thread, we'll help you there
yes me too.
rabbitmq all down, unable to connect
SIGTERM received - shutting down
getting this
did that - redeploys get stuck on migrations in pre-deploy. it seems it cannot connect to the DB (pg) instance either
3 months ago
Maybe the PG service is the problem here? Any chance you could try a redeploy of your PG instance? That will cause some downtime, so no problem if you can't, but it would help us debug the problem further.
definitely not - it's 3 different PG instances and 2 Redises that cannot be reached by different services atm, plus our uptimekuma cannot ping a number of servers via private networking
3 months ago
Team is aware and looking into it
3 months ago
An incident has been called: https://status.railway.com/cmli5y9xt056zsdts5ngslbmp.
After the outage almost everything recovered, most after a manual restart.
Unfortunately now we need to urgently deploy a script to replay missed webhooks on a critical service and the deployment has been stuck on "Running pre-deploy command" for 21+ minutes after logging success.
3 months ago
Are you able to abort your deployment and then create a new one?
already tried that, the second one has been stuck for 6 minutes as well. Usually it takes less than 15 seconds to apply migrations on this service.
3 months ago
cc @Noah can you take a look into this?
3 months ago
this is most likely due to the currently ongoing incident. We have this tracked and will get back to you as soon as we get any info on this
gonna take a desperate measure to remove the pre deploy command temporarily to push this through since the change does not contain a migration & hope it doesn't make things worse
By the way, and this is not a problem for us, but just in case it somehow helps diagnose on your end: we have a cron job that has been shown as "running" for over 20 minutes even though it actually finished, here https://railway.com/project/56cafcfa-394c-46c9-a811-dc3207bad3dc/service/29f1876c-1230-4779-8d8b-265f7b71aaff?groupId=be2427f6-4308-48fd-b5a2-991a2eb8cf18&environmentId=ec032274-ccb3-4372-907d-acc4f1ca967f&id=cc16a24c-deee-40d4-b9f6-ad1818eb1346&start=2026-02-11T16%3A00%3A24.970Z&returnTo=cron-schedule
seems like it might be the same root cause, with process exits somehow not being registered 🤷
This worked to fix our acute problem. we're back in sync 🙌 Thank you all for the assistance and good luck with the fallout 🦾
3 months ago
So sorry you hit this and glad you're back!