Outage?

nickmPRO

a year ago

Are you experiencing any issues? We're in Singapore. Just checking the public channel since private support isn't responding. Builds aren't working and 5 different environments are down.

21 Replies

ralengPRO

a year ago

We have a service down as well in Singapore.



nickmPRO

a year ago

Sad state of affairs on our production infrastructure



devcsrjPRO

a year ago

Same here - nothing is getting deployed at the moment


nickmPRO

a year ago

Ping


a year ago

please check <#846875565357006878> for updates


nickmPRO

a year ago

Thanks, Brody! I will, now that there's one there.


nickmPRO

a year ago

Got "no available stackers found within resource limits" on an attempted redeploy.



a year ago

Hi Nick, please stand by. We are investigating; an incident has been called.


nickmPRO

a year ago

Thanks, David. Adding some context where I have it, in case it helps with debugging.


nickmPRO

a year ago

We came back online ~30 mins ago. Now we're back offline as of ~4 mins ago


a year ago

Our apps and services are still down as well. We tried migrating to the US region, with no luck.


khoavn02TRIAL

a year ago

Please help, I can't connect to the Postgres DB anymore.


nickmPRO

a year ago

Still down for us too.


khoavn02TRIAL

a year ago

Do you have a backup? I'm thinking of migrating the database to another provider.


nickmPRO

a year ago

Don't be too hasty: this should be resolved soon (given how long it took last time), though I'm not aware of your requirements. At a certain point migrating would have to be an option, but we won't do that just yet.


a year ago

Starting to see our services up now…


nickmPRO

a year ago

Thanks, partbot. Trying to redeploy but no luck as yet. I'll also check in when we're up.


a year ago

Update: Partial recovery, 50% of capacity restored. Actively working on the rest. Thanks for your patience; the on-call team is working as swiftly as possible to restore service.


a year ago

Thanks, David.


jtechbitPRO

a year ago

ETA on full capacity restoration?


nickmPRO

a year ago

Thanks David and team


jtechbitPRO

a year ago

Time for another update? Just a reminder that people have production infrastructure that is affected.


rendercoderPRO

a year ago

I just deployed services in the Singapore region and encountered a similar issue: the service fails to deploy.



nickmPRO

a year ago

Still down. I'm regularly trying to redeploy, to no avail.


jtechbitPRO

a year ago

The level of communication from Railway on this incident is totally unacceptable. I hope processes can be improved as a result of the post-mortem. Even just a “we are continuing to work on it” would give some confidence an on-call team is actually working on this…


nickmPRO

a year ago

My production systems have been down for 4 hours in this outage, and 6 hours 15 minutes in total today, so far.


a year ago

Update: The core issue has been identified and a resolution is in progress to restore service. The on-call team is working to roll it out.


nickmPRO

a year ago

I'm online now. Redeploying worked


devcsrjPRO

a year ago

4 out of my 5 services redeployed properly. One still hasn't recovered. It might take a while longer for the fix to be rolled out.


nickmPRO

a year ago

Almost 11pm here, going to be a nervous night's sleep given the day of issues.

Thanks for getting it resolved, team. Echoing jtechbit: not enough comms given the severity.


a year ago

Thanks for the feedback, acknowledged. That's on me personally for not communicating more. We've had the full on-call team on this (with several additional engineers joining) for as many hours as service has been down.


a year ago

Full service restoration in sight.


a year ago

Fix implemented. Resolved.


jtechbitPRO

a year ago

Thank you for the update David! My services are now responding normally.


a year ago

We've published a full incident retro here: https://blog.railway.app/p/2024-05-04-incident-report