2 months ago
There is a disk I/O performance problem. My app is currently down in prod. I can't even restart or spin up a new DB service.
18 Replies
Postgres deployment stuck on CREATE_CONTAINER for 10+ minutes
Checkpoint sync times are 18.7 seconds (should be <1s)
Started around 2026-03-16 09:20 UTC
Project: 476842f9-1c85-4fd0-9ca0-af9d9249b5b9
2 months ago
Our postgres DB is down too. Does not want to redeploy either
2 months ago
Experiencing the exact same issue, has been down for close to an hour now and no way to redeploy. Extremely annoying.
2 months ago
issue is not on your side
2 months ago
mine too....
2 months ago
mine too is down
2 months ago
mine also, lots of angry customers emailing in
2 months ago
This is insane. Received a [NOTICE] Temporary Service Disruption email saying my DB would be down for 30 mins; it's now been an hour and a half that my service is down because the DB is failing. How is this remotely ok?
2 months ago
Hi everyone, I hear you and I'm really sorry you're dealing with this.
What happened was a host went down unexpectedly, which affected a subset of services on Railway including yours. This was not scheduled maintenance; as soon as we detected it, our infra team jumped on it to recover the host. The notifications you received were us letting you know as quickly as we could that something was wrong, not advance notice of planned work.
Unfortunately with unexpected outages like this, we don't get to choose the timing either, and I know that doesn't make the disruption any less impactful.
Database services in particular take a bit longer to come back, as they need to safely initialize before accepting connections (your data is safe).
2 months ago
Hi Chandrika, thank you for the update. For production grade applications this is unacceptable. What guidance can you provide to make sure that we are not affected by such downtime by having redundancy? Would it be to deploy a service to different regions or to have more replicas? I presume replicas on the same server? Regions might be clustered together on the same set of servers and thus be affected by downtime equally.
2 months ago
It is a shitty situation, but our applications depend on your infrastructure resiliency. An expected 2.5h+ of downtime because it's impossible to get databases back online quickly shouldn't be the case. This is something you should anticipate and prepare actions for.
2 months ago
My DB isn't available either; it fails to start, which leaves the application non-functional.
2 months ago
We've called an incident regarding this here: https://status.railway.com/cmmui0c7z012icp7ebcd1a3zv
2 months ago
This thread has been escalated to the Railway team.
Status changed to Awaiting Railway Response chandrika • 2 months ago
Status changed to Awaiting User Response ray-chen • 2 months ago
2 months ago
Just a quick note: if your service does not have a volume attached, please try re-deploying it
2 months ago
But Postgres cannot be scaled to other regions, as far as I can see.
Status changed to Awaiting Railway Response Railway • 2 months ago
2 months ago
Quick incident update: we identified the issue as a hardware failure on a single host in EU West. The affected infrastructure is being brought back online and workloads are recovering; some services have already been restored. Services with databases and attached storage may take a bit longer to fully come back. We'll continue to provide updates as recovery progresses.
Status changed to Awaiting User Response Railway • 2 months ago
2 months ago
We've resolved the incident https://status.railway.com/cmmui0c7z012icp7ebcd1a3zv. If your service has not automatically recovered, please try redeploying. If you're still experiencing issues after that, please let us know here and we'll help. Again, sorry for the disruption.
2 months ago
This thread has been marked as solved automatically due to a lack of recent activity. Please re-open this thread or create a new one if you require further assistance. Thank you!
Status changed to Solved Railway • about 2 months ago
