Help investigating service downtime

We observed one of our core services becoming unresponsive today between 3:56 and 4:06 PST. HTTP requests to the service were timing out and no new logs were showing up in Railway's UI for the service. We were also unable to load metrics for the service in Railway's UI.

We re-deployed the service and it is now healthy, and we can now see the downtime visualized in the metrics (see the attached screenshot).

This service was in our development environment, which is on Railway Metal, not our production environment, which is still on the legacy GCP infrastructure.

Did Railway have any known outages during that time? I see this issue on the status page, but that is referencing GCP, which does not correspond with where we saw the issue today.

Attachments

Solved

7 Replies


Hello there James,

Unfortunately yes, that workload was on a host that had to be restarted. It appears that when I called the incident, the internal tool I use was referencing the old host set. I apologize, and I will rectify this going forward. Hopefully this didn't lead to much impact.


Status changed to Awaiting User Response Railway 24 days ago


thomas-arcol
PRO

24 days ago

Hi Angelo, so to confirm, the status page incident that James linked was the underlying issue that caused our service to become unavailable, and the incident was tagged with the incorrect metadata about what cluster was impacted (since the incident says GCP, but the impacted service was on Metal)?

Will there be more flavor shared about the nature of the underlying issue? The timestamps on the linked incident don't appear to align with what we observed; it would be great to learn more about what went wrong.


Status changed to Awaiting Railway Response Railway 24 days ago


Yes for the above.

I am unsure what you mean by the timestamps; we called the incident in that timeframe, when we noticed that a single host had gone unhealthy. It may be delayed, but it does line up on our side. As for more information, I am happy to share: Railway continually updates our hosts throughout the day, and during a rollout the update daemon locked the machine. A combination of multiple "hot" services, such as logging and other services under load, did not shut down gracefully, which locked the machine. We then rebooted the machine, which restored access to the workload.

We did recover the machine, but the Infra team is working through a core fix and in the meantime has isolated that box from any new workloads so that we can make this host hang reproducible.


Status changed to Awaiting User Response Railway 24 days ago


That said (and I never like to hold customer workloads hostage), your current plan has no SLA, so that is the service level this workload currently receives. If this is a critical service that can't be impacted at all, we can see about getting you on a dedicated host.


thomas-arcol
PRO

23 days ago

By timestamps I mean that we observed impact starting about 15 minutes before the recorded start of impact on the status page.


Status changed to Awaiting Railway Response Railway 23 days ago


Yes, that is likely. We do need time to react to pages and confirm whether it is an incident.


Status changed to Awaiting User Response Railway 23 days ago


Railway
BOT

16 days ago

This thread has been marked as solved automatically due to a lack of recent activity. Please re-open this thread or create a new one if you require further assistance. Thank you!

Status changed to Solved Railway 16 days ago

