Latency Increase Without Deploy or Traffic Change
injung
PRO · OP

a month ago

Hi Railway team,

We are seeing a sudden and sustained increase in latency, even though there have been no deploys, configuration changes, or traffic spikes on our side. This started recently and does not look like normal behavior.

  • Database queries have become noticeably slower
  • API response times have increased accordingly (p95 latency rising)
  • The issue affects multiple endpoints and queries
  • Traffic levels remain consistent with our usual baseline

There were no changes in application code or infrastructure configuration. The degradation is system-wide, not isolated to a specific query or endpoint. Even simple queries that normally complete in a few milliseconds are now taking significantly longer.

It seems unlikely to be an application-level issue and more likely related to the underlying database or host environment. This is impacting production traffic and revenue, so we'd really appreciate your help in investigating this.

Thanks.

$20 Bounty

2 Replies

Status changed to Open Railway about 1 month ago


injung
PRO · OP

a month ago

It seems to have returned to normal about 4 hours ago.

However, if performance can vary this significantly depending on host conditions, it makes it difficult for us to fully rely on the Railway platform for production workloads.

We'd really appreciate more clarity on:

  • what might have caused this behavior, and
  • how this will be prevented or mitigated going forward.

Thanks 🙏


suryalim11
HOBBY · Top 5% Contributor

2 days ago

Glad to hear it resolved, but I understand your concern about production reliability. Here is some context on what likely caused this and what you can do to protect against it:

What likely caused the latency spike

Railway runs services on shared host infrastructure (GCP compute). Without a deploy or config change on your side, sudden latency increases are almost always caused by one of:

  1. Noisy neighbor on the same host — another tenant's workload puts heavy CPU/IO pressure on the underlying host, degrading your database I/O. This is the most common cause for "unexplained" slowdowns that resolve on their own.

  2. Host-level maintenance or live migration — GCP occasionally performs live migrations of VMs for maintenance. During this, there can be brief performance degradation that appears as query slowdowns.

  3. Database buffer/cache eviction — If the host is under memory pressure, the Postgres buffer cache can get evicted, causing queries to go to disk instead of memory, drastically increasing latency.

How to mitigate and detect this faster going forward

  1. Set up Railway observability alerts — Monitor your p95/p99 response times and database CPU/memory metrics in Railway's Observability tab. Set up external uptime/latency monitoring (e.g., Better Uptime, Checkly) that can alert you within minutes.

  2. Track Postgres performance baselines — With the pg_stat_statements extension enabled, query it periodically to detect when per-query execution times drift from their baseline or the cache hit ratio drops below 99%.

  3. Consider vertical scaling for production — Upgrading to a larger Railway instance gives your Postgres service more headroom, so it is less exposed to shared-host pressure.

  4. Check status.railway.com — Future incidents like this are tracked there. Subscribing to status updates means you'll know if Railway is seeing host-level issues before spending time debugging your code.
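
To make the cache hit ratio check from item 2 concrete, here is a minimal Python sketch. It assumes you read the `blks_hit` and `blks_read` counters from Postgres's `pg_stat_database` view, which is a different source than pg_stat_statements but exposes the same idea database-wide. The function names and the 0.99 threshold are my own illustrative choices, and actually running the query against your database is left out.

```python
# Query to fetch the raw counters (run it with whatever client you already use).
CACHE_STATS_SQL = """
SELECT blks_hit, blks_read
FROM pg_stat_database
WHERE datname = current_database();
"""


def cache_hit_ratio(blks_hit: int, blks_read: int) -> float:
    """Fraction of block requests served from the buffer cache."""
    total = blks_hit + blks_read
    if total == 0:
        # No block activity at all: treat as healthy rather than divide by zero.
        return 1.0
    return blks_hit / total


def interval_hit_ratio(prev: tuple, curr: tuple) -> float:
    """Hit ratio between two (blks_hit, blks_read) samples.

    The counters are cumulative, so the delta between two samples reflects
    recent behaviour better than the lifetime totals do.
    """
    return cache_hit_ratio(curr[0] - prev[0], curr[1] - prev[1])


def cache_is_healthy(blks_hit: int, blks_read: int, threshold: float = 0.99) -> bool:
    """True when the hit ratio is at or above the (assumed) 99% threshold."""
    return cache_hit_ratio(blks_hit, blks_read) >= threshold
```

Sampling the counters every minute or so and alerting when `interval_hit_ratio` drops would have flagged an eviction event like the one described in cause 3, without touching application code.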

The Railway team can review host-level metrics for that time window if you share the exact timeframe when the latency was elevated.
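
If you have request logs with timestamps and latencies, something like the sketch below can help pin down that timeframe. This is only an illustration: the `(unix_ts, latency_ms)` sample format, the function names, the one-minute buckets, and the "2x baseline" rule are all assumptions on my part, not anything Railway provides.

```python
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("p95 needs at least one value")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def elevated_window(samples, baseline_ms, factor=2.0, bucket_s=60):
    """Return (start_ts, end_ts) spanning the first to the last bucket whose
    p95 latency exceeds factor * baseline_ms, or None if no bucket does.

    samples: iterable of (unix_ts, latency_ms) pairs.
    """
    buckets = defaultdict(list)
    for ts, latency in samples:
        bucket_start = int(ts // bucket_s) * bucket_s
        buckets[bucket_start].append(latency)

    hot = sorted(
        start for start, vals in buckets.items()
        if p95(vals) > factor * baseline_ms
    )
    if not hot:
        return None
    return hot[0], hot[-1] + bucket_s
```

The returned pair is in Unix seconds, so you can convert it with `datetime.fromtimestamp` before sharing it. Note that it reports one span from the first to the last slow bucket, so brief recoveries in between are included in the window.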

