p99 Response Latency Spikes to 20-27s

Anonymous

PRO OP

3 months ago

Service: Ponder Indexer
Environment: Production
Issue Duration: April 12 - Present
Severity: High

Symptom:
- p99 latency spiked from ~1-2 seconds to 20-27 seconds
- p50 latency remains normal (~100-400ms)
- No application-side delays (queryMs=15.0ms, totalMs=15.5ms)
- No resource exhaustion (CPU 0.16-0.27, Memory 0.69-1.0GB)
- No 5xx errors
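
A normal p50 alongside a spiking p99 is what you would expect when only a small fraction of requests stall. The sketch below illustrates this with synthetic numbers, not the real Ponder Indexer data. It assumes roughly 2% of requests take 20-27 seconds and the rest take 100-400ms; the `slow_fraction` value and the helper names are made up for illustration.

```python
import random
import statistics

def percentile(samples, pct):
    """Return the pct-th percentile (1-99) using the inclusive method."""
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[pct - 1]

def synthetic_latencies(n=10_000, slow_fraction=0.02, seed=0):
    """Synthetic latencies in seconds: mostly 100-400 ms, a few 20-27 s stalls."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        if rng.random() < slow_fraction:
            out.append(rng.uniform(20.0, 27.0))  # assumed stalled requests
        else:
            out.append(rng.uniform(0.1, 0.4))    # assumed healthy requests
    return out

latencies = synthetic_latencies()
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
print(f"p50={p50:.3f}s p99={p99:.3f}s")
```

With these assumptions the median stays in the healthy range while p99 lands in the stall range, which matches the shape of the symptom list above.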

Evidence:
Application logs show fast query execution (queryMs=15.0ms) and a total in-app response time of 15.5ms, so the 20+ second delay is being added outside the application, most likely in Railway infrastructure.

Request: Investigate edge routing, load balancing, or connection establishment latency for the Ponder Indexer service in the production environment.

$20 Bounty

7 Replies

Status changed to Awaiting Railway Response Railway • 3 months ago

Railway

BOT

3 months ago

This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.

Status changed to Open Railway • 3 months ago

Anonymous

PRO OP

3 months ago

@Railway this is clearly an infrastructure-level issue and server patching is not an appropriate solution here since it would not target the root cause.

Please investigate and fix the issue on your side ASAP. This issue started occurring on April 12th & there were 0 changes to the app/deployment on our side.

Railway replica ID: 63bec1ff-1282-4d1d-a103-7d2f73ee5efe

Railway deployment ID: bd90f94f-8f5a-4126-a941-e8f6c78eb451

avnish-es

HOBBY

2 months ago

The latency pattern appears consistent with an infrastructure-layer connection issue rather than application execution.

Observations:
- p50 remains normal.
- Application execution time is only ~15ms.
- CPU and memory are healthy.
- No 5xx errors.
- p99 clusters around 20-27 seconds.

The 20-27 second range is particularly interesting because it resembles TCP connection establishment retries or upstream connection timeout behavior rather than application processing delays.
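
To make the retry comparison concrete, here is a rough sketch of how cumulative delay builds up under exponential backoff. It assumes an initial retransmission timeout of 1 second that doubles on each retry (the initial value recommended by RFC 6298). The actual timers on Railway's edge are not known from this thread, so the function name, defaults, and resulting numbers are illustrative only.

```python
def syn_retry_schedule(initial_rto=1.0, retries=5):
    """Cumulative seconds since the first SYN at which each retransmission is sent."""
    elapsed = 0.0
    rto = initial_rto
    schedule = []
    for _ in range(retries):
        elapsed += rto      # wait one full timeout before retransmitting
        schedule.append(elapsed)
        rto *= 2            # exponential backoff: double the timeout each time
    return schedule

print(syn_retry_schedule())  # [1.0, 3.0, 7.0, 15.0, 31.0]
```

Under these assumptions retransmissions go out at 1, 3, 7, 15, and 31 seconds. Neither 15 nor 31 seconds falls inside the 20-27 second range on its own, so if retries are involved, a different initial timeout or an additional proxy-level timeout would probably have to be stacked on top. Comparing the observed tail against a schedule like this is one way to test the hypothesis.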

Potential areas to investigate:
- Edge → service connection establishment latency.
- Load balancer routing to degraded instances.
- TCP SYN retransmissions between edge and service.
- Connection pool exhaustion or stale upstream connections.
- Regional edge routing anomalies causing retries before successful backend selection.

Do p99 requests correlate with:
- specific Railway edge regions,
- specific containers/replicas,
- connection reuse failures, or
- upstream connection retry events?

The latency distribution looks more like network retry behavior than application execution latency.
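
One way to answer the correlation questions above is to group request records by a candidate dimension and compare the share of slow requests in each group. This is a minimal sketch: the record fields (`replica`, `edge_region`, `latency_s`), the sample values, and the 20 second cutoff are all hypothetical, so map them onto whatever your logs actually contain.

```python
from collections import Counter

SLOW_THRESHOLD_S = 20.0  # assumed cutoff, taken from the 20-27 s tail

def slow_share_by(records, key):
    """Fraction of requests per `key` value that met or exceeded the slow threshold."""
    total = Counter()
    slow = Counter()
    for rec in records:
        total[rec[key]] += 1
        if rec["latency_s"] >= SLOW_THRESHOLD_S:
            slow[rec[key]] += 1
    return {value: slow[value] / total[value] for value in total}

# Hypothetical sample records; field names and values are made up.
records = [
    {"replica": "replica-a", "edge_region": "us-west", "latency_s": 0.2},
    {"replica": "replica-a", "edge_region": "us-east", "latency_s": 23.5},
    {"replica": "replica-b", "edge_region": "us-west", "latency_s": 0.3},
    {"replica": "replica-b", "edge_region": "us-west", "latency_s": 0.1},
]
print(slow_share_by(records, "replica"))
print(slow_share_by(records, "edge_region"))
```

If one replica or region carries most of the slow requests, that points at a specific instance or route. An even spread would fit a shared layer such as the edge better.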

protelo

PRO

2 months ago

I can confirm I'm seeing the same p99 spikes, in some cases reaching 30s+ timeouts. I was waiting to see whether this was related to the edge issues that have been reported, but it doesn't appear to be, and it is still ongoing. I have Railway services in both US-WEST and US-EAST, and both experience the same issue. Below is a screenshot of 24-hour uptime across my Railway apps. The two services at 100% are both hosted elsewhere (AWS and another third-party hosting provider). This leads me to believe it's not the app, since the problem is spread across several different applications we've developed. The service on AWS runs exactly the same software as its Railway counterpart.

Additionally, here are the results from one of the affected services over a one-week time frame. Displayed times are in US/Eastern.

These results were tracked with Uptime Kuma on a Railway instance hosted on US-WEST.

Attachments

image.png

protelo

PRO

2 months ago

Some additional context: a few of these services are proxied by Cloudflare and others are not, so that's also unlikely to be a factor here.

nealimekenna

PRO

2 months ago

Wanted to let people know a second discussion probably related to this has surfaced: https://station.railway.com/questions/intermittent-522-connection-timed-out-556e8e65

angelo-railway

EMPLOYEE

a month ago

We've made a series of edge and routing improvements since you reported this. Are you still seeing the elevated response times? If so, share a couple of recent X-Railway-Request-Ids and we'll trace them; if it's cleared up, we'll go ahead and close this out.
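
For anyone who needs to collect those IDs, here is a minimal sketch of a client-side probe that times each request and records the `X-Railway-Request-Id` response header. The function names, the 5 second slow threshold, and the timeout are assumptions for illustration; only the header name comes from this thread.

```python
import time
import urllib.error
import urllib.request

def probe(url, timeout_s=60.0):
    """Time one GET request and return its latency, status, and Railway request ID."""
    start = time.monotonic()
    status = None
    request_id = None
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            resp.read()  # include body transfer in the measurement
            status = resp.status
            request_id = resp.headers.get("X-Railway-Request-Id")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry headers worth recording.
        status = err.code
        request_id = err.headers.get("X-Railway-Request-Id")
    except OSError:
        # Connection failures and timeouts: no response, so no request ID.
        pass
    return {
        "latency_s": time.monotonic() - start,
        "status": status,
        "request_id": request_id,
    }

def slow_request_ids(results, slow_threshold_s=5.0):
    """Request IDs of probes at or above the (assumed) slow threshold."""
    return [
        r["request_id"]
        for r in results
        if r["latency_s"] >= slow_threshold_s and r["request_id"]
    ]
```

Calling `probe()` in a loop against your service's public URL and passing the collected results to `slow_request_ids()` should yield the IDs worth sharing. Requests that time out before any response arrives will have no ID, so note their timestamps instead.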

Railway Team

protelo

PRO

a month ago

My last high latency event was on June 16th. Seems to have been much better since then!
