p99 Response Latency Spikes to 20-27s
Anonymous
PRO OP

2 months ago

Service: Ponder Indexer
Environment: Production
Issue Duration: April 12 - Present
Severity: High

Symptom:
- p99 latency spiked from ~1-2 seconds to 20-27 seconds
- p50 latency remains normal (~100-400ms)
- No application-side delays (queryMs=15.0ms, totalMs=15.5ms)
- No resource exhaustion (CPU 0.16-0.27, Memory 0.69-1.0GB)
- No 5xx errors

Evidence:
Application logs show that query execution is fast (queryMs=15.0ms) and that the total in-app response time is 15.5ms, so the 20+ second delay appears to occur outside the application, in Railway infrastructure.
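
To make that reasoning concrete, here is a minimal sketch that subtracts the app-logged total from the client-observed latency and compares percentiles. The sample values and the names `percentile` and `outside_app_ms` are illustrative assumptions, not data from this service.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0-100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def outside_app_ms(client_ms, app_total_ms):
    """Per-request time not accounted for by the application's own timer."""
    return [c - a for c, a in zip(client_ms, app_total_ms)]

# Hypothetical samples: 98 fast requests and 2 slow ones,
# all with the same app-side totalMs as in the logs above.
client_ms = [120.0] * 98 + [21000.0, 26500.0]
app_total_ms = [15.5] * 100

gap = outside_app_ms(client_ms, app_total_ms)
p50_gap = percentile(gap, 50)
p99_gap = percentile(gap, 99)
print("p50 outside app (ms):", p50_gap)
print("p99 outside app (ms):", p99_gap)
```

If the p99 of the gap is close to the p99 of the client-observed latency while the app-side timer stays flat, the slow portion is being spent before or after the application handles the request.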

Request: Investigate edge routing, load balancing, or connection establishment latency for the Ponder Indexer service in the production environment.
$20 Bounty

5 Replies

Status changed to Awaiting Railway Response Railway about 2 months ago


Railway
BOT

2 months ago

This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.

Status changed to Open Railway about 2 months ago


Anonymous
PRO OP

2 months ago

@Railway this is clearly an infrastructure-level issue, and server patching is not an appropriate solution here since it would not target the root cause.

Please investigate and fix the issue on your side ASAP. It started on April 12th, and there were no changes to the app or deployment on our side.

Railway replica ID: 63bec1ff-1282-4d1d-a103-7d2f73ee5efe

Railway deployment ID: bd90f94f-8f5a-4126-a941-e8f6c78eb451


avnish-es
HOBBY Top 10% Contributor

11 days ago

The latency pattern appears consistent with an infrastructure-layer connection issue rather than application execution.

Observations:
- p50 remains normal.
- Application execution time is only ~15ms.
- CPU and memory are healthy.
- No 5xx errors.
- p99 clusters around 20-27 seconds.

The 20-27 second range is particularly interesting because it resembles TCP connection establishment retries or upstream connection timeout behavior rather than application processing delays.
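
One way to test the retry hypothesis is to compare the observed latencies against a SYN retransmission schedule. The sketch below assumes plain exponential backoff with common Linux client defaults (1 s initial timeout, `net.ipv4.tcp_syn_retries = 6`); the timers on Railway's edge are not known here and may differ.

```python
def syn_retransmit_schedule(initial_rto_s=1.0, retries=6):
    """Return (send_times, give_up_time) in seconds since the first SYN.

    Assumes the timeout doubles after each unanswered SYN. The defaults
    are an assumption mirroring common Linux settings, not Railway's.
    """
    send_times = []
    elapsed = 0.0
    rto = initial_rto_s
    for _ in range(retries):
        elapsed += rto          # wait one timeout, then retransmit
        send_times.append(elapsed)
        rto *= 2                # exponential backoff
    give_up = elapsed + rto     # final wait before connect() fails
    return send_times, give_up

times, give_up = syn_retransmit_schedule()
print(times)    # [1.0, 3.0, 7.0, 15.0, 31.0, 63.0]
print(give_up)  # 127.0
```

Under these assumed defaults no single retransmission lands exactly inside 20-27 seconds, so if retries are involved, the spike may come from a different timer (for example a proxy connect timeout followed by a retry) or from several delays adding up.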

Potential areas to investigate:
- Edge → service connection establishment latency.
- Load balancer routing to degraded instances.
- TCP SYN retransmissions between edge and service.
- Connection pool exhaustion or stale upstream connections.
- Regional edge routing anomalies causing retries before successful backend selection.
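
From the client side, the first of these areas can be probed by timing the TCP handshake on its own, separately from the HTTP request. This is a rough sketch using only the Python standard library; `time_tcp_connect` is a hypothetical helper name, and the demo connects to a local listener so it runs without network access.

```python
import socket
import time

def time_tcp_connect(host, port, timeout_s=30.0):
    """Return the seconds taken to complete a TCP handshake with host:port."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout_s):
        return time.perf_counter() - start

# Demo against a local listener; the kernel completes the handshake
# as soon as the socket is listening, so no accept() call is needed.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))   # port 0 lets the OS pick a free port
server.listen(1)
host, port = server.getsockname()
elapsed = time_tcp_connect(host, port, timeout_s=5.0)
server.close()
print(f"TCP connect took {elapsed * 1000:.2f} ms")
```

Pointing the same function at the service's public hostname on port 443 and logging the result over time would show whether the slow requests are already slow at connection establishment. It only measures the client-to-edge handshake, so it cannot see the edge-to-service hop directly.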

Do p99 requests correlate with:
- specific Railway edge regions,
- specific containers/replicas,
- connection reuse failures,
- or upstream connection retry events?

The latency distribution looks more like network retry behavior than application execution latency.
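
If per-request logs with replica and edge information are available, the correlation questions above could be checked with a simple grouping. The record fields (`replica`, `edge`, `latency_ms`), the threshold, and the sample values are all hypothetical.

```python
from collections import Counter

def slow_share_by(records, key, threshold_ms=20000):
    """Fraction of requests at or above threshold_ms for each value of `key`."""
    totals = Counter(r[key] for r in records)
    slow = Counter(r[key] for r in records if r["latency_ms"] >= threshold_ms)
    return {k: slow[k] / totals[k] for k in totals}

# Hypothetical request records; real field names will depend on the log source.
records = [
    {"replica": "a", "edge": "us-west", "latency_ms": 110},
    {"replica": "a", "edge": "us-east", "latency_ms": 24000},
    {"replica": "b", "edge": "us-west", "latency_ms": 130},
    {"replica": "b", "edge": "us-east", "latency_ms": 21500},
]

by_replica = slow_share_by(records, "replica")
by_edge = slow_share_by(records, "edge")
print(by_replica)  # {'a': 0.5, 'b': 0.5}
print(by_edge)     # {'us-west': 0.0, 'us-east': 1.0}
```

An even split across replicas combined with a skew toward one edge region, as in this made-up data, would point away from a degraded instance and toward routing. A real analysis would need far more samples than this.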


protelo
PRO

6 days ago

I can confirm I'm seeing the same p99 spikes, up to 30s+ timeouts. I was waiting to see whether it was related to the edge issues that have been reported, but it doesn't appear to be, and it is still ongoing. I have Railway services in both US-WEST and US-EAST, and both experience the same issue. Below is a screenshot of 24-hour uptime across my Railway apps. The two services at 100% are both hosted elsewhere (AWS and another third-party hosting provider). This leads me to believe it's not the app, since the problem is spread across multiple different applications we've developed. The service on AWS runs exactly the same software as its counterpart on Railway.

[Screenshot: 24-hour uptime across Railway apps (image.png)]

Additionally, here are the results from one of the affected services over a one-week time frame. Displayed times are in US/Eastern.

[Screenshot: one-week results for one affected service (image.png)]

These results were tracked with Uptime Kuma on a Railway instance hosted on US-WEST.


protelo
PRO

6 days ago

Some additional context: a few of these services are proxied by Cloudflare and others are not, so that's also unlikely to be a factor here.


nealimekenna
PRO

3 days ago

Wanted to let people know that a second discussion, probably related to this one, has surfaced: https://station.railway.com/questions/intermittent-522-connection-timed-out-556e8e65

