2 months ago
Service: Ponder Indexer
Environment: Production
Issue Duration: April 12 - Present
Severity: High
Symptom:
- p99 latency spiked from ~1-2 seconds to 20-27 seconds
- p50 latency remains normal (~100-400ms)
- No application-side delays (queryMs=15.0ms, totalMs=15.5ms)
- No resource exhaustion (CPU 0.16-0.27, Memory 0.69-1.0GB)
- No 5xx errors
Evidence:
Application logs show query execution is fast (15ms) and total response time is 15.5ms, meaning the 20+ second delay occurs in Railway infrastructure, not the application.
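One way to back this up from the client side is to time connection setup separately from the wait for the first response byte. The following is a minimal sketch, not the method used in this thread: the function name `time_request_phases` and its parameters are hypothetical, and it assumes the service is reachable directly by hostname over HTTP or HTTPS.

```python
import socket
import ssl
import time


def time_request_phases(host, port=443, path="/", use_tls=True, timeout=60.0):
    """Time TCP connect, TLS handshake, and time-to-first-byte for one GET.

    Hypothetical helper for illustration; all names here are assumptions.
    """
    t_start = time.perf_counter()
    # Phase 1: DNS resolution plus TCP connection establishment.
    sock = socket.create_connection((host, port), timeout=timeout)
    t_connected = time.perf_counter()
    try:
        # Phase 2: optional TLS handshake on top of the TCP connection.
        if use_tls:
            context = ssl.create_default_context()
            sock = context.wrap_socket(sock, server_hostname=host)
        t_tls = time.perf_counter()
        request = (
            f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            "Connection: close\r\n\r\n"
        )
        sock.sendall(request.encode("ascii"))
        # Phase 3: wait until the first byte of the response arrives.
        first_byte = sock.recv(1)
        t_first_byte = time.perf_counter()
    finally:
        sock.close()
    return {
        "connect_ms": (t_connected - t_start) * 1000,
        "tls_ms": (t_tls - t_connected) * 1000,
        "ttfb_ms": (t_first_byte - t_tls) * 1000,
        "got_response": bool(first_byte),
    }
```

If slow requests show a large `connect_ms` or `tls_ms` while `ttfb_ms` stays small, that would be consistent with the delay sitting in front of the application. A large `ttfb_ms` alone would not distinguish between the proxy layer and the app.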
Request: Investigate edge routing, load balancing, or connection establishment latency for the Ponder Indexer service in the production environment.
5 Replies
Status changed to Awaiting Railway Response Railway • about 2 months ago
2 months ago
This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.
Status changed to Open Railway • about 2 months ago
2 months ago
@Railway this is clearly an infrastructure-level issue and server patching is not an appropriate solution here since it would not target the root cause.
Please investigate and fix the issue on your side ASAP. This issue started occurring on April 12th & there were 0 changes to the app/deployment on our side.
Railway replica ID: 63bec1ff-1282-4d1d-a103-7d2f73ee5efe
Railway deployment ID: bd90f94f-8f5a-4126-a941-e8f6c78eb451
11 days ago
The latency pattern appears consistent with an infrastructure-layer connection issue rather than application execution.
Observations:
p50 remains normal.
Application execution time is only ~15ms.
CPU and memory are healthy.
No 5xx errors.
p99 clusters around 20-27 seconds.
The 20-27 second range is particularly interesting because it resembles TCP connection establishment retries or upstream connection timeout behavior rather than application processing delays.
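As a rough check on the retry hypothesis, the cumulative delay of SYN retransmissions can be worked out by hand. This sketch assumes an initial retransmission timeout of 1 second that doubles on each attempt (the initial value recommended by RFC 6298 and the usual Linux default). What Railway's edge actually uses is not known here, and the function name is hypothetical.

```python
def syn_retransmit_offsets(initial_rto_s=1.0, retries=6):
    """Return the seconds after the first SYN at which each retransmission is sent.

    Assumes plain exponential backoff: the timeout doubles after every attempt.
    """
    offsets = []
    elapsed = 0.0
    rto = initial_rto_s
    for _ in range(retries):
        elapsed += rto  # wait one full timeout before retransmitting
        offsets.append(elapsed)
        rto *= 2  # double the timeout for the next attempt
    return offsets


print(syn_retransmit_offsets())  # [1.0, 3.0, 7.0, 15.0, 31.0, 63.0]
```

Under these assumptions a connection that succeeds on a retransmitted SYN completes roughly 1, 3, 7, 15 or 31 seconds late, so a 20-27 second cluster does not line up with a single point on this schedule. It may fit better with a fixed upstream connect timeout followed by a retry, or with a different backoff configuration, which is why the connection-level data matters.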
Potential areas to investigate:
Edge → service connection establishment latency.
Load balancer routing to degraded instances.
TCP SYN retransmissions between edge and service.
Connection pool exhaustion or stale upstream connections.
Regional edge routing anomalies causing retries before successful backend selection.
Do p99 requests correlate with:
specific Railway edge regions,
specific containers/replicas,
connection reuse failures,
or upstream connection retry events?
The latency distribution looks more like network retry behavior than application execution latency.
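If per-request records can be exported, the correlation questions above could be checked with a simple grouping. This is a sketch only: the field names `edge_region`, `replica_id` and `latency_ms` are hypothetical, and whether Railway's logs expose equivalent fields is an assumption.

```python
from collections import defaultdict


def slow_share_by(records, key, threshold_ms=5000.0):
    """Return, per value of `key`, the fraction of requests at or above threshold_ms.

    `records` is an iterable of dicts with a `latency_ms` field; the field
    names are illustrative, not taken from any real log format.
    """
    totals = defaultdict(int)
    slow = defaultdict(int)
    for record in records:
        group = record.get(key, "unknown")
        totals[group] += 1
        if record["latency_ms"] >= threshold_ms:
            slow[group] += 1
    return {group: slow[group] / totals[group] for group in totals}


# Made-up sample data, for illustration only.
sample = [
    {"replica_id": "a", "edge_region": "us-west", "latency_ms": 120},
    {"replica_id": "a", "edge_region": "us-east", "latency_ms": 25000},
    {"replica_id": "b", "edge_region": "us-west", "latency_ms": 140},
    {"replica_id": "b", "edge_region": "us-west", "latency_ms": 160},
]
print(slow_share_by(sample, "replica_id"))  # {'a': 0.5, 'b': 0.0}
```

A slow share concentrated in one replica or one edge region would point at a degraded instance or route. An even spread across all groups would point more toward something shared, such as the proxy layer.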
6 days ago
I can confirm I'm seeing the same p99 spikes, to the extent of 30s+ timeouts. I was waiting to see if it was related to the edge issues that have been reported, but it doesn't appear to be, and it is still ongoing. I have Railway services in both US-WEST and US-EAST, and both experience the same issue. Below is a screenshot of 24-hour uptime across my Railway apps. The two services at 100% are both hosted elsewhere (AWS and another third-party hosting provider). This leads me to believe it's not the app, since the problem is spread across multiple different applications we've developed. In the case of the competing service on AWS, it's hosting exactly the same software as the one on Railway.
Additionally, here are the results from one of the affected services over a 1-week time frame. Displayed times are in US/Eastern.
These results were tracked with Uptime Kuma on a Railway instance hosted on US-WEST.
6 days ago
Some additional context: a few of these services are proxied by Cloudflare and others are not, so that's also unlikely to be a factor here.
3 days ago
Wanted to let people know a second discussion probably related to this has surfaced: https://station.railway.com/questions/intermittent-522-connection-timed-out-556e8e65