Highly increased Response Time p95 and p99
muratogat
PRO OP

a month ago

We are seeing highly increased "worst case" response times in our service "Ponder Indexer". Please check the Response Time Metric graph. Since 11.04, p50 and p90 have stayed relatively unaffected, but p95 and p99 have skyrocketed. We also noticed this in practice: occasionally some of our backend services that depend on the app on Railway took extremely long to respond. During that time we didn't do any deployments or code changes. We restarted the service, but that didn't fix the problem. Yesterday we added some logs to debug and did another deployment, but we can't find the problem on our end.

What has changed? Why has the service significantly worsened since last week?

Thank you.

$20 Bounty

2 Replies

Status changed to Awaiting Railway Response Railway about 1 month ago


Railway
BOT

a month ago

This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.

Status changed to Open Railway about 1 month ago


silasmuyembi0-cyber
HOBBY

a month ago

Hey muratogat,

p50/p90 flat but p95/p99 exploding is a pretty specific signature, and usually it's not your app code (which matches what you're seeing: no deploys = no code changes). When only the tail moves, you're almost always looking at one of these:

1. A downstream dependency that got slower or flakier: the one that shows up in the "sometimes slow" path, not the hot path. Could be a third-party API, a DB query that hits a cold index, an external service with its own degradation. p50 doesn't notice because most requests don't touch it (or hit cached results). p99 notices because the slowest 1% are exactly the ones that do.

2. Connection pool exhaustion / saturation on a specific resource: DB, Redis, HTTP client pool, whatever. When the pool is healthy you get the p50/p90 behavior you expect. When it occasionally saturates, requests queue up and the tail blows out. p50 barely moves.

3. Noisy neighbor / host-level contention on Railway's side: less likely to be the cause but can contribute. Usually not this dramatic though.

4. A growing dataset or cache that crossed a threshold, e.g. a table that's now big enough that a query without an index started doing full scans, or a Redis instance hitting memory pressure and evicting hot keys. This often correlates with "it started getting worse around X date" even without a deploy.
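To make the "only the tail moves" point concrete, here's a tiny sketch with made-up numbers (the 15 ms fast path, the 5000 ms slow path and the 6% share are all assumptions for illustration, not your real data). It uses a simple nearest-rank percentile:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical workload: 940 requests on the fast path (15 ms)
# and 60 requests (6%) that hit a slow dependency (5000 ms).
latencies_ms = [15] * 940 + [5000] * 60

for p in (50, 90, 95, 99):
    print(f"p{p} = {percentile(latencies_ms, p)} ms")
# p50 = 15 ms
# p90 = 15 ms
# p95 = 5000 ms
# p99 = 5000 ms
```

Only 6% of requests got slower, yet p95 and p99 jump by more than two orders of magnitude while p50 and p90 don't move at all.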

Given you said it started around the 11th and got much worse yesterday, my gut says it's #1 or #4. The fact that it's "Ponder Indexer" (sounds like a blockchain indexer?) makes #1 even more likely: RPC providers degrade all the time and you'd never see it in your own logs unless you're specifically timing outbound calls.

Stuff I'd check in this order:

1. Instrument outbound calls. Wrap every external HTTP/RPC call and log duration + status. I'd bet money one of them has a p99 that's gone from 200ms to 5s+. If you're using an RPC provider (Alchemy, Infura, QuickNode, etc.), check their status page for the dates when it started; they often have incidents that don't get announced loudly.

2. Check your DB slow query log. If you're on Postgres, set log_min_duration_statement = 500 (the value is in milliseconds) temporarily and see what shows up. A single query that went from fast to slow because of a missing index on a growing table matches this pattern exactly.

3. Look at connection pool metrics. How many active / idle / waiting connections over time? If you see "waiting" spike even occasionally, there's your tail.

4. Check memory and GC behavior. If your service is doing occasional long GC pauses or hitting swap, p99 will blow up while p50 stays fine. Railway's metrics graph should show memory; look for a sawtooth pattern or anything climbing.

5. Timeouts. What are your outbound timeouts set to? If you have a 10s timeout on an external call and that call now times out on 1% of requests, your p99 is now ~10s by definition. Tightening timeouts + adding retries with backoff often helps the tail even when you can't fix the root cause.

6. Correlate with request type. Is the tail dominated by a specific endpoint? One indexer job? Group your p99 by route/handler. 9 times out of 10 it's one specific code path doing something expensive occasionally.
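For the "instrument outbound calls" and "timeouts + retries" points, a minimal sketch of what I mean. I don't know your stack, so this is plain Python with only the standard library; timed_call, the logger name and the default values are hypothetical names I made up, and the timeout itself is assumed to be enforced inside the function you pass in:

```python
import logging
import time

log = logging.getLogger("outbound")  # hypothetical logger name

def timed_call(name, fn, *, retries=2, base_delay=0.1, sleep=time.sleep):
    """Call fn(), log its duration and outcome, and retry with exponential backoff.

    fn is expected to enforce its own timeout (for example by passing
    timeout=2 to the HTTP client it uses), so a hung dependency fails fast.
    """
    for attempt in range(retries + 1):
        start = time.perf_counter()
        try:
            result = fn()
        except Exception as exc:
            elapsed_ms = (time.perf_counter() - start) * 1000
            log.warning("outbound=%s attempt=%d status=error duration_ms=%.1f error=%r",
                        name, attempt + 1, elapsed_ms, exc)
            if attempt == retries:
                raise  # out of retries: surface the original error
            sleep(base_delay * 2 ** attempt)  # 0.1s, 0.2s, 0.4s, ...
        else:
            elapsed_ms = (time.perf_counter() - start) * 1000
            log.info("outbound=%s attempt=%d status=ok duration_ms=%.1f",
                     name, attempt + 1, elapsed_ms)
            return result

# Usage (not executed here; the URL is a placeholder):
#   import urllib.request
#   body = timed_call("rpc", lambda: urllib.request.urlopen("https://example.invalid", timeout=2).read())
```

Once every external call goes through something like this, you can group the logged duration_ms by outbound name and see directly whether one dependency's tail has moved.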

One practical thing that helped me on a similar issue: add a request ID + timing breakdown to your logs for any request over, say, 2 seconds. Something like "request X: db=50ms, rpc=4800ms, handler=20ms". Three slow requests with that breakdown and you'll instantly see where the time is going. Without it you're guessing.
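Roughly what that looked like for me, as a sketch: RequestTimer and its parameters are names I'm making up here, and the 2000 ms threshold is just the example number from above. Note that the total is the sum of the measured phases, so time spent outside any phase is not counted:

```python
import time
from contextlib import contextmanager

class RequestTimer:
    """Collects per-phase timings for one request and reports only slow ones."""

    def __init__(self, request_id, slow_threshold_ms=2000, clock=time.perf_counter):
        self.request_id = request_id
        self.slow_threshold_ms = slow_threshold_ms
        self._clock = clock  # injectable so the timer can be tested deterministically
        self.phases = {}

    @contextmanager
    def phase(self, name):
        start = self._clock()
        try:
            yield
        finally:
            elapsed_ms = (self._clock() - start) * 1000
            # Accumulate, so a phase entered several times is summed.
            self.phases[name] = self.phases.get(name, 0.0) + elapsed_ms

    def summary(self):
        """Return a log line if the measured phases exceed the threshold, else None."""
        total_ms = sum(self.phases.values())
        if total_ms < self.slow_threshold_ms:
            return None
        parts = ", ".join(f"{name}={ms:.0f}ms" for name, ms in self.phases.items())
        return f"request {self.request_id}: {parts}"

# Usage inside a handler (the phase bodies are placeholders):
#   timer = RequestTimer(request_id)
#   with timer.phase("db"): ...
#   with timer.phase("rpc"): ...
#   line = timer.summary()
#   if line: print(line)
```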

Re: Railway itself: I wouldn't rule it out entirely, but when the degradation is this targeted to the tail and started on a specific date, it's almost always something downstream of your service rather than the platform. Railway's own p99 variance doesn't usually look like this.

If you can share:

  • what your service actually does (sounds like a blockchain indexer?)
  • what external services / DBs it depends on
  • a rough shape of the p99 graph (gradual climb vs step function on a specific day)

…it'd be way easier to narrow it down. Step functions on specific days usually mean an external dependency changed. Gradual climbs usually mean a dataset/cache/memory crossing a threshold.

Good luck, let us know what you find.


Anonymous
PRO

a month ago

@silasmuyembi0-cyber our API returns data from an in-memory cache. It worked just fine up until April 11-12th, and the issue started emerging WITHOUT any changes to the code/infra/etc. having been made on our side. At the same time, we clearly see that the total request processing time on the backend stays low (roughly up to ~15ms), just as it used to be.

Our other Railway-hosted services are not affected by this issue & our 3rd party dependencies have nothing to do with it.

This is clearly an infrastructure-level issue on the Railway side and server code patching is not an appropriate solution here (if at all) since it would not target the root cause.

I have a couple of guesses:

  • either Railway co-hosts our service with other CPU/networking/etc-intensive tasks of other Railway customers
  • or Railway has issues with the LB/network/etc configuration (e.g., unreasonable throughput limitations/incorrectly configured routing/etc)
  • could be a combination of both above

Please investigate and fix the issue on your side.

Railway replica ID: 63bec1ff-1282-4d1d-a103-7d2f73ee5efe

Railway deployment ID: bd90f94f-8f5a-4126-a941-e8f6c78eb451

Attaching two screenshots with response-times-charts (24h and 30 days).

On the 2nd screenshot, the initial spikes presumably represent an older, less performant version of our service.

And the range between March 25th and ~April 12th is the expected baseline for the current version of our service.

