Highly increased Response Time p95 and p99

muratogat

PRO OP

3 months ago

We are seeing highly increased "worst case" response times in our service "Ponder Indexer". Please check the Response Time metric graph. Since April 11 (11.04), p50 and p90 have stayed relatively unaffected, but p95 and p99 have skyrocketed. We also noticed this in practice: occasionally, some of our backend services that depend on the app on Railway took extremely long to respond. During that time we didn't do any deployments or code changes. We restarted the service, but that didn't fix the problem. Yesterday we added some logs for debugging and did another deployment, but we can't find the problem on our end.

What has changed? Why has the service significantly worsened since last week?

Thank you.

$20 Bounty

2 Replies

Status changed to Awaiting Railway Response Railway • 3 months ago

Railway

BOT

3 months ago

This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.

Status changed to Open Railway • 3 months ago

Anonymous

PRO

3 months ago

@silasmuyembi0-cyber our API returns data from an in-memory cache. It worked just fine until April 11-12, and the issue started emerging WITHOUT any changes to the code, infra, etc. having been made on our side. At the same time, we clearly see that the total request processing time on the backend stays low (up to roughly 15 ms), just as it used to be.
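To illustrate the pattern in numbers, here is a minimal sketch of computing percentiles with the nearest-rank method for two series: in-app processing time and end-to-end response time. The `percentile` and `summarize` helpers and all sample values are hypothetical, made up for illustration; they are not our actual measurements.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(samples):
    """Return the percentiles discussed in this thread."""
    return {f"p{p}": percentile(samples, p) for p in (50, 90, 95, 99)}


# Hypothetical samples in milliseconds (NOT real data from our service).
server_ms = [12] * 100                  # time spent inside the app
end_to_end_ms = [20] * 92 + [2500] * 8  # time as seen by the caller

print("server     ", summarize(server_ms))
print("end-to-end ", summarize(end_to_end_ms))
```

With these made-up numbers, the in-app series stays flat at every percentile, while the end-to-end series keeps a low p50 and p90 but has a very high p95 and p99. A handful of slow requests is enough to produce that shape.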

Our other Railway-hosted services are not affected by this issue & our 3rd party dependencies have nothing to do with it.

This is clearly an infrastructure-level issue on the Railway side, and patching the server code is not an appropriate solution here (if it would help at all), since it would not target the root cause.

I have a couple of guesses:

either Railway co-hosts our service with CPU-intensive, network-intensive, or otherwise heavy workloads of other Railway customers,
or Railway has issues with the LB/network configuration (e.g., unreasonable throughput limits or incorrectly configured routing),
or it could be a combination of both.

Please investigate and fix the issue on your side.

Railway replica ID: 63bec1ff-1282-4d1d-a103-7d2f73ee5efe

Railway deployment ID: bd90f94f-8f5a-4126-a941-e8f6c78eb451

Attaching two screenshots with response time charts (24 hours and 30 days).

On the 2nd screenshot, the initial spikes presumably represent an older, less performant version of our service.

And the range between March 25th and ~April 12th is the expected baseline for the current version of our service.

Attachments

image.png

angelo-railway

EMPLOYEE

19 days ago

We've made a series of edge and routing improvements since you reported this. Are you still seeing the elevated response times? If so, share a couple of recent X-Railway-Request-Ids and we'll trace them; if it's cleared up, we'll go ahead and close this out.
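One possible way to collect those IDs is sketched below. It assumes the `X-Railway-Request-Id` header is also present on the response the client receives, which you should verify for your own setup. The `probe` and `slow_request_ids` helpers, the threshold, and the URL placeholder are all hypothetical.

```python
import time
import urllib.error
import urllib.request

# Header name as mentioned above; its presence on responses is an assumption.
REQUEST_ID_HEADER = "X-Railway-Request-Id"


def probe(url, timeout_s=30.0):
    """Time one GET request and return (elapsed_ms, request_id or None)."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            resp.read()  # include the body transfer in the measurement
            headers = resp.headers
    except urllib.error.HTTPError as err:
        headers = err.headers  # 4xx/5xx responses still carry headers
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms, headers.get(REQUEST_ID_HEADER)


def slow_request_ids(results, threshold_ms):
    """Keep the request IDs of probes slower than threshold_ms, skipping missing IDs."""
    return [rid for elapsed_ms, rid in results if elapsed_ms > threshold_ms and rid]


# Example usage (replace the placeholder with your own service URL):
#   results = [probe("https://<your-service-domain>/") for _ in range(200)]
#   print(slow_request_ids(results, threshold_ms=1000))
```

Only the IDs of requests that exceeded the threshold are kept, so the list you share stays short and points straight at the slow requests.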
