Railway Object Storage signed URL fetches degrade severely under high concurrency

zennoia

HOBBY OP

a month ago

Observations from the benchmark:

| Concurrency | R2 p50 | R2 p95 | R2 completed | Railway p50 | Railway p95 | Railway completed | Railway/R2 p50 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1x | 20.1s | 20.1s | 1/1 | 20.6s | 20.6s | 1/1 | 1.0x |
| 10x | 18.8s | 33.8s | 10/10 | 29.3s | 35.4s | 10/10 | 1.6x |
| 25x | 17.4s | 24.1s | 25/25 | 57.8s | 72.6s | 25/25 | 3.3x |
| 50x | 18.3s | 24.4s | 50/50 | 115.6s | 149.3s | 50/50 | 6.3x |
| 100x | 20.9s | 25.3s | 100/100 | 142.2s | 203.3s | 70/100 | 6.8x |

At 100x concurrency, Cloudflare R2 stayed around 21s p50 and completed 100/100 calls. Railway Object Storage rose to 142s p50, 203s p95, and only completed 70/100 calls. The Railway failures were Vertex/Gemini URL fetch timeouts/deadlines while attempting to fetch the signed Railway Object Storage URL.
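The last column of the table appears to be the Railway p50 divided by the R2 p50 at each concurrency level. That reading is an assumption on my part, but it can be checked with a few lines of Python. The `rows` values are copied from the table; the `slowdown` helper is a hypothetical name used only for this check.

```python
# (concurrency, R2 p50 in seconds, Railway p50 in seconds, reported ratio),
# copied from the table above.
rows = [
    (1, 20.1, 20.6, 1.0),
    (10, 18.8, 29.3, 1.6),
    (25, 17.4, 57.8, 3.3),
    (50, 18.3, 115.6, 6.3),
    (100, 20.9, 142.2, 6.8),
]


def slowdown(r2_p50: float, railway_p50: float) -> float:
    """Railway p50 relative to R2 p50, rounded to one decimal place."""
    return round(railway_p50 / r2_p50, 1)


for concurrency, r2, railway, reported in rows:
    print(f"{concurrency}x: computed {slowdown(r2, railway)}x, reported {reported}x")
```

Under that assumption, every computed ratio matches the reported one, so the final column is consistent with the p50 columns.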

GitHub repo with harness, README, and raw output:

https://github.com/shahaayush1999/railway-tigris-vs-cf-r2

Context:

I ran into this while debugging import tasks in my application. Customers import data into their CRM during setup, and under high concurrency I was seeing failures and very long-running tasks. Everything on my side looked fine: local behavior was fine, the app logic was fine, and the issue only appeared when many imports were happening concurrently.

Looking through failure logs, I saw cases where Google/Gemini was not able to download files from object storage due to fetch timeouts. That pointed suspicion toward the object storage provider rather than my application. I moved away from Railway Object Storage to Cloudflare R2 for this reason, because customers were seeing high failure rates and excessive latency.

This is not tied to one current Railway service. I am no longer using Railway Object Storage in production, so there is no active service I can attach as the affected service. I have also created Railway Object Storage buckets at different times, including more than a month apart, and saw the same class of issue. I previously discussed this with Jake, who said Railway did not see provider-side issues internally. I rebuilt the benchmark later in a more formal repo, at my own cost, and the result still shows the same pattern, so this does not look transient.

Benchmark methodology:

The benchmark uploads the same PDF to both Railway Object Storage and Cloudflare R2. For each provider it creates signed S3-compatible GET URLs, then sends those URLs to Gemini as file inputs. Gemini fetches the file from the provider and does almost no model work: it is prompted to simply reply "Hi".
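For readers unfamiliar with signed S3-compatible GET URLs, here is a minimal sketch of SigV4 query-string presigning using only the Python standard library. It is not taken from the benchmark repo, which may well use an SDK instead. The function name `presign_get`, the example host and path, and the default region `auto` are assumptions for illustration only.

```python
import datetime
import hashlib
import hmac
from urllib.parse import quote


def _sign(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode(), hashlib.sha256).digest()


def presign_get(host, path, access_key, secret_key, region="auto",
                expires_s=3600, now=None):
    """Return a SigV4 query-signed GET URL for https://{host}{path}.

    `now` must be a UTC datetime; it defaults to the current time.
    """
    now = now or datetime.datetime.now(datetime.timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    day = now.strftime("%Y%m%d")
    scope = f"{day}/{region}/s3/aws4_request"
    canonical_path = quote(path, safe="/")
    params = {
        "X-Amz-Algorithm": "AWS4-HMAC-SHA256",
        "X-Amz-Credential": f"{access_key}/{scope}",
        "X-Amz-Date": amz_date,
        "X-Amz-Expires": str(expires_s),
        "X-Amz-SignedHeaders": "host",
    }
    # Canonical query string: parameters sorted by name, values URL-encoded.
    query = "&".join(f"{k}={quote(v, safe='')}" for k, v in sorted(params.items()))
    canonical_request = "\n".join(
        ["GET", canonical_path, query, f"host:{host}\n", "host", "UNSIGNED-PAYLOAD"]
    )
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256", amz_date, scope,
        hashlib.sha256(canonical_request.encode()).hexdigest(),
    ])
    # Derive the signing key: date -> region -> service -> "aws4_request".
    key = _sign(("AWS4" + secret_key).encode(), day)
    for part in (region, "s3", "aws4_request"):
        key = _sign(key, part)
    signature = hmac.new(key, string_to_sign.encode(), hashlib.sha256).hexdigest()
    return f"https://{host}{canonical_path}?{query}&X-Amz-Signature={signature}"


# Hypothetical endpoint, bucket, and credentials, for illustration only.
print(presign_get("storage.example.com", "/my-bucket/sample.pdf", "KEY_ID", "SECRET"))
```

Anyone holding the resulting URL can fetch the object until it expires, which is what allows an external service to download the file without credentials.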

This is intentionally not a local download benchmark. I do not have, and do not want to rent, a fleet of servers with high-end network cards just to test object storage concurrency. Using Gemini URL inputs gives me a neutral external downloader and a practical proxy for many isolated remote workers. Increasing concurrency creates many independent Gemini requests, with downloads happening on Google's side rather than on my laptop. That avoids my local network, machine, and disk being the bottleneck.

I understand this is not a perfect benchmark. Gemini/Vertex adds some noise, especially at low concurrency. But at higher concurrency the trend is very clear: Railway Object Storage deteriorates heavily while R2 remains mostly flat. This causes real product issues: long-running imports, timeout failures, retries, and poor customer experience.
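The shape of one concurrency level can be sketched as follows. This is a simplified illustration, not the code from the repo: `run_level` and `summarize` are hypothetical names, `call` stands in for whatever issues one Gemini request with a signed URL, and the choice to compute percentiles over successful calls only is an assumption that may differ from the real harness.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def run_level(call, concurrency):
    """Fire `concurrency` calls at once.

    Returns (latencies of successful calls in seconds, number of failed calls).
    """
    def one(_):
        start = time.monotonic()
        try:
            call()
            return time.monotonic() - start, True
        except Exception:
            return time.monotonic() - start, False

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(one, range(concurrency)))
    ok = [seconds for seconds, success in results if success]
    return ok, len(results) - len(ok)


def summarize(latencies):
    """Return p50 and p95 of the given latencies, or None if there are none."""
    if not latencies:
        return None
    if len(latencies) == 1:
        return {"p50": latencies[0], "p95": latencies[0]}
    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}


if __name__ == "__main__":
    # Stand-in workload; a real run would call the model API here.
    latencies, failed = run_level(lambda: time.sleep(0.05), concurrency=10)
    print(summarize(latencies), f"{len(latencies)}/{len(latencies) + failed} completed")
```

Running the same function at 1, 10, 25, 50, and 100 against each provider would produce one table row per level.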

What I would like Railway to investigate:

Please benchmark Railway Object Storage/Tigris under high concurrent downloads from outside Railway's network.
Ideally test with a fleet of isolated download workers, strong network, and enough disk/CPU headroom that the client side is not the bottleneck.
Please compare against a known external S3-compatible provider such as Cloudflare R2.
Please specifically look at signed URL GET behavior and timeout/deadline failures under concurrency.
I am not asking you to debug a specific app service; I am reporting a provider-level degradation pattern that I had to work around by migrating storage providers.
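For the direct-download variant requested above, each isolated worker could run something like the sketch below. It uses only the Python standard library; `timed_get` and its outcome labels are hypothetical names, and the 60-second deadline is an arbitrary assumption rather than the limit Gemini/Vertex applies.

```python
import socket
import time
import urllib.error
import urllib.request


def timed_get(url, deadline_s=60.0, opener=urllib.request.urlopen):
    """Download `url` fully and return (outcome, seconds, bytes_read).

    outcome is one of "ok", "timeout", "deadline", or "error".
    """
    start = time.monotonic()
    size = 0
    try:
        # The urlopen timeout applies to each socket operation, not to the
        # whole transfer, so the total deadline is also checked while reading.
        with opener(url, timeout=deadline_s) as resp:
            while chunk := resp.read(64 * 1024):
                size += len(chunk)
                if time.monotonic() - start > deadline_s:
                    return "deadline", time.monotonic() - start, size
        return "ok", time.monotonic() - start, size
    except (TimeoutError, socket.timeout):
        return "timeout", time.monotonic() - start, size
    except urllib.error.URLError as exc:
        # Connection-phase timeouts arrive wrapped in URLError; HTTP errors
        # such as an expired signature end up classified as "error".
        timed_out = isinstance(exc.reason, (TimeoutError, socket.timeout))
        return ("timeout" if timed_out else "error"), time.monotonic() - start, size
```

Separating "timeout" and "deadline" from "error" matters here because the failures I observed were fetch timeouts, not HTTP errors.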

Solved

3 Replies

Status changed to Awaiting Railway Response Railway • 29 days ago

brody

EMPLOYEE

a month ago

I've gone through your write-up and the benchmark numbers. I'm not going to dismiss what you're seeing, but I want to be honest about where this lands for us right now.

We aren't seeing this pattern reported across our user base, so as it stands this isn't something we're able to prioritize an investigation into. For us to set aside cycles at this depth, we'd need to see the same degradation surfacing from multiple users.

That said, opening this thread does put the report on record. It stays tracked, and if enough corroborating reports of the same issue come in, this is what we'll come back to when it's time to dig in.

Status changed to Awaiting User Response Railway • 29 days ago

zennoia

HOBBY OP

a month ago

Hi Brody,

Appreciate the honesty.

Jake asked me on Twitter to create a repro harness, so I spent my own time carving out a GitHub repo purely so Railway employees could reproduce this easily, and I paid for the high-concurrency runs myself to have benchmark results ready before reporting. If you are now saying you're not going to look into this, then, well.
I would hesitate to call 10 or 25 high concurrency, yet the deterioration begins at just 10 simultaneous file downloads, so I'm frankly surprised this evaluation wasn't done on Railway's end or Tigris's end. I'd be happy to be proven wrong.
After confirming this issue exists, I had to migrate away from Railway/Tigris object storage to a reliable vendor. Tigris might be throttling egress bandwidth per customer, or for Railway customers specifically, or their infrastructure is silently deteriorating under concurrency. None of this is documented, so I'm inclined to think it's silent deterioration.
Regarding your point about this pattern not being reported across your user base: what users get is degraded download speed starting at low concurrency and massive degradation at higher concurrency, but yes, no outright failures. So you are unlikely to see reports about failures; every one of your users will simply face these silent cuts.

It shouldn't take you more than 10 minutes to spin up an object storage bucket, test a baseline, and then do just one run at 25 concurrency. That should be plenty for you to escalate with Tigris or at least reassess the priority internally.

Status changed to Awaiting Railway Response Railway • 28 days ago

brody

EMPLOYEE

a month ago

That's fair, and I don't want to misrepresent what was set up here. The point of putting together the repro and opening this thread was to get the report captured and tracked on our side. It wasn't a commitment that we'd run the benchmark ourselves or ship a fix off the back of it, and I'd rather be straight with you about that than imply an investigation is queued up when it isn't.

The silent-degradation point is a reasonable one, and it's on the thread now exactly as you've framed it. If it lines up with other signal we see over time, this is what we'll come back to.

Status changed to Awaiting User Response Railway • 28 days ago

Railway

BOT

21 days ago

This thread has been marked as solved automatically due to a lack of recent activity. Please re-open this thread or create a new one if you require further assistance. Thank you!

Status changed to Solved Railway • 21 days ago
