9 months ago
Hello, I noticed the streamed LLM chunks from my app on Railway coming in very slowly. To confirm this I made a simple Bun function and ran it, and sure enough I'm getting around 20 tokens per second with gpt-4.1-nano.
When I run it locally I consistently get over 60, and often over 100. This is obviously one of OpenAI's smaller models and gets relatively high TPS (almost 200 for gpt-4.1-nano: https://artificialanalysis.ai/models/gpt-4-1-nano/providers).
Why is it so slow on Railway? Do I need to move regions? Metal edge?
We have been streaming LLMs for a long time. We're tier 5 on OpenAI and have never had streaming be this slow, especially on the smaller models. Running it on Railway is slower than both local and AWS.
Any tips to speed it up would be great! Thanks!
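For anyone wanting to reproduce the measurement, here's a minimal sketch of the kind of script described above (not the exact one from the post). It streams a chat completion over raw SSE with `fetch` and counts each content delta as roughly one token, which is an approximation; it assumes `OPENAI_API_KEY` is set in the environment:

```typescript
// Rough tokens-per-second measurement for a streamed chat completion.
// Counting SSE "data:" deltas is only an approximation of token count.

function computeTps(tokens: number, elapsedMs: number): number {
  return tokens / (elapsedMs / 1000);
}

async function measureStream(): Promise<void> {
  const start = Date.now();
  let chunks = 0;

  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4.1-nano",
      stream: true,
      messages: [{ role: "user", content: "Write a 200-word story." }],
    }),
  });

  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    // Each SSE "data:" line (excluding the [DONE] sentinel) ~ one delta.
    for (const line of decoder.decode(value).split("\n")) {
      if (line.startsWith("data: ") && !line.includes("[DONE]")) chunks++;
    }
  }

  const tps = computeTps(chunks, Date.now() - start);
  console.log(`~${tps.toFixed(1)} tokens/sec over ${chunks} chunks`);
}

// Only hit the API when a key is actually configured.
if (process.env.OPENAI_API_KEY) {
  measureStream();
}
```

Running this both locally and on Railway with the same model and prompt gives a like-for-like comparison of the two environments.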
8 Replies
9 months ago
If you haven't tried Metal yet then you definitely should, and you can choose a region close to you.
I thought Railway functions were on Metal. Also, US West is closer to OpenAI's servers than where I am, so if anything it should be getting slightly higher TPS than my local machine.
Also, the service where I initially experienced this issue is on Metal West; I'm just using Railway functions now to easily replicate the slow TPS on demand.
9 months ago
Did you enable Metal edge as well?
No. Since it doesn't have a public-facing endpoint, I'm not able to enable it in that case, correct?
9 months ago
Ah right, yeah, probably not relevant in that case.
Either way, 20 TPS on a Bun function with 4.1-nano seems like something is definitely up.
9 months ago
I wonder what would affect TPS here. Are you just calling the API, or are you also doing compute on the instance?
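One way to narrow this down is to record a timestamp for each streamed chunk and compare time-to-first-token against the inter-chunk gaps: a large first-token delay with small, even gaps points at connection setup or routing latency, while uniformly large gaps point at per-chunk buffering (e.g. a proxy) or a slow upstream. A sketch of such a summary helper; the names and interface here are made up for illustration:

```typescript
// Summarize per-chunk arrival timestamps (ms) from a streamed response.
interface StreamTiming {
  ttftMs: number;    // time from request start to first chunk
  meanGapMs: number; // average gap between consecutive chunks
  maxGapMs: number;  // worst single gap between chunks
}

function summarizeTimings(startMs: number, chunkTimesMs: number[]): StreamTiming {
  if (chunkTimesMs.length === 0) {
    return { ttftMs: 0, meanGapMs: 0, maxGapMs: 0 };
  }
  const ttftMs = chunkTimesMs[0] - startMs;
  let totalGap = 0;
  let maxGapMs = 0;
  for (let i = 1; i < chunkTimesMs.length; i++) {
    const gap = chunkTimesMs[i] - chunkTimesMs[i - 1];
    totalGap += gap;
    if (gap > maxGapMs) maxGapMs = gap;
  }
  const gapCount = chunkTimesMs.length - 1;
  return {
    ttftMs,
    meanGapMs: gapCount > 0 ? totalGap / gapCount : 0,
    maxGapMs,
  };
}
```

Feeding it `Date.now()` captured at request start plus one timestamp per `reader.read()` would show whether the slowdown is concentrated up front or spread across every chunk.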