hello, I noticed the streamed LLM chunks from my app on Railway coming in very slowly. To confirm this, I wrote a simple Bun script and ran it, and sure enough I'm getting around 20 tokens per second with gpt-4.1-nano.
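For context, the test script is roughly like the sketch below: stream a chat completion, count SSE data chunks as a proxy for tokens, and divide by elapsed time. Names like `measureStreamTps` are my own, and chunk-counting only approximates real token throughput, but it's enough to compare environments.

```typescript
// Rough throughput estimate: tokens (or chunks) per elapsed second.
function tokensPerSecond(tokenCount: number, startMs: number, endMs: number): number {
  return tokenCount / ((endMs - startMs) / 1000);
}

// Sketch of the measurement loop (assumes OPENAI_API_KEY is set; each SSE
// "data:" event is counted as one chunk, which roughly tracks one token).
async function measureStreamTps(prompt: string): Promise<number> {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4.1-nano",
      stream: true,
      messages: [{ role: "user", content: prompt }],
    }),
  });
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let chunks = 0;
  const start = Date.now();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    chunks += decoder
      .decode(value)
      .split("\n")
      .filter((line) => line.startsWith("data:")).length;
  }
  return tokensPerSecond(chunks, start, Date.now());
}
```

Locally this reports well over 60 TPS; on Railway the same script reports around 20.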
When I run the same script locally I consistently get over 60 TPS, and often over 100. gpt-4.1-nano is one of OpenAI's smaller models and normally gets relatively high TPS (almost 200 per Artificial Analysis: https://artificialanalysis.ai/models/gpt-4-1-nano/providers).
Why is it so slow on Railway? Do I need to move regions? Metal edge?
We have been streaming LLM responses for a long time. We're tier 5 on OpenAI and have never seen streaming this slow, especially on the smaller models. Running on Railway is slower than both local and AWS.
Any tips to speed it up would be great! Thanks!