2 months ago
Hi,
How much would it cost to host Ollama and run the MedGemma model hf.co/unsloth/medgemma-4b-it-GGUF:Q4_K_M on it?
What would the response times be on your infrastructure for low-load pilot demos, and if we want to expand, what would the cost be?
Any more information that you share would be helpful.
Thanks,
SC
Pinned Solution
2 months ago
here's what i can share about hosting ollama with medgemma-4b-it-q4_k_m on railway:
the setup you'd need:
4-6 vcpu
6gb ram (model file is ~2.5gb but needs 4-6gb for inference)
5-10gb storage
costs on pro plan: since you're on pro ($20/month includes $20 usage credit):
cpu costs $20/vcpu/month, ram costs $10/gb/month
railway charges per minute of actual usage, not 24/7
for light pilot demos with low load:
around 10-20% average utilization = roughly $15-30/month
your $20 credit covers most of it, so minimal extra cost
if you expand to higher traffic:
50% utilization = ~$60-80/month
continuous/production use = $100-150/month
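the bands above follow from a simple back-of-envelope: a quick sketch, assuming cost scales linearly with average utilization (railway bills per minute, so idle time isn't charged) and using the pro-plan rates quoted above:

```python
# rough monthly cost estimate for a per-minute-billed railway service.
# assumes cost scales linearly with average utilization; rates are the
# pro-plan figures quoted above ($20/vcpu/month, $10/gb ram/month).

def monthly_cost(vcpu: int, ram_gb: int, utilization: float) -> float:
    """Estimated monthly cost in USD at a given average utilization (0-1)."""
    full_price = vcpu * 20 + ram_gb * 10  # cost if running flat-out 24/7
    return round(full_price * utilization, 2)

# the 4 vcpu / 6 gb config from above:
print(monthly_cost(4, 6, 0.15))  # light pilot demos, ~15% util -> 21.0
print(monthly_cost(4, 6, 0.50))  # heavier traffic -> 70.0
print(monthly_cost(4, 6, 1.00))  # continuous/production -> 140.0
```

your $20 monthly credit then covers most of the pilot-demo figure, which is where the "minimal extra cost" estimate comes from.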
response times: important heads up: railway doesn't offer gpu compute currently, so you're running cpu-only inference. expect around 2-6 tokens/second for demos. it'll work for pilot testing but won't be super snappy
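to turn tokens/second into wall-clock response time, here's a rough sketch; the 150-token answer length is an assumed "typical short reply", not a measured figure, and real throughput varies with prompt length, cpu, and quantization:

```python
# rough wall-clock estimate for generating a reply at cpu-only speeds.
# 2-6 tokens/sec is the range quoted above; 150 output tokens is an
# assumed typical short-answer length, not a measurement.

def response_seconds(output_tokens: int, tokens_per_sec: float) -> float:
    """Seconds to generate output_tokens at a given decode speed."""
    return output_tokens / tokens_per_sec

# a ~150-token answer at the slow and fast ends of the range:
print(response_seconds(150, 2))  # -> 75.0 seconds (slow end)
print(response_seconds(150, 6))  # -> 25.0 seconds (fast end)
```

so even a short answer can take half a minute or more on cpu, which is why gpu providers come up for anything beyond pilot demos.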
my suggestion:
use railway's ollama template (they have one ready to go)
start with 4 vcpu / 6gb ram config
test it with your actual use case for a few days
check your actual usage in the railway dashboard
for pilot demos, this should be totally doable at low cost. if you need to scale significantly or want faster inference later, you might want gpu providers like runpod or replicate
i hope this helps you
2 Replies
2 months ago
This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.
Status changed to Open brody • about 2 months ago
2 months ago
Thanks for the cost breakdown, that was helpful. Is there a serverless option on Railway, and if so, would it reduce cost? I will check with RunPod, Replicate, and Modal as well.
Status changed to Solved brody • about 2 months ago