5 months ago
Hi,
How much would it cost to host Ollama and run the MedGemma model hf.co/unsloth/medgemma-4b-it-GGUF:Q4_K_M on it?
What would the response times be on your infrastructure for low-load pilot demos, and what would it cost if we want to expand?
Any more information you can share would be helpful.
Thanks,
SC
Pinned Solution
5 months ago
here's what i can share about hosting ollama with medgemma-4b-it-q4_k_m on railway:
the setup you'd need:
- 4-6 vcpu
- 6gb ram (model file is ~2.5gb but needs 4-6gb for inference)
- 5-10gb storage
costs on the pro plan ($20/month, which includes $20 of usage credit):
- cpu costs $20/vcpu/month, ram costs $10/gb/month
- railway bills per minute for the resources you actually use, not a flat 24/7 rate
for light pilot demos with low load:
- around 10-20% average utilization = roughly $15-30/month
- your $20 credit covers most of it, so minimal extra cost
if you expand to higher traffic:
- 50% utilization = ~$60-80/month
- continuous/production use = $100-150/month
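to sanity-check those ranges, here's a rough sketch of the arithmetic using the rates quoted above. it assumes a 4 vcpu / 6gb service and that cpu and ram are both billed in proportion to one average utilization figure, which is a simplification (a loaded model tends to hold its ram even when idle). the function names are made up for illustration, so treat the output as a ballpark, not a quote:

```python
# Rates quoted in this thread (USD per month at full, continuous use).
VCPU_PER_MONTH = 20.0
RAM_GB_PER_MONTH = 10.0
PRO_CREDIT = 20.0  # usage credit included in the Pro plan


def monthly_cost(vcpu: float, ram_gb: float, utilization: float) -> float:
    """Estimated usage cost, assuming CPU and RAM scale with one utilization figure."""
    full_cost = vcpu * VCPU_PER_MONTH + ram_gb * RAM_GB_PER_MONTH
    return full_cost * utilization


def out_of_pocket(cost: float, credit: float = PRO_CREDIT) -> float:
    """Usage cost left over after the included credit is applied."""
    return max(0.0, cost - credit)


if __name__ == "__main__":
    # Assumed config: 4 vCPU / 6 GB RAM, at a few utilization levels.
    for utilization in (0.10, 0.20, 0.50, 1.00):
        cost = monthly_cost(4, 6, utilization)
        print(f"{utilization:.0%}: ${cost:.2f} usage, ${out_of_pocket(cost):.2f} after credit")
```

swap in the utilization you actually see in the dashboard once the pilot has been running for a few days.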
response times: important heads up, railway doesn't offer gpu compute currently, so you're running cpu-only inference. expect around 2-6 tokens/second for demos. it'll work for pilot testing but won't be super snappy
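if you want to measure this yourself instead of trusting my estimate, here's a minimal sketch. it assumes your ollama service is reachable at `base_url` (the default local port is shown, yours will be whatever url your deployment exposes) and uses the `eval_count` and `eval_duration` fields that ollama's `/api/generate` returns for a non-streaming request. the helper names are made up for illustration:

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama response (eval_duration is in nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def seconds_for(tokens: int, rate: float) -> float:
    """Rough time to generate `tokens` tokens at `rate` tokens/second."""
    return tokens / rate


def generate(prompt: str, model: str, base_url: str = "http://localhost:11434") -> dict:
    """Send one non-streaming request to Ollama's /api/generate endpoint."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    # CPU-only inference can be slow, so allow a generous timeout.
    with urllib.request.urlopen(request, timeout=600) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Assumed model tag: the one from the original question.
    result = generate(
        "Summarize the common symptoms of anemia.",
        "hf.co/unsloth/medgemma-4b-it-GGUF:Q4_K_M",
    )
    rate = tokens_per_second(result)
    print(f"{rate:.1f} tokens/s, ~{seconds_for(200, rate):.0f}s for a 200-token reply")
```

run it a few times with prompts that look like your real demo traffic, since the first request also pays for loading the model.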
my suggestion:
- use railway's ollama template (they have one ready to go)
- start with 4 vcpu / 6gb ram config
- test it with your actual use case for a few days
- check your actual usage in the railway dashboard
for pilot demos, this should be totally doable at low cost. if you need to scale significantly or want faster inference later, you might want gpu providers like runpod or replicate
i hope this helps you 🙂
2 Replies
5 months ago
This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.
Status changed to Open brody • 5 months ago
5 months ago
ilyassbreth
5 months ago
Thanks for the cost breakdown, that was helpful. Is there a serverless option on Railway, and if so, would it reduce the cost? I will check with RunPod, Replicate and Modal as well.
Status changed to Solved brody • 5 months ago