Hosting Ollama and a MedGemma model
Anonymous
PRO · OP

2 months ago

Hi,

How much would it cost to host Ollama and run the MedGemma model hf.co/unsloth/medgemma-4b-it-GGUF:Q4_K_M on it?

What would the response times be on your infrastructure for low-load pilot demos, and if we want to expand, what would the cost be?

Any additional information you can share would be helpful.

Thanks,

SC

Solved · $20 Bounty

Pinned Solution

ilyassbreth
FREE

2 months ago

Here's what I can share about hosting Ollama with medgemma-4b-it-q4_k_m on Railway:

The setup you'd need:

  • 4-6 vCPU

  • 6 GB RAM (the model file is ~2.5 GB, but inference needs 4-6 GB)

  • 5-10 GB storage

Costs on the Pro plan ($20/month, which includes a $20 usage credit):

  • CPU costs $20/vCPU/month; RAM costs $10/GB/month

  • Railway charges per minute of actual usage, not a flat 24/7 rate

For light pilot demos with low load:

  • Around 10-20% average utilization ≈ roughly $15-30/month

  • Your $20 credit covers most of that, so the extra cost is minimal

If you expand to higher traffic:

  • ~50% utilization ≈ $60-80/month

  • Continuous/production use ≈ $100-150/month
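The estimates above can be sanity-checked with quick arithmetic. This sketch uses the per-unit rates quoted in this reply (not figures from Railway's official pricing page), and 15% as a representative pilot-demo utilization:

```shell
# Full-utilization monthly cost for a 4 vCPU / 6 GB config,
# using the per-unit rates quoted above.
VCPU=4; VCPU_RATE=20      # $/vCPU/month
RAM_GB=6; RAM_RATE=10     # $/GB/month
FULL=$(( VCPU * VCPU_RATE + RAM_GB * RAM_RATE ))   # 4*20 + 6*10 = 140

# Railway bills for actual usage, so scale by average utilization:
PILOT=$(( FULL * 15 / 100 ))    # ~15% utilization -> $21/month
BUSY=$(( FULL * 50 / 100 ))     # ~50% utilization -> $70/month
echo "full: \$$FULL  pilot: \$$PILOT  busy: \$$BUSY"
```

At ~15% utilization the result ($21/month) lands inside the $15-30 pilot range above, and 50% gives $70/month, consistent with the $60-80 estimate.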

Response times: important heads-up: Railway doesn't currently offer GPU compute, so you'd be running CPU-only inference. Expect around 2-6 tokens/second for demos; that works for pilot testing but won't be super snappy.
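You can measure this yourself: Ollama's /api/generate response (with "stream": false) reports eval_count (tokens generated) and eval_duration (nanoseconds), from which tokens/second follows directly. A worked example with made-up sample values in the CPU-only range quoted above:

```shell
# Hypothetical values as they might appear in an /api/generate response:
EVAL_COUNT=120                 # tokens generated
EVAL_DURATION_NS=30000000000   # 30 seconds, in nanoseconds

# tokens/sec = eval_count / eval_duration * 1e9
TPS=$(awk -v c="$EVAL_COUNT" -v d="$EVAL_DURATION_NS" \
      'BEGIN { printf "%.1f", c / d * 1e9 }')
echo "$TPS tokens/sec"   # 4.0 tokens/sec, within the 2-6 tok/s CPU range
```

Run a few representative prompts against your deployed service and plug in the real numbers to see whether CPU inference is fast enough for your demos.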

My suggestion:

  • Use Railway's Ollama template (they have one ready to go)

  • Start with a 4 vCPU / 6 GB RAM config

  • Test it with your actual use case for a few days

  • Check your actual usage in the Railway dashboard
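If the template doesn't fit your needs, a hand-rolled service is only a few lines. A minimal Dockerfile sketch, assuming the official ollama/ollama image (Railway's own Ollama template may be structured differently):

```dockerfile
# Sketch only: builds on the official ollama/ollama image; the Railway
# template may differ.
FROM ollama/ollama:latest

# Listen on all interfaces so Railway's proxy can reach the server.
ENV OLLAMA_HOST=0.0.0.0:11434
EXPOSE 11434

# Start the server, pull the quantized MedGemma build, then keep serving.
ENTRYPOINT ["/bin/sh", "-c", "ollama serve & sleep 3 && ollama pull hf.co/unsloth/medgemma-4b-it-GGUF:Q4_K_M && wait"]
```

Pulling the model at startup keeps the image small but adds a cold-start download; baking the model into the image instead trades image size for faster restarts.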

For pilot demos, this should be totally doable at low cost. If you need to scale significantly or want faster inference later, you might look at GPU providers like RunPod or Replicate.

I hope this helps 🙂

2 Replies

2 months ago

This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.

Status changed to Open · brody · about 2 months ago



Anonymous
PRO · OP

2 months ago

Thanks for the cost breakdown; that was helpful. Is there a serverless option on Railway, and if so, would it reduce cost? I will check RunPod, Replicate, and Modal as well.


Status changed to Solved · brody · about 2 months ago

