Hosting Ollama and a MedGemma model
Anonymous
PRO · OP

2 months ago

Hi,

How much would it cost to host Ollama and run the MedGemma model hf.co/unsloth/medgemma-4b-it-GGUF:Q4_K_M on it?

What would the response times be on your infrastructure for low-load pilot demos, and if we want to expand, what would the cost be?

Any additional information you can share would be helpful.

Thanks,

SC

Solved · $20 Bounty

Pinned Solution

ilyassbreth
FREE

2 months ago

Here's what I can share about hosting Ollama with medgemma-4b-it-q4_k_m on Railway:

The setup you'd need:

  • 4-6 vCPU

  • 6 GB RAM (the model file is ~2.5 GB, but inference needs 4-6 GB)

  • 5-10 GB storage

Costs on the Pro plan ($20/month, which includes a $20 usage credit):

  • CPU costs $20/vCPU/month; RAM costs $10/GB/month

  • Railway charges per minute of actual usage, not a flat 24/7 rate

For light pilot demos with low load:

  • Around 10-20% average utilization ≈ roughly $15-30/month

  • Your $20 credit covers most of that, so the extra cost is minimal

If you expand to higher traffic:

  • ~50% utilization ≈ $60-80/month

  • Continuous/production use ≈ $100-150/month
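The estimates above can be sanity-checked with quick arithmetic. This sketch uses the per-unit rates quoted in this reply (not figures from Railway's official pricing page), and 15% as a representative pilot-demo utilization:

```shell
# Full-utilization monthly cost for a 4 vCPU / 6 GB config,
# using the per-unit rates quoted above.
VCPU=4; VCPU_RATE=20      # $/vCPU/month
RAM_GB=6; RAM_RATE=10     # $/GB/month
FULL=$(( VCPU * VCPU_RATE + RAM_GB * RAM_RATE ))   # 4*20 + 6*10 = 140

# Railway bills for actual usage, so scale by average utilization:
PILOT=$(( FULL * 15 / 100 ))    # ~15% utilization -> $21/month
BUSY=$(( FULL * 50 / 100 ))     # ~50% utilization -> $70/month
echo "full: \$$FULL  pilot: \$$PILOT  busy: \$$BUSY"
```

At ~15% utilization the result ($21/month) lands inside the $15-30 pilot range above, and 50% gives $70/month, consistent with the $60-80 estimate.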

Response times: important heads-up: Railway doesn't currently offer GPU compute, so you'd be running CPU-only inference. Expect around 2-6 tokens/second for demos; that works for pilot testing but won't be super snappy.
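You can measure this yourself: Ollama's /api/generate response (with "stream": false) reports eval_count (tokens generated) and eval_duration (nanoseconds), from which tokens/second follows directly. A worked example with made-up sample values in the CPU-only range quoted above:

```shell
# Hypothetical values as they might appear in an /api/generate response:
EVAL_COUNT=120                 # tokens generated
EVAL_DURATION_NS=30000000000   # 30 seconds, in nanoseconds

# tokens/sec = eval_count / eval_duration * 1e9
TPS=$(awk -v c="$EVAL_COUNT" -v d="$EVAL_DURATION_NS" \
      'BEGIN { printf "%.1f", c / d * 1e9 }')
echo "$TPS tokens/sec"   # 4.0 tokens/sec, within the 2-6 tok/s CPU range
```

Run a few representative prompts against your deployed service and plug in the real numbers to see whether CPU inference is fast enough for your demos.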

My suggestion:

  • Use Railway's Ollama template (they have one ready to go)

  • Start with a 4 vCPU / 6 GB RAM config

  • Test it with your actual use case for a few days

  • Check your actual usage in the Railway dashboard
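If the template doesn't fit your needs, a hand-rolled service is only a few lines. A minimal Dockerfile sketch, assuming the official ollama/ollama image (Railway's own Ollama template may be structured differently):

```dockerfile
# Sketch only: builds on the official ollama/ollama image; the Railway
# template may differ.
FROM ollama/ollama:latest

# Listen on all interfaces so Railway's proxy can reach the server.
ENV OLLAMA_HOST=0.0.0.0:11434
EXPOSE 11434

# Start the server, pull the quantized MedGemma build, then keep serving.
ENTRYPOINT ["/bin/sh", "-c", "ollama serve & sleep 3 && ollama pull hf.co/unsloth/medgemma-4b-it-GGUF:Q4_K_M && wait"]
```

Pulling the model at startup keeps the image small but adds a cold-start download; baking the model into the image instead trades image size for faster restarts.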

For pilot demos, this should be totally doable at low cost. If you need to scale significantly or want faster inference later, you might look at GPU providers like RunPod or Replicate.

I hope this helps 🙂

2 Replies

2 months ago

This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.

Status changed to Open · brody · about 2 months ago



Anonymous
PRO · OP

2 months ago

Thanks for the cost breakdown; that was helpful. Is there a serverless option on Railway, and if so, would it reduce cost? I will check RunPod, Replicate, and Modal as well.


Status changed to Solved · brody · about 2 months ago

