a month ago
Hello Team,
I’m facing an Out of Memory issue while running Celery with Django on Railway.
When I checked the service metrics, it shows that Celery is using around 3GB of memory, but my Pro plan allows up to 32GB.
Could you please provide some information on why this error occurs and how I can utilize more memory for my Celery workers?
Thank you,
Akshay
26 Replies
a month ago
Hey there! We've found the following might help you get unblocked faster:
If you find the answer from one of these, please let us know by solving the thread!
a month ago
hello?
a month ago
This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.
Status changed to Open brody • about 1 month ago
vayakakshay
hello?
a month ago
Go to your celery service -> settings -> resource limits (shown below) and confirm it's at 32GB.
If it's at ~32GB and it's still getting OOM killed, can you provide your logs?
Attachments
a month ago
No, it's already set to 32GB there.
a month ago
?
vayakakshay
Also, check these logs
a month ago
hey, you have 2 separate issues.
you have to fix the duplicate count within /app/keyword_repo/views.py and ensure that the duplicate_count variable is defined.
For your celery worker, it's just failing because it constantly runs out of memory. One thing you can set is concurrency and max tasks per child to see if that resolves the actual memory issue, so in your start command: celery -A your_app.celery worker --loglevel=info --concurrency=4 --max-tasks-per-child=50. Without directly seeing your source code, I can't tell if there's a small memory leak, but max tasks per child should contain it. You can safely increase it to 100/150 after seeing it doesn't crash.
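if it helps, here's roughly what those same limits look like set on the Celery app itself instead of CLI flags (a sketch only; your_app and the settings module name are placeholders, not from your project):

# celery.py - hypothetical Celery bootstrap for a Django project
import os

from celery import Celery

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "your_app.settings")  # placeholder module

app = Celery("your_app")
app.config_from_object("django.conf:settings", namespace="CELERY")
app.autodiscover_tasks()

# same limits as --concurrency=4 --max-tasks-per-child=50
app.conf.worker_concurrency = 4           # number of prefork child processes
app.conf.worker_max_tasks_per_child = 50  # recycle each child after 50 tasks to contain leaks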
a month ago
I have already limited the concurrency but I'm still getting the issue,
and you can check the metrics.
a month ago
And regarding the duplicate count — you can ignore it. That issue was on my side, and I’ve already resolved it.
a month ago
I have the limit set to 32GB
and these are my celery parameters:
CELERY_WORKER_CONCURRENCY = 7
CELERY_WORKER_PREFETCH_MULTIPLIER = 1
CELERY_TASK_ACKS_LATE = True
CELERY_WORKER_DISABLE_RATE_LIMITS = False
# Aggressive memory management for long-running tasks
CELERY_WORKER_MAX_MEMORY_PER_CHILD = 4194304 # 4 GB in KB
CELERY_WORKER_MAX_TASKS_PER_CHILD = 100
CELERY_WORKER_POOL = 'prefork' # Prefork pool; switch to 'solo' to avoid multiprocessing issues
a month ago
4 GB * 7 = 28GB
a month ago
your celery parameters are causing it. you have 4GB left, but what you're attempting to do is compress everything through zstd, and it's likely crashing because it needs more than 4GB of memory to do that. your main celery worker starts, loads your app and langsmith into memory, then forks 7 child processes, and when it then tries to compress the data from all 7 workers at once, it crashes because it only has 4GB of memory left.
you have 2 paths to fix this: either reduce the worker concurrency to 6, re-test it and see if it stops crashing, or change it to the solo pool. if it's still crashing with 6, then you just need to reduce it to 5. alternatively, you can make code-level changes so 7-worker concurrency works, by batching your langsmith calls + adding timeouts
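as a rough sketch of those two options in the same settings file you pasted (illustrative only; the memory math in the comments assumes the 4GB max-per-child you already set):

# Option 1: keep prefork, but drop concurrency so the children fit under the 32GB limit
CELERY_WORKER_CONCURRENCY = 6                 # 6 children * 4GB max-per-child = 24GB, leaves ~8GB headroom
CELERY_WORKER_MAX_MEMORY_PER_CHILD = 4194304  # 4 GB in KB, unchanged
CELERY_WORKER_MAX_TASKS_PER_CHILD = 100

# Option 2: solo pool - no forked children, tasks run one at a time in the main process
# CELERY_WORKER_POOL = 'solo'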
a month ago
Before that I was using fewer workers, but it still crashed.
a month ago
and one more thing: in the metrics, why is only 3GB consumed?
vayakakshay
and one more thing: in the metrics, why is only 3GB consumed?
a month ago
the crash is probably happening in milliseconds and probably isn't being picked up in the metrics. if it's using less, how many celery workers did you use? in that case, you're attempting to compress large data and it's still crashing, so your only option is to rewrite and optimize your langsmith code
a month ago
but if we decrease it, then how do we handle the 20+ users on the platform?
vayakakshay
but if we decrease it, then how do we handle the 20+ users on the platform?
a month ago
you have to optimize langsmith if you want to handle more celery workers, otherwise you'd just have to reduce it. without seeing your source code, the metrics of your APIs, etc., I can't tell you how it'll handle anything
a month ago
okay, let me try that. but what's your thought on whether it's feasible to use Railway if we have more than 100 users? because I have many agents that run in the background for every single user
vayakakshay
okay, let me try that. but what's your thought on whether it's feasible to use Railway if we have more than 100 users? because I have many agents that run in the background for every single user
a month ago
realistically you can scale by adding replicas and using routing, or if you just want to scale up, go enterprise, but that has a minimum monthly spend
a month ago
that's expensive
a month ago
Can you suggest what kind of changes I can make to optimize langsmith?
vayakakshay
Can you suggest what kind of changes I can make to optimize langsmith?
a month ago
turn off artifacts, store large inputs/outputs in S3 instead of compressing them in memory, and use something like orjson for serialization
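as a rough sketch of the S3 + orjson idea (the bucket name, key layout and helper are made up for illustration, not from your code):

# hypothetical helper: serialize the big result with orjson and push it to S3
# instead of compressing it inside the worker's memory
import uuid

import boto3
import orjson

s3 = boto3.client("s3")
BUCKET = "my-agent-results"  # placeholder bucket name

def offload_result(result: dict) -> str:
    """Store the full result in S3 and return only the object key."""
    key = f"results/{uuid.uuid4()}.json"
    body = orjson.dumps(result)  # bytes; faster and leaner than json.dumps(result).encode()
    s3.put_object(Bucket=BUCKET, Key=key, Body=body, ContentType="application/json")
    return key  # keep only this small string in the task result / trace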
Status changed to Solved ray-chen • 23 days ago