a month ago
Hello Team,
I’m facing an Out of Memory issue while running Celery with Django on Railway.
When I checked the service metrics, it shows that Celery is using around 3GB of memory, but my Pro plan allows up to 32GB.
Could you please provide some information on why this error occurs and how I can utilize more memory for my Celery workers?
Thank you,
Akshay
26 Replies
a month ago
Hey there! We've found the following might help you get unblocked faster:
If you find the answer from one of these, please let us know by solving the thread!
a month ago
hello?
a month ago
This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.
Status changed to Open brody • about 1 month ago
vayakakshay
hello?
a month ago
Go to your celery service -> settings -> resource limits (shown below) and confirm it's at 32GB.
If it's at ~32GB and it's still getting OOM killed, can you provide your logs?
Attachments
a month ago
No, it's already set to 32GB there.
a month ago
?
vayakakshay
Also, check these logs
a month ago
hey, you have 2 separate issues.
you have to fix the duplicate count within /app/keyword_repo/views.py and ensure that the duplicate_count variable is defined.
For your celery worker, it's just failing because it constantly runs out of memory. One thing you can set is concurrency and max tasks per child to see if that resolves the actual memory issue, so in your start command: celery -A your_app.celery worker --loglevel=info --concurrency=4 --max-tasks-per-child=50. Without directly seeing your source code, I can't tell if there's a small memory leak, but max tasks per child should contain it. You can safely increase it to 100/150 after seeing it doesn't crash.
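if it helps, here's roughly what those same limits look like set on the Celery app itself instead of CLI flags (a sketch only; your_app and the settings module name are placeholders, not from your project):

# celery.py - hypothetical Celery bootstrap for a Django project
import os

from celery import Celery

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "your_app.settings")  # placeholder module

app = Celery("your_app")
app.config_from_object("django.conf:settings", namespace="CELERY")
app.autodiscover_tasks()

# same limits as --concurrency=4 --max-tasks-per-child=50
app.conf.worker_concurrency = 4           # number of prefork child processes
app.conf.worker_max_tasks_per_child = 50  # recycle each child after 50 tasks to contain leaks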
a month ago
I have already limited the concurrency but I'm still getting the issue,
and you can check the metrics.
a month ago
And regarding the duplicate count — you can ignore it. That issue was on my side, and I’ve already resolved it.
a month ago
I have the limit set to 32GB
and these are my celery parameters:
CELERY_WORKER_CONCURRENCY = 7
CELERY_WORKER_PREFETCH_MULTIPLIER = 1
CELERY_TASK_ACKS_LATE = True
CELERY_WORKER_DISABLE_RATE_LIMITS = False
# Aggressive memory management for long-running tasks
CELERY_WORKER_MAX_MEMORY_PER_CHILD = 4194304 # 4 GB in KB
CELERY_WORKER_MAX_TASKS_PER_CHILD = 100
CELERY_WORKER_POOL = 'prefork' # Prefork pool; switch to 'solo' to avoid multiprocessing issues
a month ago
4 GB * 7 = 28GB
a month ago
your celery parameters are causing it. you have 4GB left, but what you're attempting to do is compress everything through zstd, and it's likely crashing because it needs more than 4GB of memory to do that. your main celery worker starts, loads your app and langsmith into memory, then forks 7 child processes, and when it then tries to compress the data from all 7 workers at once, it crashes because it only has 4GB of memory left.
you have 2 paths to fix this: either reduce the worker concurrency to 6, re-test it and see if it stops crashing, or change it to the solo pool. if it's still crashing with 6, then you just need to reduce it to 5. alternatively, you can make code-level changes so 7-worker concurrency works, by batching your langsmith calls + adding timeouts
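as a rough sketch of those two options in the same settings file you pasted (illustrative only; the memory math in the comments assumes the 4GB max-per-child you already set):

# Option 1: keep prefork, but drop concurrency so the children fit under the 32GB limit
CELERY_WORKER_CONCURRENCY = 6                 # 6 children * 4GB max-per-child = 24GB, leaves ~8GB headroom
CELERY_WORKER_MAX_MEMORY_PER_CHILD = 4194304  # 4 GB in KB, unchanged
CELERY_WORKER_MAX_TASKS_PER_CHILD = 100

# Option 2: solo pool - no forked children, tasks run one at a time in the main process
# CELERY_WORKER_POOL = 'solo'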
a month ago
Before that I was using fewer workers, but it still crashed.
a month ago
and one more thing: in the metrics, why is only 3GB consumed?
vayakakshay
and one more thing: in the metrics, why is only 3GB consumed?
a month ago
the crash is probably happening in milliseconds and probably isn't being picked up in the metrics. if it's using less, how many celery workers did you use? in that case, you're attempting to compress large data and it's still crashing, so your only option is to rewrite and optimize your langsmith code
a month ago
but if we decrease it, then how do we handle the 20+ users on the platform?
vayakakshay
but if we decrease it, then how do we handle the 20+ users on the platform?
a month ago
you have to optimize langsmith if you want to handle more celery workers, otherwise you'd just have to reduce it. without seeing your source code, the metrics of your APIs, etc., I can't tell you how it'll handle anything
a month ago
okay, let me try that. but what's your thought on whether it's feasible to use Railway if we have more than 100 users? because I have many agents that run in the background for every single user
vayakakshay
okay, let me try that. but what's your thought on whether it's feasible to use Railway if we have more than 100 users? because I have many agents that run in the background for every single user
a month ago
realistically you can scale by adding replicas and using routing, or if you just want to scale up, go enterprise, but that has a minimum monthly spend
a month ago
that's expensive
a month ago
Can you suggest what kind of changes I can make to optimize langsmith?
vayakakshay
Can you suggest what kind of changes I can make to optimize langsmith?
a month ago
turn off artifacts, store large inputs/outputs in S3 instead of compressing them in memory, and use something like orjson for serialization
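as a rough sketch of the S3 + orjson idea (the bucket name, key layout and helper are made up for illustration, not from your code):

# hypothetical helper: serialize the big result with orjson and push it to S3
# instead of compressing it inside the worker's memory
import uuid

import boto3
import orjson

s3 = boto3.client("s3")
BUCKET = "my-agent-results"  # placeholder bucket name

def offload_result(result: dict) -> str:
    """Store the full result in S3 and return only the object key."""
    key = f"results/{uuid.uuid4()}.json"
    body = orjson.dumps(result)  # bytes; faster and leaner than json.dumps(result).encode()
    s3.put_object(Bucket=BUCKET, Key=key, Body=body, ContentType="application/json")
    return key  # keep only this small string in the task result / trace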
Status changed to Solved ray-chen • 23 days ago