22 days ago
Hi,
We’re seeing two issues on our RunPod Serverless endpoints and would like help investigating.
1. Text endpoint (vLLM) – worker crashes
Template: vLLM Serverless, H100 80GB
Model: kosbu/llama-3.3-70b-instruct-awq (AWQ)
Config: 1 active worker, RunPod Serverless Worker v1.8.1
Workers sometimes crash mid-request. Logs show:
ERROR [core_client.py:605] Engine core proc EngineCore_DP0 died unexpectedly, shutting down client.
WARNING [multiproc_executor.py:786] WorkerProc was terminated
Followed by:
Exception ignored in: <function AsyncLLM.__del__ at 0x...>
TypeError: 'NoneType' object is not callable
File ".../vllm/v1/engine/async_llm.py", line 273, in shutdown
Clients then receive 500 - {"error":"Error processing the request"}. This happens intermittently during normal load, not only under spikes.
2. Image endpoint (ComfyUI) – 502 on job-take
Template: ComfyUI Serverless, RTX 5090
Config: 1 active worker
Workers sometimes fail when polling for jobs:
Failed to get job. | Error Type: ClientResponseError | Error Message: 502, message='Bad Gateway',
Questions
For the vLLM endpoint: are these crashes a known issue (e.g. OOM, CUDA, or vLLM async shutdown)? Any recommended config changes or versions?
For the ComfyUI endpoint: is the 502 on job-take a known RunPod API issue, and is there a recommended retry/backoff strategy?
We’ve added retries on our side for 500/502/503, but we’d like to understand root cause and any server-side mitigations.
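For context, here is a minimal sketch of the kind of client-side retry described above, assuming exponential backoff with full jitter. The names `call_with_retries`, `backoff_delays`, and `send` are hypothetical and are not part of the RunPod SDK or our production code. The defaults (5 attempts, 0.5 s base, 30 s cap) are illustrative only.

```python
import random
import time

# Status codes we treat as transient (matches the 500/502/503 mentioned above).
RETRYABLE_STATUSES = {500, 502, 503}


def backoff_delays(max_attempts=5, base=0.5, cap=30.0, rng=random.random):
    """Yield one sleep duration per retry: exponential backoff with full jitter."""
    for attempt in range(max_attempts - 1):
        # Upper bound doubles each retry (base, 2*base, 4*base, ...) up to cap;
        # the actual delay is drawn uniformly from [0, upper bound).
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_retries(send, max_attempts=5, base=0.5, cap=30.0, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    `send` is a hypothetical zero-argument callable returning (status_code, body).
    The last response is returned even if it is still a retryable error.
    """
    delays = backoff_delays(max_attempts, base, cap)
    for _ in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE_STATUSES:
            return status, body
        delay = next(delays, None)
        if delay is None:  # no retries left
            break
        sleep(delay)
    return status, body
```

In this sketch `send` would wrap whatever HTTP call hits the endpoint. The jitter is there so that several clients failing at the same moment do not all retry in lockstep.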
Thanks.
2 Replies
22 days ago
I didn't intend for this post to be public; it was an experiment we tried as a test. Now that it has been made public, though, any insight is very welcome.
22 days ago
This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.
Status changed to Open Railway • 22 days ago
17 days ago
Could you clarify how Railway is related to the issues you're experiencing with RunPod? I'm not sure I follow.
Status changed to Awaiting User Response Railway • 17 days ago
10 days ago
This thread has been marked as solved automatically due to a lack of recent activity. Please re-open this thread or create a new one if you require further assistance. Thank you!
Status changed to Solved Railway • 10 days ago