Serverless worker crashes (H100 text) and 502s on ComfyUI image endpoint
self-hq
PRO OP

22 days ago

Hi,

We’re seeing two issues on our RunPod Serverless endpoints and would like help investigating.

1. Text endpoint (vLLM) – worker crashes

  • Template: vLLM Serverless, H100 80GB

  • Model: kosbu/llama-3.3-70b-instruct-awq (AWQ)

  • Config: 1 active worker, RunPod Serverless Worker v1.8.1

Workers sometimes crash mid-request. Logs show:

ERROR [core_client.py:605] Engine core proc EngineCore_DP0 died unexpectedly, shutting down client.

WARNING [multiproc_executor.py:786] WorkerProc was terminated

Followed by:

Exception ignored in: <function AsyncLLM.__del__ at 0x...>

TypeError: 'NoneType' object is not callable

  File ".../vllm/v1/engine/async_llm.py", line 273, in shutdown

Clients then receive 500 - {"error":"Error processing the request"}. This happens intermittently during normal load, not only under spikes.

2. Image endpoint (ComfyUI) – 502 on job-take

  • Template: ComfyUI Serverless, RTX 5090

  • Config: 1 active worker

Workers sometimes fail when polling for jobs:

Failed to get job. | Error Type: ClientResponseError | Error Message: 502, message='Bad Gateway', 

Questions

  1. For the vLLM endpoint: are these crashes a known issue (e.g. OOM, CUDA, or vLLM async shutdown)? Any recommended config changes or versions?

  2. For the ComfyUI endpoint: is the 502 on job-take a known RunPod API issue, and is there a recommended retry/backoff strategy?

We’ve added retries on our side for 500/502/503, but we’d like to understand root cause and any server-side mitigations.
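For context, here is a minimal sketch of what client-side retries for those status codes might look like, assuming exponential backoff with full jitter. The names `call_with_retries` and `backoff_delay`, the base delay, the cap, and the attempt count are all hypothetical choices for illustration, not a RunPod recommendation and not necessarily what runs in production.

```python
import random
import time
from typing import Callable, Tuple

# Status codes treated as retryable (the ones mentioned in this post).
RETRYABLE_STATUSES = {500, 502, 503}


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a non-retryable status or attempts run out.

    `send` is a hypothetical callable that performs one request and returns
    (status_code, body). The last response is returned either way.
    """
    status, body = send()
    for attempt in range(max_attempts - 1):
        if status not in RETRYABLE_STATUSES:
            break
        sleep(backoff_delay(attempt))
        status, body = send()
    return status, body
```

The jitter spreads retries out so that several clients hitting the same failing worker do not retry in lockstep. Retrying a 500 that follows a worker crash resubmits the whole request, so this only makes sense where a duplicate generation is acceptable.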

Thanks.

Solved · $20 Bounty

2 Replies

self-hq
PRO OP

22 days ago

I didn't intend for this post to be public; it was an experiment we tried as a test. Now that it has been made public, though, any insight is very welcome.


Railway
BOT

22 days ago

This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.

Status changed to Open Railway 22 days ago


17 days ago

Could you clarify how Railway is related to the issues you're experiencing with RunPod? I'm not sure I follow


Status changed to Awaiting User Response Railway 17 days ago


Railway
BOT

10 days ago

This thread has been marked as solved automatically due to a lack of recent activity. Please re-open this thread or create a new one if you require further assistance. Thank you!

Status changed to Solved Railway 10 days ago

