25 days ago
Child Python subprocesses spawned by a long-lived Prefect process worker (prefect worker start --pool ...) exit silently with
code -11 (SIGSEGV), ~5–10 seconds after spawn, during Python module import. The parent worker container stays alive — only the
children die. Affects both prefect-worker and prefect-ml-worker in our dev environment, on
python:3.11.8-slim-bookworm.
Evidence
- Loki shows 3 subprocesses starting within 30 ms (17:45:57.976, 17:45:58.005, 17:45:58.006), all reaching module-level imports,
then dying by 17:46:00.494.
- Zero matches across 28,709 worker log lines on 2026-05-18 for OOMKilled, Segmentation fault, Fatal Python error, terminated by
signal, etc. Crash record exists only in Prefect API logs (Flow run process exited with status code: -11).
- Railway list-deployments confirms container was NOT restarted at any crash time — same deployment ID through each crash window.
- Image loads heavy native deps per child: grpcio 1.78.0, cryptography 46.0.5, protobuf 6.33.5, mlflow 3.10.1, opentelemetry
0.60b1. We suspect a C-extension init race under concurrent fork/exec.
Questions
- Does Railway emit any platform-level event when a process inside a container is killed by the kernel (cgroup OOM, signal,
scheduler eviction) that wouldn't appear on stdout? We're not seeing OOMKilled anywhere.
- What memory cap is configured on these two services? (list-services via your CLI returns "Connection reset by peer"
intermittently.)
- Can the runtime capture a core dump from a child subprocess that segfaults?
- Have you seen this pattern from other customers on python:3.11.8-slim-bookworm + grpcio ≥ 1.78?
1 Reply
25 days ago
This does not look like a Railway deployment/container restart. Exit code -11 is SIGSEGV; a cgroup OOM kill is normally SIGKILL / exit 137. If PID 1 (prefect worker start ...) stays alive and only the spawned flow subprocess dies, Railway will not necessarily show a deployment restart because the container itself did not exit.
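As a rough illustration of the two conventions in play: Python's subprocess layer reports a signal death as a negative number (which is why the Prefect log shows -11), while shells and container runtimes report 128 + N (which is where 137 comes from). A minimal sketch for decoding either form follows; `describe_exit` and `signal_name` are hypothetical helper names, not part of Prefect or Railway.

```python
import signal


def signal_name(number: int):
    """Return the signal's symbolic name, or None if the number is not a known signal."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return None


def describe_exit(code: int) -> str:
    """Translate a process exit status into a readable description."""
    if code < 0:
        # subprocess.Popen.returncode convention: -N means "killed by signal N".
        name = signal_name(-code)
        if name:
            return f"killed by {name}"
    elif code > 128:
        # Shell / container convention: 128 + N means "killed by signal N".
        name = signal_name(code - 128)
        if name:
            return f"killed by {name} (reported as 128+N)"
    return f"exited with status {code}"


for status in (-11, -9, 137, 0):
    print(status, "->", describe_exit(status))
```

Under that reading, a cgroup OOM kill of a flow subprocess would most likely show up in the Prefect log as -9 rather than -11.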
For the Railway-specific parts:
- I would not expect a separate Railway platform event for a non-PID-1 child process segfaulting inside a still-running container. Railway exposes deployment logs plus service CPU/memory/disk/network metrics, but not per-child-process crash records. If this were an OOM/resource-limit issue, check Railway Metrics and the container cgroup counters.
- You can check the effective memory limit and OOM counters from inside the running container:
cat /sys/fs/cgroup/memory.max 2>/dev/null
cat /sys/fs/cgroup/memory.current 2>/dev/null
cat /sys/fs/cgroup/memory.peak 2>/dev/null
cat /sys/fs/cgroup/memory.events 2>/dev/null
On cgroup v2, memory.events is the important one. If oom / oom_kill does not increment when the Prefect child dies, this was not a kernel cgroup OOM kill.
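To make the before/after comparison less error-prone, the counters can be parsed and diffed rather than read by eye. This is a minimal sketch, assuming the standard cgroup v2 `memory.events` layout of one `name value` pair per line; the function names are hypothetical.

```python
from pathlib import Path


def parse_memory_events(text: str) -> dict[str, int]:
    """Parse cgroup v2 memory.events content ("name value" per line) into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def oom_kills_between(before: str, after: str) -> int:
    """Number of oom_kill events recorded between two snapshots of memory.events."""
    first = parse_memory_events(before).get("oom_kill", 0)
    second = parse_memory_events(after).get("oom_kill", 0)
    return second - first


if __name__ == "__main__":
    path = Path("/sys/fs/cgroup/memory.events")
    if path.exists():
        print(parse_memory_events(path.read_text()))
    else:
        print("memory.events not found (not cgroup v2, or not running on Linux)")
```

Take one snapshot before triggering a flow run and one after the child dies; a difference of 0 means the kernel did not OOM-kill anything in that cgroup during the window.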
You can also check Railway’s memory metrics with:
railway metrics --service prefect-worker --memory --since 6h --json
railway metrics --service prefect-ml-worker --memory --since 6h --json
Railway’s documented per-replica limits are 8 GB on Hobby and 24 GB on Pro, unless you configured a lower Replica Limit in Service Settings → Deploy → Replica Limits.
For core dumps, I would not rely on host-level core capture on Railway. You can try enabling core files in the container, but /proc/sys/kernel/core_pattern is often host-controlled in managed runtimes. A more reliable first step for Python native crashes is:
export PYTHONFAULTHANDLER=1
export PYTHONMALLOC=debug
and then reproduce with import tracing:
python -X faulthandler -v -c 'import grpc, cryptography, google.protobuf, mlflow, opentelemetry'
Given your timing, I would test this as a native-extension/import-time crash rather than a Railway kill. Temporarily reduce the worker to one child / one flow at a time. If serial execution stops the crashes, that strongly supports the concurrent native dependency init theory.
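The serial-versus-concurrent comparison can also be tried outside Prefect with a small harness. This is a sketch only: `spawn_import_children` is a hypothetical helper, and launching fresh interpreters with `subprocess` is an assumption that may not match exactly how the Prefect process worker starts flow runs.

```python
import subprocess
import sys


def spawn_import_children(modules, count, concurrent=True):
    """Start `count` fresh interpreters that import `modules`; return their exit codes."""
    cmd = [sys.executable, "-X", "faulthandler", "-c", "import " + ", ".join(modules)]
    if not concurrent:
        # Serial: each child finishes before the next one starts.
        return [subprocess.run(cmd).returncode for _ in range(count)]
    # Concurrent: start every child first, then wait for all of them.
    children = [subprocess.Popen(cmd) for _ in range(count)]
    return [child.wait() for child in children]


if __name__ == "__main__":
    # Pass the real modules on the command line, e.g.:
    #   python harness.py grpc cryptography google.protobuf mlflow opentelemetry
    modules = sys.argv[1:] or ["json"]
    for mode in (False, True):
        codes = spawn_import_children(modules, count=3, concurrent=mode)
        print("concurrent" if mode else "serial", codes)
```

An exit code of -11 in the concurrent run but not in the serial run would point at the concurrent-init theory. With faulthandler enabled, a crashing child should also print a Python traceback to stderr before it dies.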
I would also pin-test the likely native packages one at a time, especially grpcio and protobuf:
grpcio==1.76.*
protobuf<6
and test a newer base image such as python:3.11-bookworm or python:3.12-slim-bookworm against the exact same app. python:3.11.8-slim-bookworm is old enough that I would verify on a current patch image before spending much more time debugging Railway itself.
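When comparing pins and base images, it helps to log what each image actually resolved to, so a crash can be tied to a specific combination. A minimal sketch using the standard library; the distribution names in the list are assumptions (in particular `opentelemetry-api` stands in for whichever OpenTelemetry distributions the image installs).

```python
import platform
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed distribution names; adjust to match the image's requirements file.
    names = ["grpcio", "protobuf", "cryptography", "mlflow", "opentelemetry-api"]
    print("python", platform.python_version())
    for name, version in installed_versions(names).items():
        print(name, version or "not installed")
```

Running this at container start in each test image gives one log line per package to set against the crash record for that run.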
Short version: Railway is probably not killing these subprocesses. Confirm with memory.events; if oom_kill is unchanged and the parent container/deployment remains alive, treat this as a segfault in a native Python dependency during concurrent child startup.
If this answer helps solve it, please mark it as the solution so others can find it too.