Deploys failing at container creation 24h+ — diagnosis says platform issue

jonathannunezf-coder

PRO OP

2 months ago

I've had multiple consecutive failed deploys over the last 24+ hours, all failing at the same stage. Submitting here as a follow-up to a private support email sent yesterday (no response yet).

What's failing:

- Build ✓ passes

- Image publish ✓ passes

- Create container ✗ fails consistently

- Network / Post-deploy never start

Railway's own diagnosis on the failure panel says verbatim:

"The build and image publish completed successfully, but the deployment failed at the container creation stage. This same failure pattern has occurred on previous deployments of this service, pointing to a platform issue rather than a problem with the code or configuration."

Context:

- App: https://ndversa.up.railway.app (production, real users)

- GitHub: https://github.com/jonathannunezf-coder/Autitinder

- Same commit deployed successfully earlier in the same timeframe, so it's not a code-path change.

- This is blocking active recovery work for a data loss incident on the mounted volume.

Has anyone hit this recently, or can a Railway engineer take a look at the container creation pipeline for this project? Happy to share project ID and deploy IDs privately if needed.

Thanks.

Solved

6 Replies

Status changed to Awaiting Railway Response Railway • about 2 months ago

brody

EMPLOYEE

2 months ago

Looks like you've had a successful deploy since, so this looks resolved on your end.

Status changed to Awaiting User Response Railway • about 2 months ago

jonathannunezf-coder

PRO OP

2 months ago

Thanks for the follow-up, but I think there's a misunderstanding. The successful deploy you're seeing is from ~3 days ago (commit 8ed8c52); that's the last one that worked. The deploys I'm reporting as broken are the ones I triggered yesterday (May 9–10) for commits 6c116ba, a564120, and 2ddd201. All three failed at the "Create container" stage with "Infrastructure Error":

- 4b25e043-ae0b-49bf-a480-df69b7289caa

- 59bbf878-4f8f-4c49-972f-3e86d8f84b3f

- b3812126-a244-464e-bdd6-390f199f299f

Build and image publish pass on all three. Container creation is where it consistently fails. Dockerfile, nixpacks.toml, and railway.toml are unchanged from the last successful deploy, so this isn't a config regression on my side. As of right now I still can't ship any new code to production: every deploy attempt fails at the same stage.

Could you check those three deploy IDs? Project ID d8713303-89af-4f64-8a53-9cd1a98a5a0f if you need it.

Status changed to Awaiting Railway Response Railway • about 2 months ago

jonathannunezf-coder

PRO OP

2 months ago

Following up: I just pushed two new commits (b614654, 09ef667) about 10 minutes ago, and Railway just produced another failed deploy with the same "Infrastructure Error" at the "Create container" stage:

- 21da516e-f487-40b1-b009-62e123a6bee7: failed (commit 09ef667)

So the issue is clearly still active. The "successful deploy" you saw is the one from commit 8ed8c52, which is 3 days old. Every deploy attempt since then has failed at the same stage, including this fresh one from a few minutes ago. Dockerfile, nixpacks.toml, and railway.toml are unchanged from the last working state.

Could a Railway engineer take a look at the container creation logs for the failed deploy above? Project ID: d8713303-89af-4f64-8a53-9cd1a98a5a0f. This service has been unable to ship any updates for ~3 days now.

codydearkland

EMPLOYEE

2 months ago

Hey Jonathan — your read on this is right; every deploy since the May 8 snapshot restore has been failing at container creation. Here's what's going on:

The snapshot restore created a new volume alongside the existing /data rather than replacing it. The new volume took its mount path from the source as it was at snapshot time, which had a trailing space ("/data " instead of "/data"). The two volumes also landed in different AWS Availability Zones (us-west-1a and us-west-1c). A single container only runs on one host, so every deploy lands on a host where one of the two volume devices doesn't exist locally. That fails with ENOENT and surfaces as "Infrastructure Error" with no container logs. The deploy panel's "platform issue" diagnosis was right.

Editing the original volume's mount path to clean up the trailing space afterward didn't help — the two-volumes-in-different-AZs state is what's actually breaking the deploy.
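
For readers following along, the placement problem described above can be modeled in a few lines. This is only an illustrative sketch, not Railway's scheduler: the `Volume` class, the `schedulable_zones` helper, and the assignment of each volume to a specific zone are assumptions made for the example (the thread names the two zones but not which volume sits in which).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Volume:
    name: str        # hypothetical label, not a Railway identifier
    mount_path: str
    zone: str


def schedulable_zones(volumes, candidate_zones):
    """Return the zones in which every attached volume is locally present."""
    return {
        zone
        for zone in candidate_zones
        if all(volume.zone == zone for volume in volumes)
    }


# Assumed assignment: which volume is in which zone is not stated in the thread.
live = Volume("live", "/data", "us-west-1a")
snapshot = Volume("snapshot", "/data ", "us-west-1c")
zones = {"us-west-1a", "us-west-1c"}

# With both volumes attached, no single zone holds both devices.
print(schedulable_zones([live, snapshot], zones))

# With only the live volume attached, its own zone is a valid placement.
print(schedulable_zones([live], zones))
```

In this model the set of valid zones is empty whenever two attached volumes live in different zones, which matches the observation that fixing the mount path alone did not unblock deploys.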

Your recovered data is intact. The snapshot volume is preserved with the full restore.

To unblock you we need to detach one of the two volumes. Before we do, what did you want the restore to do?

A. Keep the live /data running and just give you access to the recovered data. We detach the snapshot volume; your next deploy succeeds. You can then attach the snapshot volume to a different service at a clean mount path and copy what you need out of it.

B. Replace the live /data with the snapshot contents. We detach the original /data and reattach the snapshot volume as /data; your next deploy runs against the recovered state.

Tell us which and we'll do it now.

On the platform side: this state shouldn't have been creatable. We'll be filing fixes so snapshot restore can't attach a second volume across AZs, and so trailing whitespace in a mount path doesn't validate.
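
As a rough illustration of the two checks mentioned above, here is a minimal sketch of what such validation could look like. It is not Railway's implementation; `check_volume_attach` and its error messages are hypothetical names chosen for the example.

```python
def check_volume_attach(mount_path, new_zone, attached_zones):
    """Return a list of problems with attaching a volume; empty means OK.

    mount_path:     path requested for the new volume
    new_zone:       zone the new volume lives in
    attached_zones: zones of volumes already attached to the service
    """
    errors = []

    # Reject paths such as "/data " that differ from "/data" only by whitespace.
    if mount_path != mount_path.strip():
        errors.append("mount path has leading or trailing whitespace")

    if not mount_path.strip().startswith("/"):
        errors.append("mount path is not absolute")

    # A container runs on one host, so all volumes must share a zone.
    if any(zone != new_zone for zone in attached_zones):
        errors.append("volume is in a different zone than an attached volume")

    return errors


# The state from this thread would be rejected on both counts.
print(check_volume_attach("/data ", "us-west-1c", ["us-west-1a"]))

# A clean path in the same zone passes.
print(check_volume_attach("/recovered", "us-west-1a", ["us-west-1a"]))
```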

Status changed to Awaiting User Response Railway • about 2 months ago

jonathannunezf-coder

PRO OP

2 months ago

Hey, option A please. Keep the live /data attached and serving, detach the snapshot volume. Our production data is on the live one; the snapshot was observed empty when we mounted it back on May 9 (which is what kicked off this whole thread), so we don't want to swap it in blind. Once you've detached it and our next deploy goes through, we'll attach the snapshot to a side service at a clean mount path to inspect what's actually on it before deciding whether anything needs to be copied across.

Thanks for the diagnosis and the platform-side fixes — the AZ split + trailing-whitespace combo was wild.

Status changed to Awaiting Railway Response Railway • about 2 months ago

nico

EMPLOYEE

2 months ago

Done - we've detached the snapshot volume. Your live /data volume is still attached and untouched. Your next deploy should go through. The snapshot volume is preserved, so whenever you're ready you can attach it to a side service at a clean mount path to inspect its contents.
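
For the inspection step, a small script run inside the side service can give a quick read on whether the snapshot volume holds anything. This is a minimal sketch using only the Python standard library; `summarize_tree` is a hypothetical helper, and the mount path you pass to it is whatever you choose when attaching the volume.

```python
import os


def summarize_tree(root):
    """Return (file_count, total_bytes) for all files under root.

    Unreadable entries are skipped. A missing or empty root yields (0, 0).
    """
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                # lstat avoids following symlinks out of the volume.
                size = os.lstat(full_path).st_size
            except OSError:
                continue
            file_count += 1
            total_bytes += size
    return file_count, total_bytes
```

Calling `summarize_tree("/snapshot")` (assuming the volume were mounted at `/snapshot`) returns the number of files and their combined size. A result of `(0, 0)` would be consistent with the volume being empty or the path being wrong, so it is worth double-checking the mount path before drawing conclusions.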

Status changed to Awaiting User Response Railway • about 2 months ago

Status changed to Solved nico • about 2 months ago
