25 days ago
railway environment new --duplicate orphans an empty environment when duplication exceeds the CLI's hardcoded 30s timeout
Summary
railway environment new <name> --duplicate <source> is not atomic. The CLI does it as separate GraphQL requests:
- Create a brand-new empty environment.
- Fetch the source environment's config.
- Apply that config to the new env (this is the step that copies services + volumes).
All of these share one hardcoded 30-second HTTP timeout in the CLI's GraphQL client. When step 3 takes longer than 30s, the CLI aborts with operation timed out — but the empty environment from step 1 already exists and is left behind, with zero services and no volume.
That orphan ("husk") is permanently broken. Because it exists, re-running either skips duplication (name already present) or fails with "an environment with that name already exists", yet it contains none of the duplicated services. Recovery means manually deleting it and retrying, and the retry hits the same 30s race. There is no flag, env var, or config to extend the timeout.
Impact
- CI that duplicates an environment per PR fails intermittently and leaves orphaned empty environments (which also leak volumes/cost).
- Silent by default: the empty env was created, so partial failure is indistinguishable from success unless the caller adds its own "does this env have services?" check.
- Non-deterministic: the same command against the same source usually succeeds in ~6s, but times out at 30s a meaningful fraction of the time — purely a function of backend latency on the config-apply step.
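Until the CLI handles this itself, a caller can wrap the duplicate in its own "does this env have services?" check and delete-and-retry loop. The following is only a minimal sketch of that control flow: the three closures (`duplicate`, `service_count`, `delete`) and the `expected_services` parameter are hypothetical stand-ins for whatever the caller uses to run the CLI and query the backend, not real Railway APIs.

```rust
/// Retry a duplicate up to `max_attempts` times, deleting any leftover
/// environment between attempts.
///
/// All three closures are hypothetical hooks supplied by the caller:
/// - `duplicate` runs the duplicate command and reports success or an error message,
/// - `service_count` returns `None` if the environment does not exist,
///   otherwise the number of services it contains,
/// - `delete` removes the environment.
///
/// Returns the attempt number that succeeded, or the last error seen.
fn duplicate_with_retry(
    max_attempts: u32,
    expected_services: usize,
    mut duplicate: impl FnMut() -> Result<(), String>,
    mut service_count: impl FnMut() -> Option<usize>,
    mut delete: impl FnMut(),
) -> Result<u32, String> {
    let mut last_err = String::from("no attempts made");
    for attempt in 1..=max_attempts {
        match duplicate() {
            // Only trust a reported success if the services are really there.
            Ok(()) if service_count().map_or(false, |n| n >= expected_services) => {
                return Ok(attempt);
            }
            Ok(()) => {
                last_err = String::from("duplicate reported success but services are missing");
            }
            Err(e) => last_err = e,
        }
        // The attempt failed: remove the husk (if one was left behind) so the
        // next attempt does not hit "name already exists".
        if service_count().is_some() {
            delete();
        }
    }
    Err(last_err)
}
```

This only hides the symptom. Each retry is still subject to the same 30s cap, so it reduces orphaned environments without making any single attempt more likely to succeed.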
Environment
- CLI: 4.65.0 (latest; also seen on 4.64.0), installed via npm install -g @railway/cli
- Backend: https://backboard.railway.com/graphql/v2
- Source env: 3 services (web, worker, Redis) + one ~1 GB Postgres volume
Evidence (real run, CLI 4.65.0)
> Environment name pr-843
> Duplicate from staging
Failed to fetch: error sending request for url (https://backboard.railway.com/graphql/v2)
Caused by:
0: error sending request for url (https://backboard.railway.com/graphql/v2)
1: operation timed out
Process completed with exit code 1.
The gap between "Duplicate from staging" and the error is 30.1 seconds, exactly the hardcoded client timeout.
Resulting backend state (queried right after): environment pr-843 exists (created ~20s in) with 0 service instances and no volume. On the same CLI version, a sibling run duplicated the same source successfully in ~6s. So this is backend config-apply latency vs. the 30s cap — not a malformed request or a client-version issue.
Root cause (source, v4.65.0)
src/client.rs build_client() — hardcoded, no override:
Client::builder()
.default_headers(headers)
.timeout(Duration::from_secs(30)) // used by post_graphql() for EVERY request
.build()
src/commands/environment/new.rs new_environment() (non-atomic):
// Step 1: create empty env (source_id: None — backend's atomic-duplicate path is NOT used)
let response = post_graphql::<EnvironmentCreate, _>(...).await?; // empty env now EXISTS
let env_id = response.environment_create.id.clone();
if let Some(source_env_id) = duplicate_id {
let source_config = fetch_environment_config(...).await?; // 30s cap
let source_instances = get_environment_instances(...).await?; // 30s cap
apply_environment_config(&client, &configs, &env_id, ...).await?; // 30s cap — copies services + volume
}
// On timeout/error here, the env from Step 1 is never cleaned up -> husk.
The EnvironmentCreate mutation already accepts a source_id (atomic server-side duplicate), but the CLI passes None and reimplements duplication client-side across multiple round-trips, which is what creates the partial-failure/husk window.
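The cleanup-on-failure behaviour can be sketched as follows. This is not the CLI's code: the `EnvBackend` trait, `ApiError`, and `MockBackend` are hypothetical names introduced purely for illustration, and the mock stands in for the real GraphQL calls so the control flow can be run in isolation.

```rust
/// Hypothetical error type; the real CLI has its own error handling.
#[derive(Debug, Clone, PartialEq)]
enum ApiError {
    Timeout,
    Other(String),
}

/// Hypothetical abstraction over the backend calls involved in a duplicate.
trait EnvBackend {
    fn create_empty(&mut self, name: &str) -> Result<String, ApiError>;
    fn apply_config(&mut self, env_id: &str, source_id: &str) -> Result<(), ApiError>;
    fn delete(&mut self, env_id: &str) -> Result<(), ApiError>;
}

/// Create an environment and copy `source_id` into it. If the copy fails
/// (including on a timeout), delete the new environment before returning.
fn duplicate_with_rollback<B: EnvBackend>(
    backend: &mut B,
    name: &str,
    source_id: &str,
) -> Result<String, ApiError> {
    let env_id = backend.create_empty(name)?;
    if let Err(err) = backend.apply_config(&env_id, source_id) {
        // Best-effort cleanup: surface the original error even if delete fails.
        let _ = backend.delete(&env_id);
        return Err(err);
    }
    Ok(env_id)
}

/// In-memory mock used only to exercise the control flow above.
#[derive(Default)]
struct MockBackend {
    envs: Vec<String>,
    fail_apply: bool,
}

impl EnvBackend for MockBackend {
    fn create_empty(&mut self, name: &str) -> Result<String, ApiError> {
        if self.envs.iter().any(|e| e == name) {
            return Err(ApiError::Other(format!("environment {name} already exists")));
        }
        self.envs.push(name.to_string());
        Ok(name.to_string())
    }

    fn apply_config(&mut self, _env_id: &str, _source_id: &str) -> Result<(), ApiError> {
        if self.fail_apply {
            Err(ApiError::Timeout)
        } else {
            Ok(())
        }
    }

    fn delete(&mut self, env_id: &str) -> Result<(), ApiError> {
        self.envs.retain(|e| e != env_id);
        Ok(())
    }
}
```

One caveat with this design: if the apply request times out on the client while the backend is still working, the delete races with the in-flight copy, and the delete call itself can time out. Client-side rollback narrows the husk window but does not close it the way a single server-side duplicate would.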
Requests
- Make duplicate atomic, or clean up on failure. If any step after the empty-env creation fails (including a timeout), roll back / delete the partial environment instead of leaving a husk, or use the backend's atomic environmentCreate(sourceEnvironmentId) path. This is the real fix: a longer timeout alone still orphans environments when the copy fails partway.
- Make the client timeout configurable (RAILWAY_HTTP_TIMEOUT env var or --timeout flag). 30s is too short to duplicate a multi-service + ~1 GB-volume environment, and there's currently no escape hatch.
- Investigate the config-apply latency: duplicating this environment intermittently exceeds 30s where it used to complete in single-digit seconds.
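As an illustration of the second request, a timeout override could be read along these lines. This is a sketch only: `RAILWAY_HTTP_TIMEOUT` is the variable name proposed above and does not exist in the CLI today, and `parse_timeout` / `timeout_from_env` are hypothetical helpers. The existing 30-second value is kept as the default.

```rust
use std::time::Duration;

/// Current hardcoded value, kept as the fallback.
const DEFAULT_TIMEOUT_SECS: u64 = 30;

/// Parse a timeout given in whole seconds. Falls back to the default when
/// the value is absent, not a non-negative integer, or zero.
fn parse_timeout(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|&s| s > 0)
        .unwrap_or(DEFAULT_TIMEOUT_SECS);
    Duration::from_secs(secs)
}

/// Read the proposed (hypothetical) RAILWAY_HTTP_TIMEOUT variable.
fn timeout_from_env() -> Duration {
    parse_timeout(std::env::var("RAILWAY_HTTP_TIMEOUT").ok().as_deref())
}
```

The resulting `Duration` would replace the literal `Duration::from_secs(30)` passed to `.timeout(...)` in `build_client()`.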
note: I selected a service below as an example, but this actually applies to all services
9 Replies
Status changed to Awaiting Railway Response Railway • 25 days ago
25 days ago
Hey! Sorry about the impact here.
Nothing has changed API-surface-wise, but we can see there has been a slight uptick in 504s on service-creation API calls lately.
I've escalated this to our product team so they can take a look; hopefully we'll have a resolution for you soon. I'll also forward this to our CLI team to see if they can make transactions like this atomic.
Let me know if there's anything else I can help with while the team looks into it further.
Status changed to Awaiting User Response Railway • 25 days ago
23 days ago
Hi, thanks for the reply. Just FYI, I also posted this as an issue for the CLI team: https://github.com/railwayapp/cli/issues/923
I wasn't sure if it was a CLI issue or an infrastructure issue.
I will just note, though, that this is more than a slight uptick for our team personally. I'd say it happens the majority of the time right now. Roughly 75% of our initial attempts fail, and then deleting the environment and retrying will fix it on the second or third try.
Status changed to Awaiting Railway Response Railway • 23 days ago
23 days ago
Hey, I'm sorry you're seeing such an elevated error rate here.
Thanks for opening up that issue over on the CLI as well.
As mentioned previously, I have forwarded this to the team and they're looking into the root cause. We'll be sure to reach back out when we have more information.
Again, sorry for the impact. Hopefully I can push over more information soon.
Status changed to Awaiting User Response Railway • 23 days ago
20 days ago
Hey, we are seeing the same issue on our side. Is there any update on this? A timeout override as a flag would be a great workaround for us as well.
Status changed to Awaiting Railway Response Railway • 20 days ago
20 days ago
Same here, elevated error levels for this
Status changed to Awaiting User Response Railway • 20 days ago
15 days ago
I can confirm this has been working much better today for us. Thank you.
Status changed to Awaiting Railway Response Railway • 15 days ago
Status changed to Solved noahd • 14 days ago
