CLI recently started timing out on Create New Environment with --duplicate Flag.
freemarmoset
PROOP

25 days ago

railway environment new --duplicate orphans an empty environment when duplication exceeds the CLI's hardcoded 30s timeout

Summary

railway environment new <name> --duplicate <source> is not atomic. The CLI does it as separate GraphQL requests:

  1. Create a brand-new empty environment.
  2. Fetch the source environment's config.
  3. Apply that config to the new env (this is the step that copies services + volumes).

All of these share one hardcoded 30-second HTTP timeout in the CLI's GraphQL client. When step 3 takes longer than 30s, the CLI aborts with "operation timed out", but the empty environment from step 1 already exists and is left behind, with zero services and no volume.

That orphan ("husk") is permanently broken: it exists, so re-running either skips duplication (name already present) or errors with "an environment with that name already exists", yet it has none of the duplicated services. Recovery means manually deleting it and retrying, and the retry hits the same 30s race. There is no flag, env var, or config to extend the timeout.

Impact

  • CI that duplicates an environment per PR fails intermittently and leaves orphaned empty environments (which also leak volumes/cost).
  • Silent by default: the empty env was created, so partial failure is indistinguishable from success unless the caller adds its own "does this env have services?" check.
  • Non-deterministic: the same command against the same source usually succeeds in ~6s, but times out at 30s a meaningful fraction of the time — purely a function of backend latency on the config-apply step.
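
For the caller-side check mentioned in the second bullet, a minimal sketch in Rust could look like the following. EnvSnapshot, DuplicateOutcome and classify are hypothetical names, not part of the Railway CLI or API, and the counts are assumed to come from whatever query the caller already runs against the backend after the command exits.

```rust
/// Counts a caller might collect for one environment.
/// Hypothetical type; not part of the Railway CLI or API.
#[derive(Debug, Clone, PartialEq)]
pub struct EnvSnapshot {
    pub service_instances: usize,
    pub volumes: usize,
}

/// What the target environment looks like relative to its source.
#[derive(Debug, PartialEq)]
pub enum DuplicateOutcome {
    /// The target environment was never created.
    Missing,
    /// The target exists but nothing was copied into it.
    Husk,
    /// Some, but not all, services or volumes were copied.
    Partial { missing_services: usize, missing_volumes: usize },
    /// The target has at least as many services and volumes as the source.
    Complete,
}

/// Compare the target against the source by counts only.
/// This does not verify that the copied services are configured correctly.
pub fn classify(source: &EnvSnapshot, target: Option<&EnvSnapshot>) -> DuplicateOutcome {
    let t = match target {
        Some(t) => t,
        None => return DuplicateOutcome::Missing,
    };
    let missing_services = source.service_instances.saturating_sub(t.service_instances);
    let missing_volumes = source.volumes.saturating_sub(t.volumes);
    let source_is_empty = source.service_instances == 0 && source.volumes == 0;
    let target_is_empty = t.service_instances == 0 && t.volumes == 0;

    if target_is_empty && !source_is_empty {
        DuplicateOutcome::Husk
    } else if missing_services > 0 || missing_volumes > 0 {
        DuplicateOutcome::Partial { missing_services, missing_volumes }
    } else {
        DuplicateOutcome::Complete
    }
}
```

A CI step could treat anything other than Complete as a failed duplication, even when the environment name is already present.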

Environment

  • CLI: 4.65.0 (latest; also seen on 4.64.0), installed via npm install -g @railway/cli
  • Backend: https://backboard.railway.com/graphql/v2
  • Source env: 3 services (web, worker, Redis) + one ~1 GB Postgres volume

Evidence (real run, CLI 4.65.0)

> Environment name pr-843
> Duplicate from staging
Failed to fetch: error sending request for url (https://backboard.railway.com/graphql/v2)
Caused by:
    0: error sending request for url (https://backboard.railway.com/graphql/v2)
    1: operation timed out
Process completed with exit code 1.

The gap between "Duplicate from staging" and the error is 30.1 seconds — exactly the hardcoded client timeout.

Resulting backend state (queried right after): environment pr-843 exists (created ~20s in) with 0 service instances and no volume. On the same CLI version, a sibling run duplicated the same source successfully in ~6s. So this is backend config-apply latency vs. the 30s cap — not a malformed request or a client-version issue.

Root cause (source, v4.65.0)

src/client.rs build_client() — hardcoded, no override:

Client::builder()
    .default_headers(headers)
    .timeout(Duration::from_secs(30))   // used by post_graphql() for EVERY request
    .build()
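
For illustration, an override could be resolved roughly like this before the value is handed to .timeout(...). This is a sketch, not the CLI's code: RAILWAY_HTTP_TIMEOUT is only the variable name I'm proposing in this post and is not read by 4.65.0, and timeout_from / timeout_from_env are hypothetical helpers.

```rust
use std::time::Duration;

/// Default matching the 30s value hardcoded in the excerpt above.
pub const DEFAULT_TIMEOUT: Duration = Duration::from_secs(30);

/// Parse an override given in whole seconds.
/// Falls back to the default when the value is absent, not a number, or zero.
pub fn timeout_from(raw: Option<&str>) -> Duration {
    raw.and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|&secs| secs > 0)
        .map(Duration::from_secs)
        .unwrap_or(DEFAULT_TIMEOUT)
}

/// Read the hypothetical RAILWAY_HTTP_TIMEOUT variable (an assumption, not an
/// existing CLI setting) and resolve it to a Duration.
pub fn timeout_from_env() -> Duration {
    timeout_from(std::env::var("RAILWAY_HTTP_TIMEOUT").ok().as_deref())
}
```

Falling back to the default on bad input keeps existing behavior unchanged for anyone who does not set the variable.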

src/commands/environment/new.rs new_environment() — non-atomic:

// Step 1: create empty env (source_id: None — backend's atomic-duplicate path is NOT used)
let response = post_graphql::<EnvironmentCreate, _>(...).await?;  // empty env now EXISTS
let env_id = response.environment_create.id.clone();

if let Some(source_env_id) = duplicate_id {
    let source_config = fetch_environment_config(...).await?;     // 30s cap
    let source_instances = get_environment_instances(...).await?; // 30s cap
    apply_environment_config(&client, &configs, &env_id, ...).await?;  // 30s cap — copies services + volume
}
// On timeout/error here, the env from Step 1 is never cleaned up -> husk.

The EnvironmentCreate mutation already accepts a source_id (atomic server-side duplicate), but the CLI passes None and reimplements duplication client-side across multiple round-trips — which is what creates the partial-failure/husk window.

Requests

  1. Make duplicate atomic, or clean up on failure. If any step after the empty-env creation fails (including a timeout), roll back / delete the partial environment instead of leaving a husk — or use the backend's atomic environmentCreate(sourceEnvironmentId) path. This is the real fix: a longer timeout alone still orphans environments when the copy fails partway.
  2. Make the client timeout configurable (RAILWAY_HTTP_TIMEOUT env var or --timeout flag). 30s is too short to duplicate a multi-service + ~1 GB-volume environment, and there's currently no escape hatch.
  3. Investigate the config-apply latency — duplicating this environment intermittently exceeds 30s where it used to complete in single-digit seconds.
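
To make request 1 concrete, here is a minimal sketch of the clean-up-on-failure shape. EnvApi, FakeApi and duplicate_with_rollback are hypothetical names made up for illustration; the real CLI calls post_graphql directly and has no such trait. The in-memory FakeApi only exists so the control flow can be exercised without a backend.

```rust
/// Hypothetical abstraction over the backend calls; not the CLI's real API.
pub trait EnvApi {
    /// Create an empty environment and return its id.
    fn create_empty(&mut self, name: &str) -> Result<String, String>;
    /// Copy services and volumes from the source into the target.
    fn copy_config(&mut self, source_id: &str, target_id: &str) -> Result<(), String>;
    /// Delete an environment by id.
    fn delete(&mut self, env_id: &str) -> Result<(), String>;
}

/// Create the environment, copy the source config, and delete the new
/// environment again if the copy fails, so no husk is left behind.
pub fn duplicate_with_rollback<A: EnvApi>(
    api: &mut A,
    name: &str,
    source_id: &str,
) -> Result<String, String> {
    let env_id = api.create_empty(name)?;
    if let Err(copy_err) = api.copy_config(source_id, &env_id) {
        return match api.delete(&env_id) {
            Ok(()) => Err(format!("duplication failed, rolled back {env_id}: {copy_err}")),
            Err(del_err) => Err(format!(
                "duplication failed ({copy_err}); rollback of {env_id} also failed: {del_err}"
            )),
        };
    }
    Ok(env_id)
}

/// In-memory stand-in used only to demonstrate the flow.
#[derive(Default)]
pub struct FakeApi {
    pub envs: Vec<String>,
    pub fail_copy: bool,
}

impl EnvApi for FakeApi {
    fn create_empty(&mut self, name: &str) -> Result<String, String> {
        if self.envs.iter().any(|e| e == name) {
            return Err(format!("an environment named {name} already exists"));
        }
        self.envs.push(name.to_string());
        Ok(name.to_string()) // the fake uses the name as the id
    }

    fn copy_config(&mut self, _source_id: &str, _target_id: &str) -> Result<(), String> {
        if self.fail_copy {
            Err("operation timed out".to_string())
        } else {
            Ok(())
        }
    }

    fn delete(&mut self, env_id: &str) -> Result<(), String> {
        self.envs.retain(|e| e != env_id);
        Ok(())
    }
}
```

One caveat with client-side rollback: a client timeout does not guarantee the backend stopped working on the copy, so the delete could race with an apply that is still running. That is part of why the server-side atomic path looks like the more robust option.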

note: I selected a service below as an example, but this actually applies to all services


Solved

9 Replies

Status changed to Awaiting Railway Response Railway 25 days ago


25 days ago

Hey! Sorry about the impact here.

Nothing has changed API-surface-wise, but we can see there has been a slight uptick in 504s on service creation API calls as of late.

I've escalated this to our product team so they can take a look; hopefully we'll have a resolution for you here soon. I'll also forward this to our CLI team to see if they can make transactions like this atomic.

Let me know if there's anything else I can help with while the team looks into it further.


Status changed to Awaiting User Response Railway 25 days ago


freemarmoset
PROOP

23 days ago

Hi, thanks for the reply. Just FYI, I did post this as an issue to the CLI team as well: https://github.com/railwayapp/cli/issues/923

I wasn't sure if it was a CLI issue or an infrastructure issue.

I will just note, though, that this is more than a slight uptick for our team personally. I'd say it happens the majority of the time right now. Roughly 75% of our initial attempts fail, and then deleting the environment and retrying will fix it on the second or third try.


Status changed to Awaiting Railway Response Railway 23 days ago


23 days ago

Hey, I'm sorry you're seeing such an elevated error rate here.

Thanks for opening up that issue over on the CLI as well.

As mentioned previously, I have forwarded this up to the team and they're looking into the root cause here. We'll be sure to reach back out when we have more information.

Again, sorry for the impact. Hopefully I can push over more information soon.


Status changed to Awaiting User Response Railway 23 days ago


pol
PRO

20 days ago

Hey, we are seeing the same issue on our side. Is there any update on this? A timeout override as a flag would be a great workaround for us as well.


Status changed to Awaiting Railway Response Railway 20 days ago


mark-antal-csizmadia
HOBBY

20 days ago

Same here, elevated error levels for this


20 days ago

Will bump this to 60s for y'all. Please hold


Status changed to Awaiting User Response Railway 20 days ago



20 days ago

PR merged. New version should have it


freemarmoset
PROOP

15 days ago

I can confirm this has been working much better today for us. Thank you.


Status changed to Awaiting Railway Response Railway 15 days ago


Status changed to Solved noahd 14 days ago

