CLI recently started timing out on Create New Environment with --duplicate Flag.

freemarmoset

PRO OP

25 days ago

`railway environment new --duplicate` orphans an empty environment when duplication exceeds the CLI's hardcoded 30s timeout

Summary

railway environment new <name> --duplicate <source> is not atomic. The CLI does it as separate GraphQL requests:

1. Create a brand-new empty environment.
2. Fetch the source environment's config.
3. Apply that config to the new env (this is the step that copies services + volumes).

All of these share one hardcoded 30-second HTTP timeout in the CLI's GraphQL client. When step 3 takes longer than 30s, the CLI aborts with operation timed out — but the empty environment from step 1 already exists and is left behind, with zero services and no volume.

That orphan ("husk") is permanently broken: it exists, so re-running either skips duplication (name already present) or fails with "an environment with that name already exists", yet it has none of the duplicated services. Recovery means manually deleting it and retrying, and the retry hits the same 30s race. There is no flag, env var, or config to extend the timeout.

Impact

CI that duplicates an environment per PR fails intermittently and leaves orphaned empty environments (which also leak volumes/cost).
Silent by default: the empty env was created, so partial failure is indistinguishable from success unless the caller adds its own "does this env have services?" check.
Non-deterministic: the same command against the same source usually succeeds in ~6s, but times out at 30s a meaningful fraction of the time — purely a function of backend latency on the config-apply step.
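
Until the CLI handles this itself, the caller-side check mentioned above can be sketched roughly as follows. This is a minimal illustration, not CLI code: `Outcome`, `classify`, and `duplicate_with_retry` are hypothetical names, and the two closures stand in for whatever the CI job uses to run the duplicate, count services, and delete an empty environment.

```rust
/// State of the target environment after one duplicate attempt.
#[derive(Debug, PartialEq)]
enum Outcome {
    Duplicated, // environment exists and has at least one service
    Husk,       // environment exists but has zero services
    NotCreated, // environment does not exist
}

/// Classify the result from two facts the caller can query after the run.
fn classify(env_exists: bool, service_count: usize) -> Outcome {
    match (env_exists, service_count) {
        (false, _) => Outcome::NotCreated,
        (true, 0) => Outcome::Husk,
        (true, _) => Outcome::Duplicated,
    }
}

/// Try up to `max_attempts` times, deleting any husk before the next try.
/// `attempt` runs the duplicate and reports the observed state;
/// `delete_husk` removes the empty environment.
/// Returns the 1-based number of the attempt that succeeded, if any.
fn duplicate_with_retry<A, D>(max_attempts: u32, mut attempt: A, mut delete_husk: D) -> Option<u32>
where
    A: FnMut() -> Outcome,
    D: FnMut(),
{
    for n in 1..=max_attempts {
        match attempt() {
            Outcome::Duplicated => return Some(n),
            Outcome::Husk => delete_husk(),
            Outcome::NotCreated => {}
        }
    }
    None
}
```

Bounding the attempts keeps a persistently slow backend from looping forever; the cap itself is a judgment call for the pipeline.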

Environment

CLI: 4.65.0 (latest; also seen on 4.64.0), installed via npm install -g @railway/cli
Backend: https://backboard.railway.com/graphql/v2
Source env: 3 services (web, worker, Redis) + one ~1 GB Postgres volume

Evidence (real run, CLI 4.65.0)

> Environment name pr-843
> Duplicate from staging
Failed to fetch: error sending request for url (https://backboard.railway.com/graphql/v2)
Caused by:
    0: error sending request for url (https://backboard.railway.com/graphql/v2)
    1: operation timed out
Process completed with exit code 1.

The gap between "Duplicate from staging" and the error is 30.1 seconds — exactly the hardcoded client timeout.

Resulting backend state (queried right after): environment pr-843 exists (created ~20s in) with 0 service instances and no volume. On the same CLI version, a sibling run duplicated the same source successfully in ~6s. So this is backend config-apply latency vs. the 30s cap — not a malformed request or a client-version issue.

Root cause (source, v4.65.0)

src/client.rs build_client() — hardcoded, no override:

Client::builder()
    .default_headers(headers)
    .timeout(Duration::from_secs(30))   // used by post_graphql() for EVERY request
    .build()
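
One way the hardcoded value could be made overridable is sketched below, using only the standard library. The `RAILWAY_HTTP_TIMEOUT` variable is hypothetical (it is the name proposed in this report, not an existing CLI setting), and `timeout_from_override` and `http_timeout` are illustrative names, not functions in the CLI source.

```rust
use std::time::Duration;

/// Default matches the 30-second cap quoted above.
const DEFAULT_TIMEOUT_SECS: u64 = 30;

/// Parse a timeout override given in whole seconds.
/// Falls back to the default when the value is missing, not a number, or zero.
fn timeout_from_override(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|&s| s > 0)
        .unwrap_or(DEFAULT_TIMEOUT_SECS);
    Duration::from_secs(secs)
}

/// Read the hypothetical RAILWAY_HTTP_TIMEOUT variable (an assumption, not a real setting).
fn http_timeout() -> Duration {
    timeout_from_override(std::env::var("RAILWAY_HTTP_TIMEOUT").ok().as_deref())
}
```

Under that assumption, the builder would call `.timeout(http_timeout())` in place of `.timeout(Duration::from_secs(30))`. Treating zero and unparsable values as "use the default" avoids turning a typo into an instant timeout.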

src/commands/environment/new.rs new_environment() — non-atomic:

// Step 1: create empty env (source_id: None — backend's atomic-duplicate path is NOT used)
let response = post_graphql::<EnvironmentCreate, _>(...).await?;  // empty env now EXISTS
let env_id = response.environment_create.id.clone();

if let Some(source_env_id) = duplicate_id {
    let source_config = fetch_environment_config(...).await?;     // 30s cap
    let source_instances = get_environment_instances(...).await?; // 30s cap
    apply_environment_config(&client, &configs, &env_id, ...).await?;  // 30s cap — copies services + volume
}
// On timeout/error here, the env from Step 1 is never cleaned up -> husk.

The EnvironmentCreate mutation already accepts a source_id (atomic server-side duplicate), but the CLI passes None and reimplements duplication client-side across multiple round-trips — which is what creates the partial-failure/husk window.
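
If the client-side flow is kept, the cleanup-on-failure behaviour could look roughly like this. It is a synchronous, standard-library-only sketch: the `Backend` trait, its method names, and `FakeBackend` are hypothetical stand-ins for the CLI's real async GraphQL calls, used only to show the control flow.

```rust
/// Hypothetical abstraction over the backend operations involved;
/// the names are illustrative and do not match the CLI's real functions.
trait Backend {
    fn create_empty_env(&mut self, name: &str) -> Result<String, String>;
    fn copy_config(&mut self, source_id: &str, target_id: &str) -> Result<(), String>;
    fn delete_env(&mut self, env_id: &str) -> Result<(), String>;
}

/// Create an environment and copy the source config into it.
/// If the copy fails (including a timeout), delete the new environment
/// so that no empty husk is left behind.
fn duplicate_with_rollback<B: Backend>(
    backend: &mut B,
    name: &str,
    source_id: &str,
) -> Result<String, String> {
    let env_id = backend.create_empty_env(name)?;
    if let Err(copy_err) = backend.copy_config(source_id, &env_id) {
        // Best-effort cleanup; report both errors if the delete also fails.
        return match backend.delete_env(&env_id) {
            Ok(()) => Err(copy_err),
            Err(del_err) => Err(format!("{}; cleanup also failed: {}", copy_err, del_err)),
        };
    }
    Ok(env_id)
}

/// In-memory stand-in used only to exercise the sketch.
struct FakeBackend {
    envs: Vec<String>,
    fail_copy: bool,
}

impl Backend for FakeBackend {
    fn create_empty_env(&mut self, name: &str) -> Result<String, String> {
        self.envs.push(name.to_string());
        Ok(name.to_string())
    }
    fn copy_config(&mut self, _source_id: &str, _target_id: &str) -> Result<(), String> {
        if self.fail_copy {
            Err("operation timed out".to_string())
        } else {
            Ok(())
        }
    }
    fn delete_env(&mut self, env_id: &str) -> Result<(), String> {
        self.envs.retain(|e| e != env_id);
        Ok(())
    }
}
```

A client-side timeout does not prove the server stopped working, so a real implementation would also need to make sure the delete does not race with an apply that is still in flight; that is not modeled here.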

Requests

Make duplicate atomic, or clean up on failure. If any step after the empty-env creation fails (including a timeout), roll back / delete the partial environment instead of leaving a husk — or use the backend's atomic environmentCreate(sourceEnvironmentId) path. This is the real fix: a longer timeout alone still orphans environments when the copy fails partway.
Make the client timeout configurable (RAILWAY_HTTP_TIMEOUT env var or --timeout flag). 30s is too short to duplicate a multi-service + ~1 GB-volume environment, and there's currently no escape hatch.
Investigate the config-apply latency — duplicating this environment intermittently exceeds 30s where it used to complete in single-digit seconds.

note: I selected a service below as an example, but this actually applies to all services

Screenshot 2026-05-29 at 10.07.17 AM.png

Solved

9 Replies

Status changed to Awaiting Railway Response Railway • 25 days ago

mykal

EMPLOYEE

25 days ago

Hey! Sorry about the impact here.

Nothing has changed API-surface-wise, but we can see there has been a slight uptick in 504s on service-creation API calls as of late.

I've escalated this to our product team so they can take a look; hopefully we'll have a resolution for you soon. I'll also forward this to our CLI team to see if they can make transactions like this atomic.

Let me know if there's anything else I can help with while the team looks into it further.

Status changed to Awaiting User Response Railway • 25 days ago

freemarmoset

PRO OP

23 days ago

Hi, thanks for the reply. Just FYI, I also posted this as an issue for the CLI team: https://github.com/railwayapp/cli/issues/923

I wasn't sure if it was a CLI issue or an infrastructure issue.

I will just note, though, that this is more than a slight uptick for our team personally. I'd say it happens the majority of the time right now. Roughly 75% of our initial attempts fail, and then deleting the environment and retrying will fix it on the second or third try.

Status changed to Awaiting Railway Response Railway • 23 days ago

mykal

EMPLOYEE

23 days ago

Hey, I'm sorry you're seeing such an elevated error rate here.

Thanks for opening up that issue over on the CLI as well.

As mentioned previously, I have forwarded this to the team and they're looking into the root cause. We'll be sure to reach back out when we have more information.

Again, sorry for the impact. Hopefully I can push over more information soon.

Status changed to Awaiting User Response Railway • 23 days ago

pol

PRO

20 days ago

Hey, we are seeing the same issue on our side. Is there any update on this? A timeout override as a flag would be a great workaround for us as well.

Status changed to Awaiting Railway Response Railway • 20 days ago

mark-antal-csizmadia

HOBBY

20 days ago

Same here, elevated error levels for this