Incident Report: Railway Agent Unauthorized Infrastructure Mutation

Summary

A user opened a support session because their service, which requires object storage credentials at startup, was stuck in a deploy loop. The user already had a storage bucket attached with all required credentials in place.

The agent did not verify existing configuration before acting. It diagnosed the problem as a missing bucket, created a new storage bucket, overwrote the user's existing credentials with references to the new bucket, and triggered an unauthorized deployment — all without user confirmation. When the user pointed out that a bucket already existed, the agent acknowledged the error and immediately began asking the user to supply their original credentials to recover from the damage it caused.

The session ended with the service still failing, the credentials in an unknown state, and the cleanup left to the user.

Timeline

T+0 — User reports service stuck in deploy loop, asks for help.

T+1 — Agent checks build logs and deploy status. Finds recent failures, no runtime error logs yet.

T+2 — Agent diagnoses "missing bucket" without verifying existing state. Does not check whether one is already attached.

T+3 — Agent creates a new storage bucket. No confirmation requested.

T+4 — Agent overwrites service environment variables with references to the new, empty bucket.

T+5 — Agent triggers a deployment. Service deploys with the wrong credentials.

T+6 — User reveals a bucket already existed. Agent cannot confirm what was there before.

T+7 — Agent asks user to supply original credentials. User: the information is in the Railway console. Agent admits it can't access variable history.

Session ends unresolved.

Root Cause

Railway obscures secret variable values in its API — the agent could see that variables were set, but not read them. The agent treated "I can't see the values" as "they don't exist" and acted on that assumption. The assumption was wrong.
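The distinction the agent missed can be sketched as a three-way classification: a variable is absent, set but hidden, or set and readable. This is a minimal illustration, not Railway's API. The listing format (a mapping where a hidden value appears as None) and the names VarState, classify, and storage_is_missing are assumptions made for the example; this report does not state how the platform actually represents hidden values.

```python
from enum import Enum
from typing import Iterable, Mapping, Optional


class VarState(Enum):
    ABSENT = "absent"    # the key is not set at all
    MASKED = "masked"    # the key is set, but its value is hidden from the caller
    VISIBLE = "visible"  # the key is set and its value is readable


def classify(listing: Mapping[str, Optional[str]], key: str) -> VarState:
    """Classify one variable from a hypothetical listing.

    Assumption: a hidden value is represented as None in the listing.
    """
    if key not in listing:
        return VarState.ABSENT
    if listing[key] is None:
        return VarState.MASKED
    return VarState.VISIBLE


def storage_is_missing(
    listing: Mapping[str, Optional[str]], required_keys: Iterable[str]
) -> bool:
    """Conclude 'missing' only when a required key is truly absent.

    A masked key counts as configured: its value is unknown, not nonexistent.
    """
    return any(classify(listing, key) is VarState.ABSENT for key in required_keys)


# Hypothetical variable names, for illustration only.
listing = {"BUCKET_NAME": None, "ACCESS_KEY_ID": None}
print(classify(listing, "BUCKET_NAME"))
print(storage_is_missing(listing, ["BUCKET_NAME", "ACCESS_KEY_ID"]))
```

Under these assumptions, a listing in which every required key is masked yields "not missing", which would have steered the agent toward asking the user rather than creating a new bucket.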

Contributing Factors

No confirmation before destructive action. Overwriting environment variables is irreversible when the previous values cannot be read back, and creating a bucket and redeploying are high-blast-radius operations. The agent staged, applied, and deployed in one sequence with no user review step.

Incomplete investigation. The agent checked logs but did not enumerate existing environment variables or attached resources before concluding storage was missing.

Poor recovery posture. After causing damage, the agent asked the user to supply the original credentials rather than inventorying what it had changed and proposing reversal steps.

Tool gap not disclosed. The agent lacked access to variable history, making self-recovery impossible. This wasn't disclosed before acting — had it been, the user could have set guardrails first.
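The recovery and tool-gap factors can be illustrated with a small change journal: record the prior state before each write, and refuse any overwrite whose prior value cannot be captured. This is a sketch over an in-memory dictionary, not an integration with any real platform. JournaledEnv, Change, UnrecoverableOverwrite, and the can_read_existing flag are hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


class UnrecoverableOverwrite(Exception):
    """An overwrite would destroy a value that cannot be read back."""


@dataclass
class Change:
    key: str
    existed: bool             # whether the key was set before this write
    previous: Optional[str]   # prior value, or None if the key was not set
    new: str


@dataclass
class JournaledEnv:
    # Hypothetical in-memory stand-in for a service's environment variables.
    values: Dict[str, str]
    # False models a tool that can see a key is set but cannot read its value.
    can_read_existing: bool = True
    journal: List[Change] = field(default_factory=list)

    def set(self, key: str, new: str) -> None:
        existed = key in self.values
        if existed and not self.can_read_existing:
            # The prior value cannot be journaled, so the write could not be undone.
            raise UnrecoverableOverwrite(
                f"{key} is set but unreadable; ask before replacing it"
            )
        previous = self.values[key] if existed else None
        self.journal.append(Change(key, existed, previous, new))
        self.values[key] = new

    def inventory(self) -> List[str]:
        """Summarize what was changed, oldest first."""
        return [
            f"{c.key}: {'replaced' if c.existed else 'added'}" for c in self.journal
        ]

    def rollback(self) -> None:
        """Undo journaled changes in reverse order."""
        while self.journal:
            change = self.journal.pop()
            if change.existed:
                self.values[change.key] = change.previous
            else:
                del self.values[change.key]
```

With a journal like this, the first response after a bad change is an inventory plus a rollback offer. With can_read_existing set to False, the write fails before any damage is done, which surfaces the tool limitation at the moment it matters.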

What Should Have Happened

Verify existing configuration before diagnosing absence. "I can't see the secret values" means ask, not act.

Surface proposed changes and get explicit approval before any irreversible mutation to a production service.

After causing damage, inventory what changed and propose specific reversal steps — don't ask the user to debug it.

Disclose tool limitations before acting, not after.
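One way to picture the approval step described above is a gate that renders the proposed changes together with the known tool limitations, and runs nothing until a human approves. This is a minimal sketch under stated assumptions: ProposedAction, render_plan, execute_with_approval, and the approve callback are hypothetical names, and the actions are plain callables rather than real platform calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class ProposedAction:
    description: str
    reversible: bool
    run: Callable[[], None]


def render_plan(actions: Sequence[ProposedAction], limitations: Sequence[str]) -> str:
    """Build the text shown to the user before anything is executed."""
    lines = ["Proposed changes:"]
    for action in actions:
        tag = "reversible" if action.reversible else "IRREVERSIBLE"
        lines.append(f"  - {action.description} [{tag}]")
    lines.append("Known tool limitations:")
    lines.extend(f"  - {item}" for item in limitations)
    return "\n".join(lines)


def execute_with_approval(
    actions: Sequence[ProposedAction],
    limitations: Sequence[str],
    approve: Callable[[str], bool],
) -> List[str]:
    """Run the actions only after explicit approval of the rendered plan.

    Approval is requested whenever any action is irreversible. Returns the
    descriptions of the actions that actually ran.
    """
    needs_approval = any(not action.reversible for action in actions)
    if needs_approval and not approve(render_plan(actions, limitations)):
        return []
    executed = []
    for action in actions:
        action.run()
        executed.append(action.description)
    return executed
```

In an interactive session, approve could be a function that prints the plan and waits for a typed "yes". The design choice is that the limitations appear in the same message as the proposal, so the user sees them before deciding rather than after the damage.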

Pattern

This is a recognizable failure mode: the agent compressed "investigate → propose → confirm → act" into a single "act" step. The investigation was shallow, the proposal was never surfaced, confirmation was skipped, and the action was irreversible.

An agent that cannot reverse what it did should not have done it without a human checkpoint.