Guardrail: Deployment RCA - Retry Caps, Path Lock, Bootstrap Immutability
Rule: When a deployment fails, extract the failure signature before retrying. After repeated failure of the same class, switch to RCA-only mode. Do not mutate bootstrap-owned state during incident execution.
Retry Cap Guardrail
Repeating a failed deployment without extracting the failure signature causes execution drift. The deployment context accumulates stale state, subsequent failures become harder to diagnose, and retry loops mask the root cause.
Enforcement:
- First failure: Extract the stage name and exact error signature before any rerun.
- Second failure of the same class: Stop executing. Route to RCA-only mode.
- RCA-only mode: All deployment commands are suspended. The session goal is diagnosis, not re-execution.
To transition back to execution mode, the user must:
- Confirm the root cause is identified.
- Confirm the remediation step is distinct from prior attempts.
- Re-issue a `deploy:` intent explicitly to resume execution.
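The enforcement rules above amount to a small state machine. The following is a minimal Python sketch of that idea, not an existing implementation: the `RetryGuard` class, the `Mode` enum, and the choice to represent a failure class as a `(stage, signature)` pair are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Mode(Enum):
    EXECUTION = "execution"
    RCA_ONLY = "rca-only"


@dataclass
class RetryGuard:
    """Hypothetical tracker for the retry cap; not part of any real tool."""

    mode: Mode = Mode.EXECUTION
    last_failure: Optional[Tuple[str, str]] = None

    def record_failure(self, stage: str, signature: str) -> Mode:
        """Record a failure; a second failure of the same class enters RCA-only mode."""
        failure = (stage, signature)
        if self.last_failure == failure:
            self.mode = Mode.RCA_ONLY
        self.last_failure = failure
        return self.mode

    def may_deploy(self) -> bool:
        """Deployment commands are allowed only in execution mode."""
        return self.mode is Mode.EXECUTION

    def resume(self, root_cause_confirmed: bool, remediation_is_new: bool,
               deploy_intent_reissued: bool) -> Mode:
        """Return to execution mode only when all three conditions hold."""
        if root_cause_confirmed and remediation_is_new and deploy_intent_reissued:
            self.mode = Mode.EXECUTION
            self.last_failure = None
        return self.mode
```

Requiring all three flags in `resume` mirrors the list above: confirming the root cause alone does not unlock execution.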
Identifying a Repeated Failure
A failure is "repeated" when any of the following are true:
- The same stage or resource fails in consecutive attempts.
- The error signature (exit code, provider error code, resource type) matches the prior failure.
- The user message contains `rca`, `stop and rca`, `why is it failing`, `same error`, or `still broken`.
- The deployment has been retried three or more times without a confirmed root cause between attempts.
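The four conditions above can be combined into a single check. This is a rough sketch under stated assumptions: `FailureSignature` and `is_repeated_failure` are hypothetical names, and the phrase match is a naive substring test that would also fire on words such as "circa".

```python
from dataclasses import dataclass
from typing import Optional

# Phrases from the list above that mark a user message as an RCA signal.
RCA_PHRASES = ("rca", "stop and rca", "why is it failing", "same error", "still broken")


@dataclass(frozen=True)
class FailureSignature:
    """Hypothetical record of one failed attempt."""

    stage: str
    exit_code: int
    provider_error_code: str
    resource_type: str


def is_repeated_failure(previous: Optional[FailureSignature],
                        current: FailureSignature,
                        user_message: str = "",
                        unexplained_retries: int = 0) -> bool:
    """Return True when any of the listed 'repeated failure' conditions holds."""
    same_stage = previous is not None and previous.stage == current.stage
    same_signature = previous is not None and (
        (previous.exit_code, previous.provider_error_code, previous.resource_type)
        == (current.exit_code, current.provider_error_code, current.resource_type)
    )
    message = user_message.lower()
    user_signal = any(phrase in message for phrase in RCA_PHRASES)
    return same_stage or same_signature or user_signal or unexplained_retries >= 3
```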
Path Hygiene Guardrail
Deployments running in an incorrect workspace directory produce silent drift: scripts reference wrong paths, state files diverge, and infrastructure changes land in the wrong environment.
At session start for any in-place deployment session:
- Confirm the canonical repo path from bootstrap configuration or the `BASECOAT_REPO` environment variable.
- Verify the session working directory matches that canonical path.
- If a second checkout of the same repo exists at a different path, warn before proceeding.
Block or warn when:
| Condition | Action |
|---|---|
| Session path differs from canonical path and another checkout exists | Warn: require explicit user confirmation to continue |
| `BASECOAT_REPO` or equivalent env var is unset | Warn: path hygiene cannot be verified |
| Git remote origin differs from expected repository | Block: session may be operating on the wrong repo |
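The table above maps directly onto a small decision function. The sketch below is illustrative only: `check_path_hygiene` and its parameters are hypothetical, and the case of a mismatched path with no second checkout is not covered by the table, so the sketch assumes it passes.

```python
from pathlib import Path
from typing import Optional


def check_path_hygiene(session_dir: str,
                       canonical_repo: Optional[str],
                       other_checkout_exists: bool,
                       remote_origin: str,
                       expected_origin: str) -> str:
    """Return "block", "warn", or "ok" following the condition table."""
    # A wrong remote is the only blocking condition, so test it first.
    if remote_origin != expected_origin:
        return "block"
    # Without a canonical path there is nothing to compare against.
    if canonical_repo is None:
        return "warn"
    same_path = Path(session_dir).resolve() == Path(canonical_repo).resolve()
    if not same_path and other_checkout_exists:
        return "warn"
    # Assumption: a mismatch with no second checkout is not flagged by the table.
    return "ok"
```

A caller would typically pass `os.environ.get("BASECOAT_REPO")` as `canonical_repo`, which yields `None` when the variable is unset.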
Bootstrap Immutability Protocol
During an incident or active deployment session, bootstrap-owned variables and secrets are treated as immutable. Changing them mid-execution shifts the environment state, invalidates prior failure context, and risks the next retry running against a different configuration than the one that failed.
During incident or deployment execution, do not:
- Modify GitHub Actions secrets, environment variables, or OIDC configurations owned by bootstrap.
- Re-run bootstrap workflows to rotate credentials unless the user explicitly requests re-bootstrap.
- Change `azure.yaml`, `.env`, or environment-specific `infra/` files that bootstrap manages.
Prefer instead:
- Local or ephemeral overrides that do not persist beyond the session.
- Explicit re-bootstrap only after the incident is resolved and the session is in a clean state.
Permitted exceptions (require explicit user request with stated reason):
- Rotating a compromised secret during an active security incident.
- Correcting a misconfigured bootstrap value that is the confirmed root cause of the current failure.
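One way to picture the protocol is as a guard consulted before any file write. The following sketch is an assumption-heavy illustration: `is_bootstrap_owned` and `may_modify` are hypothetical helpers, and treating everything under `infra/` as bootstrap-owned is a simplification of "environment-specific `infra/` files that bootstrap manages".

```python
from pathlib import PurePosixPath

# Paths named in the list above; the real set depends on the bootstrap setup.
PROTECTED_FILES = {"azure.yaml", ".env"}
PROTECTED_DIRS = {"infra"}


def is_bootstrap_owned(repo_relative_path: str) -> bool:
    """Return True for files the sketch treats as bootstrap-managed."""
    path = PurePosixPath(repo_relative_path)
    if not path.parts:
        return False
    return str(path) in PROTECTED_FILES or path.parts[0] in PROTECTED_DIRS


def may_modify(repo_relative_path: str,
               incident_active: bool,
               user_requested: bool = False,
               stated_reason: str = "") -> bool:
    """Allow changes to bootstrap-owned files mid-incident only as a stated exception."""
    if not incident_active or not is_bootstrap_owned(repo_relative_path):
        return True
    # Permitted exceptions need an explicit user request with a non-empty reason.
    return user_requested and bool(stated_reason.strip())
```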
Firewall Discipline: Non-Optional Sequence
"Disable the firewall" is not a valid troubleshooting step. Weakening network controls to unblock a deployment introduces a security exposure that outlasts the incident.
Required sequence (non-negotiable):
- Capture current firewall rules.
- Add the runner IP using the `az` CLI (not Terraform, which causes state drift).
- Perform the deployment work.
- Remove the runner IP in an `always()` cleanup step.
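The sequence above can be sketched in Python, where `try`/`finally` plays the role that an `always()` step condition plays in a workflow. This is an illustration under assumptions, not the documented pattern: it assumes the firewall in question belongs to a storage account, and `run_az` and `deploy_with_firewall_window` are hypothetical helpers. The referenced instructions file remains the authoritative YAML template.

```python
import subprocess
from typing import Callable, List


def run_az(args: List[str]) -> str:
    """Run an az CLI command and return stdout; raises on a non-zero exit."""
    result = subprocess.run(["az", *args], check=True, capture_output=True, text=True)
    return result.stdout


def deploy_with_firewall_window(resource_group: str,
                                account_name: str,
                                runner_ip: str,
                                deploy: Callable[[], None],
                                run: Callable[[List[str]], str] = run_az) -> str:
    """Capture rules, add the runner IP, deploy, then always remove the IP."""
    scope = ["--resource-group", resource_group, "--account-name", account_name]
    # Capture the current rules so the pre-deployment state is on record.
    rules_before = run(["storage", "account", "network-rule", "list", *scope])
    # Open a narrow window for the runner only.
    run(["storage", "account", "network-rule", "add", *scope, "--ip-address", runner_ip])
    try:
        deploy()
    finally:
        # Cleanup runs whether the deployment succeeded or raised.
        run(["storage", "account", "network-rule", "remove", *scope, "--ip-address", runner_ip])
    return rules_before
```

Passing `run` as a parameter keeps the sequence testable without invoking the real CLI.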
Forbidden troubleshooting shortcuts:
- `az storage account update --default-action Allow`, which disables the firewall entirely.
- Commenting out firewall rule steps to test without network restrictions.
- Leaving runner IPs in place after a deployment completes.
See `instructions/basecoat-60-workflow-ci-firewall.instructions.md` for the full pattern, YAML template, and anti-patterns.
RCA Routing Before Re-Execution
When a prompt contains any of the following signals, route to triage/audit mode before allowing any execution path:
| Signal | Action |
|---|---|
| Contains `rca` or `stop and rca` | Suspend execution. Enter RCA-only mode. Route to `@rca`. |
| Same error repeated two or more times | Force RCA step before next retry. |
| User asks "why" after a failure | Treat as RCA signal. Do not retry until root cause is stated. |
| Deployment retried three or more times | Hard block on further retries. Require `rca:` prefix to diagnose. |
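The routing table can be read as an ordered set of checks. The sketch below is one possible reading, with assumptions stated: `route_prompt` and its return labels are hypothetical, and the table does not define precedence, so the sketch checks for `rca` first so that an `rca:` prompt still reaches diagnosis after the hard block applies.

```python
import re


def route_prompt(prompt: str,
                 same_error_count: int = 0,
                 retry_count: int = 0,
                 previous_attempt_failed: bool = False) -> str:
    """Map a prompt and session counters to a routing decision."""
    text = prompt.lower()
    # "stop and rca" also contains the word "rca", so one pattern covers both.
    if re.search(r"\brca\b", text):
        return "rca-only"
    if retry_count >= 3:
        return "hard-block"
    if same_error_count >= 2:
        return "force-rca"
    if previous_attempt_failed and re.search(r"\bwhy\b", text):
        return "rca-signal"
    return "execute"
```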
References
- Deployment cancellation pre-flight: `deployment-cancellation.md`
- Firewall pattern: `instructions/basecoat-60-workflow-ci-firewall.instructions.md`
- Intent routing: `instructions/basecoat-10-core-intent-routing.instructions.md`
- Bootstrap structure: `instructions/basecoat-10-core-bootstrap-structure.instructions.md`
- CI concurrency: `ci-concurrency.md`