The multi-agent coding OS that makes AI agents ship reliable software. Deploy autonomous squads that architect, write, and audit production-ready systems.
Becky is a multi-agent coding OS that combines role-based pipelines, an agent-maintained knowledge base, closed learning loops, and dual-runtime coordination between Claude and Codex.
Instead of one monolithic AI assistant that forgets everything between sessions, Becky deploys a squad of 7 specialized agents governed by 10 immutable rules. Every mistake becomes a rule. Every rule compounds intelligence. Each project leaves behind rules and wiki knowledge that the next one starts from.
Legacy AI coding architectures collapse under agentic hallucinations and fragmented memory. These are the six failure modes Becky was built to eliminate.
Agents that mark their own homework, leading to systemic bias and recursive logic failures that human developers cannot audit.
Loss of critical project context during long development cycles, forcing agents to hallucinate architecture that does not exist.
Desynchronization between local and production environments causing "works on my agent" bugs that haunt deployment pipelines.
Hidden errors that compound across multi-agent systems until the entire codebase collapses under invisible technical debt.
Fake progress reports hiding incomplete logic. 100% completion metrics that mask non-functional or unoptimized code blocks.
Systems that repeat the same errors without memory or improvement. Every crash is a new experience instead of a learned lesson.
Seven immutable core protocols that form the neural architecture of every Becky-powered project.
10 immutable rules that govern all agent behavior. Mistakes become rules. Rules compound intelligence across every project.
7 specialized agents with distinct roles, personas, and chain-of-thought prompts. Each owns a domain and produces typed artifacts.
Persistent wiki pages per project domain. Agents read before acting and write after learning. Knowledge survives session boundaries.
Greenfield mode runs the full role-based pipeline from brief to deploy. Brownfield mode indexes existing code and slots agents into the gaps.
MEMORY.md auto-captures decisions, blockers, and architecture choices. Every session starts with full context, zero cold-start amnesia.
Build-Verify-Learn-Encode. Every incident creates a retrospective. Every retro distills into a rule. Rules prevent recurrence.
Claude handles architecture, planning, and complex reasoning. Codex handles bulk implementation, testing, and file operations. The bridge keeps them synchronized through shared wiki, rules, and artifact contracts.
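The division of labor between the two runtimes can be pictured as a simple routing table. The following is a minimal sketch, not Becky's actual bridge: the `Runtime` enum, the `ROUTING` table, the `route` function, and the task-kind labels are all hypothetical names chosen to mirror the responsibilities listed above.

```python
from enum import Enum


class Runtime(Enum):
    CLAUDE = "claude"
    CODEX = "codex"


# Hypothetical task-kind labels, derived from the responsibilities
# described above. The real bridge may classify work differently.
ROUTING = {
    "architecture": Runtime.CLAUDE,
    "planning": Runtime.CLAUDE,
    "complex_reasoning": Runtime.CLAUDE,
    "bulk_implementation": Runtime.CODEX,
    "testing": Runtime.CODEX,
    "file_operations": Runtime.CODEX,
}


def route(task_kind: str) -> Runtime:
    """Return the runtime assumed to own this kind of task."""
    try:
        return ROUTING[task_kind]
    except KeyError:
        # Fail loudly rather than guessing which runtime should take the task.
        raise ValueError(f"unknown task kind: {task_kind!r}") from None
```

Under these assumptions, `route("planning")` returns `Runtime.CLAUDE` and `route("testing")` returns `Runtime.CODEX`, while an unlisted task kind raises instead of silently defaulting to one runtime.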
Each agent owns a domain, produces typed artifacts, and is governed by the same 10 rules.
Fury: Owns the PRD. Runs elicitation, writes acceptance criteria, manages the product backlog and story prioritization.
Strange: Designs system architecture, ADRs, data models, and API contracts. Validates technical feasibility against the PRD.
Shuri: Creates UX specs, interaction patterns, design system tokens, and validates accessibility. Works from the PRD, not assumptions.
Stark: Implements stories, writes migrations, builds APIs and UI. Follows architecture contracts and acceptance criteria exactly.
Widow: Writes test plans, runs Playwright e2e tests, validates acceptance criteria. Tests like a human: clicks buttons, fills forms.
Heimdall: Independent verification agent. Never trusts self-reported completion. Validates DONE/VERIFIED/AUDITED against runtime evidence.
Watcher: Maintains wiki pages, distills retrospectives into rules, keeps MEMORY.md current. The institutional memory of the system.
Two modes, one protocol. Each step has a gate; nothing advances without verification.
Fury runs 4-5 elicitation methods, produces a product brief, then a full PRD with acceptance criteria for every feature.
Strange produces ADRs, data models, API contracts. Shuri produces interaction specs, design tokens, screen flows.
Epics are decomposed into stories with full context. Each story file contains everything an agent needs to implement it independently.
Stark implements, Widow validates with Playwright tests, Heimdall confirms DONE/VERIFIED/AUDITED. Learning loop encodes lessons.
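The gated steps above can be sketched as a loop that refuses to advance when a gate fails. This is an illustrative model only: `Step`, `run_pipeline`, the step names, and the gate checks are hypothetical and do not describe Becky's internal implementation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]   # produces artifacts from the shared context
    gate: Callable[[dict], bool]  # must pass before the next step may start


def run_pipeline(steps: list[Step], context: dict) -> tuple[list[str], dict]:
    """Run steps in order and stop at the first gate that fails."""
    completed: list[str] = []
    for step in steps:
        context = {**context, **step.run(context)}
        if not step.gate(context):
            break  # nothing advances without verification
        completed.append(step.name)
    return completed, context


# Hypothetical three-step run in which the design step produces no ADR.
steps = [
    Step("prd", lambda ctx: {"prd": "acceptance criteria"}, lambda ctx: "prd" in ctx),
    Step("design", lambda ctx: {}, lambda ctx: "adr" in ctx),
    Step("stories", lambda ctx: {"stories": []}, lambda ctx: True),
]
completed, final_context = run_pipeline(steps, {})
print(completed)  # ['prd']
```

Because the design gate fails, the stories step never runs, which is the behavior the "nothing advances without verification" rule implies.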
Three tiers of truth. No agent can claim DONE without runtime evidence. No exceptions.
| State | Evidence Required |
|---|---|
| DONE | Runtime artifact: API response, DB query result, browser screenshot proving the feature works |
| VERIFIED | Acceptance criteria reviewed line-by-line with code citations for each point |
| AUDITED | File existence confirmed, LOC counted, function names match spec |
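One way to read the table is as a lookup from claimed state to the evidence that must accompany it. The sketch below assumes each state is checked independently and uses hypothetical evidence keys (`runtime_artifact`, `criteria_review`, `file_check`); the `Claim` class and `accept` function are illustrative, not part of Becky's published interface.

```python
from dataclasses import dataclass, field

# Hypothetical evidence keys, one per row of the table above.
REQUIRED_EVIDENCE = {
    "DONE": {"runtime_artifact"},     # API response, DB query result, screenshot
    "VERIFIED": {"criteria_review"},  # line-by-line review with code citations
    "AUDITED": {"file_check"},        # files exist, LOC counted, names match spec
}


@dataclass
class Claim:
    state: str
    evidence: dict = field(default_factory=dict)


def accept(claim: Claim) -> bool:
    """Accept a claim only if every required piece of evidence is present and non-empty."""
    required = REQUIRED_EVIDENCE.get(claim.state)
    if required is None:
        raise ValueError(f"unknown state: {claim.state!r}")
    return all(claim.evidence.get(key) for key in required)
```

With this model, `accept(Claim("DONE"))` is rejected because no runtime artifact is attached, while a DONE claim that carries a non-empty `runtime_artifact` entry passes.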
Build. Verify. Learn. Encode. Every mistake becomes a rule. Rules compound intelligence.
Post-sprint review extracts what worked, what failed, and what to encode. Produces structured findings, not vague notes.
Every P0 produces a root cause analysis that becomes a rule in CLAUDE.md. The same failure pattern never ships twice.
When agents discover reusable patterns, Watcher distills them into wiki pages and rules. Knowledge compounds across projects.
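The encode step of the loop can be sketched as appending a rule to a rules file exactly once. The entry format, the `encode_rule` function, and its parameters are assumptions made for illustration; the source does not specify how rules are laid out in CLAUDE.md.

```python
from pathlib import Path


def encode_rule(rules_file: Path, incident: str, root_cause: str, rule: str) -> bool:
    """Append a rule derived from an incident; return False if the rule already exists."""
    # Hypothetical entry format: one bullet per rule with its provenance.
    entry = f"- {rule} (from: {incident}; root cause: {root_cause})"
    existing = rules_file.read_text(encoding="utf-8") if rules_file.exists() else ""
    if any(line.startswith(f"- {rule} (") for line in existing.splitlines()):
        return False  # the same rule is never recorded twice
    with rules_file.open("a", encoding="utf-8") as fh:
        if existing and not existing.endswith("\n"):
            fh.write("\n")
        fh.write(entry + "\n")
    return True
```

Calling the function a second time with the same rule text leaves the file unchanged, which keeps the rule list from filling with duplicates when the same failure pattern is reported more than once.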
```bash
git clone https://github.com/becky-os/becky-os.git
cd becky-os
npx becky init .   # installs slash commands + .becky/ workspace
npx becky scan .   # analyze existing codebase
npx becky run
```
Clone Becky, then run npx becky init . in your project. This copies 13 slash commands and creates a .becky/ workspace.
Run npx becky scan . to analyze your existing codebase, detect frameworks, and map what's done vs. pending.
Add your project rules to CLAUDE.md. Start with defaults or bring your own constraints.
Greenfield for new projects, Brownfield for existing codebases. Becky adapts the pipeline.
Agents self-coordinate. The loop runs. Rules compound. Ship production-ready software.