MAESTRO-SSOT: Managed Autonomy in Multi-Agent Software Engineering via Single Source of Truth
Research Proposal (v1.0)
Research Direction: AI Agent / Multi-Agent Systems (Application Layer)
Target Venues: ICSE, FSE (formerly ESEC/FSE), ASE, AAMAS, or other top-tier SE/AI conferences
Estimated Timeline: 6–9 months to first submission
1. Project Overview
1.1 Title
MAESTRO-SSOT: Managed Autonomy in Multi-Agent Software Engineering via Single Source of Truth and Guarded Execution
1.2 One-Sentence Summary
We propose a multi-agent software engineering system where specialized LLM-based agents collaborate around a shared, structured state (Single Source of Truth) and execute within a restrictive harness, enabling continuous autonomous development loops without human intervention while maintaining strict safety and observability guarantees.
1.3 Core Hypothesis
Explicit shared-state mediation (SSOT) combined with runtime guardrails (Agent Harness) enables multi-agent software engineering systems to achieve higher integration success rates and lower communication overhead than message-passing or pipeline-based alternatives, while remaining safe enough for unsupervised autonomous execution.
2. Background & Motivation
2.1 The Rise of Agentic Software Engineering
Recent advances in LLM-based coding agents have demonstrated impressive capabilities on individual software engineering tasks. Systems such as SWE-agent, Devin, and OpenHands decompose issue resolution into iterative planning, tool use, and execution cycles, and the strongest reported agent configurations exceed 70% on SWE-bench Verified.
However, these systems largely preserve a single-agent or loosely delegated execution model. Even recent multi-agent proposals (e.g., Agyn, AutoGen-based code teams) treat collaboration as message passing between isolated agents: tasks are dispatched, executed independently, and results are merged post-hoc.
2.2 The Hidden Cost of "Divide and Conquer"
Real-world software development is not merely task decomposition—it is a continuous process of negotiating shared understanding:
- A backend developer changes an API schema; the frontend developer must adapt
- A requirement engineer clarifies a constraint; architects and implementers must reconcile their mental models
- A test failure reveals an integration mismatch that spans multiple modules
Current multi-agent systems lack an explicit mechanism for maintaining consensus on evolving shared artifacts. Agents operate on private contexts, leading to:
- Integration failures: Agent A generates code assuming interface X; Agent B generates code assuming interface Y
- Redundant communication: Agents must repeatedly query each other to synchronize state
- Opacity: Humans cannot easily observe what each agent believes the current specification to be
- Unsafe autonomy: Without guardrails, an autonomous agent loop may waste tokens, corrupt files, or execute dangerous operations
2.3 Opportunity: From Message Passing to Shared State
We observe that human software teams solve these problems through shared artifacts: Git repositories, API specifications, project management boards, and design documents. These act as a single source of truth (SSOT) that all participants read from and write to.
We hypothesize that endowing multi-agent systems with an analogous explicit SSOT layer—combined with a restrictive execution harness—will yield substantial improvements in integration reliability, communication efficiency, and autonomous execution safety.
3. Problem Definition
3.1 Formal Problem Statement
Given a natural-language software requirement \(R\), a set of specialized agents \(\mathcal{A} = \{A_1, A_2, \ldots, A_n\}\), and a codebase \(C\), construct a system that:
- Maintains a shared structured state \(S\) (the SSOT) representing requirements, design contracts, execution plans, and current progress
- Allows agents to atomically read from and write to \(S\) under concurrency control
- Executes agent actions within a restricted harness \(H\) that enforces safety policies
- Runs an autonomous control loop that dispatches agents, validates outcomes, and terminates only when \(R\) is satisfied or an unrecoverable failure occurs
3.2 Key Research Questions
RQ1 (State Mediation): Does an explicit SSOT reduce integration failures and communication overhead compared to message-passing multi-agent architectures?
RQ2 (Safety): Can an Agent Harness provide sufficient isolation and guardrails to enable safe unsupervised autonomous execution without sacrificing task completion rates?
RQ3 (Autonomy Spectrum): What is the optimal trade-off between agent autonomy (continuous execution without human approval) and safety (restrictive guardrails)?
RQ4 (Scalability): How does the SSOT architecture scale with the number of agents and the complexity of the shared state?
4. Proposed System: MAESTRO-SSOT
4.1 Design Principles
- SSOT as a First-Class Citizen: The shared state is not a log or message bus—it is a structured, queryable, versioned artifact that agents actively manipulate
- Agent Harness as Policy Enforcement: Safety is not an afterthought; it is enforced at the action layer via explicit allowlists, budgets, and rollback triggers
- Set-and-Forget Autonomy: Once a human provides a requirement, the system runs until completion, failure, or an unambiguous escalation condition
- Observability by Design: The SSOT serves as a live dashboard for human supervisors
4.2 System Architecture
┌─────────────────────────────────────────────────────────────────────┐
│                          Human Supervisor                           │
│                  (injects requirements, monitors)                   │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                       SSOT Hub (Shared State)                       │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌───────────┐   │
│  │ Requirements │ │   Contract   │ │  Execution   │ │   Agent   │   │
│  │     Tree     │ │   Registry   │ │     Log      │ │  Memory   │   │
│  └──────────────┘ └──────────────┘ └──────────────┘ └───────────┘   │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │                     SSOT Access Control                      │   │
│  │      (read/write locks, versioning, conflict detection)      │   │
│  └──────────────────────────────────────────────────────────────┘   │
└──────────────────────────────┬──────────────────────────────────────┘
                               │ publish / subscribe / query
        ┌──────────────────────┼──────────────────────┐
        ▼                      ▼                      ▼
  ┌──────────────┐      ┌──────────────┐      ┌──────────────┐
  │   Planning   │      │  Execution   │      │    Review    │
  │    Agent     │◄────►│    Agents    │◄────►│    Agents    │
  │              │      │  (Backend,   │      │              │
  │  decomposes  │      │   Frontend,  │      │  validates   │
  │ requirements │      │   Test, etc.)│      │  contracts   │
  └──────────────┘      └──────┬───────┘      └──────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                     Agent Harness (Guard Layer)                     │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐    │
│  │ Sandbox     │ │ Permission  │ │ Budget      │ │ Rollback    │    │
│  │ (isolated   │ │ (ACL for    │ │ (token &    │ │ (undo on    │    │
│  │  execution) │ │  files/APIs)│ │  step caps) │ │  failure)   │    │
│  └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘    │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │               Action Validator (pre-execution)               │   │
│  └──────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                        Auto-Loop Controller                         │
│  while not done:                                                    │
│    1. PLAN   ──► decompose requirement via Planning Agent           │
│    2. ASSIGN ──► dispatch subtasks to Execution Agents              │
│    3. EXEC   ──► run actions through Harness                        │
│    4. VALID  ──► Review Agents check contracts & tests              │
│    5. COMMIT ──► update SSOT; on failure, rollback & retry          │
│    6. CHECK  ──► verify requirement satisfaction; loop or exit      │
└─────────────────────────────────────────────────────────────────────┘
4.3 The SSOT Hub: Detailed Design
The SSOT is not a key-value store. It is a semantically structured workspace with four primary domains:
4.3.1 Requirements Tree
- Hierarchical decomposition of the top-level requirement \(R\)
- Each node: (id, description, status, assignee, acceptance_criteria, dependencies)
- Status enum: PENDING | IN_PROGRESS | BLOCKED | REVIEW | DONE | FAILED
- Agents claim nodes, update status, and report completion
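The node structure and claim step listed above could be sketched roughly as follows. This is a minimal illustration, assuming Python dataclasses; the names `RequirementNode`, `RequirementsTree`, `walk`, `get`, and `claim` are hypothetical, and the rule that a node can only be claimed once its dependencies are DONE is an assumption about how `dependencies` would be used. Only the field names and status values come from the list above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Status(Enum):
    PENDING = "PENDING"
    IN_PROGRESS = "IN_PROGRESS"
    BLOCKED = "BLOCKED"
    REVIEW = "REVIEW"
    DONE = "DONE"
    FAILED = "FAILED"


@dataclass
class RequirementNode:
    id: str
    description: str
    status: Status = Status.PENDING
    assignee: Optional[str] = None
    acceptance_criteria: list[str] = field(default_factory=list)
    dependencies: list[str] = field(default_factory=list)
    children: list["RequirementNode"] = field(default_factory=list)


class RequirementsTree:
    def __init__(self, root: RequirementNode):
        self.root = root

    def walk(self):
        """Yield every node in depth-first, pre-order."""
        stack = [self.root]
        while stack:
            node = stack.pop()
            yield node
            stack.extend(reversed(node.children))

    def get(self, node_id: str) -> RequirementNode:
        for node in self.walk():
            if node.id == node_id:
                return node
        raise KeyError(node_id)

    def claim(self, node_id: str, agent_id: str) -> bool:
        """Assign a PENDING node to an agent if all its dependencies are DONE."""
        node = self.get(node_id)
        if node.status is not Status.PENDING:
            return False
        if any(self.get(d).status is not Status.DONE for d in node.dependencies):
            return False
        node.status, node.assignee = Status.IN_PROGRESS, agent_id
        return True
```

With a tree holding R0.1 and R0.2 (where R0.2 depends on R0.1), a frontend agent's claim on R0.2 would be refused until R0.1 reaches DONE.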
4.3.2 Contract Registry
- Explicit interface contracts between agents/modules
- Each contract: (contract_id, owner_module, consumer_modules, schema, version, status)
- Status enum: DRAFT | STABLE | DEPRECATED | BROKEN
- Critical invariant: Before an agent modifies an exported interface, it must update the contract. Consumers automatically detect version changes.
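One way the version-change invariant might be realized is sketched below. It assumes that each consumer records the contract version its code was last written against; the `consumer_versions` mapping and the methods `update_schema`, `acknowledge`, and `stale_consumers` are hypothetical names introduced for illustration, not part of the proposed schema.

```python
from dataclasses import dataclass, field


@dataclass
class Contract:
    contract_id: str
    owner_module: str
    schema: dict
    version: int = 1
    status: str = "DRAFT"  # DRAFT | STABLE | DEPRECATED | BROKEN
    # Assumption: consumer module -> contract version its code currently targets.
    consumer_versions: dict[str, int] = field(default_factory=dict)

    def update_schema(self, agent: str, new_schema: dict) -> int:
        """Owner publishes a new schema; the version bump is what consumers detect."""
        if agent != self.owner_module:
            raise PermissionError(f"{agent} does not own {self.contract_id}")
        self.schema = new_schema
        self.version += 1
        return self.version

    def acknowledge(self, consumer: str) -> None:
        """Consumer records that it has adapted to the current version."""
        self.consumer_versions[consumer] = self.version

    def stale_consumers(self) -> list[str]:
        """Consumers whose code still targets an older version."""
        return sorted(c for c, v in self.consumer_versions.items() if v < self.version)
```

In this sketch a rule-based checker would only need to call `stale_consumers()` after each epoch to find modules that must be re-dispatched.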
4.3.3 Execution Log
- Append-only record of all agent actions and their outcomes
- Entry: (timestamp, agent_id, action_type, target, input_hash, output_hash, status, error_if_any)
- Enables full reproducibility and forensic analysis
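A small sketch of an append-only log with the entry fields listed above. The choice of SHA-256 over canonical JSON for `input_hash` and `output_hash` is an assumption, as are the names `digest`, `LogEntry`, and `ExecutionLog`; the proposal does not fix a hashing scheme.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


def digest(payload: object) -> str:
    """Stable SHA-256 over a JSON-serializable payload (keys sorted)."""
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


@dataclass(frozen=True)  # frozen: an entry cannot be edited after it is written
class LogEntry:
    timestamp: str
    agent_id: str
    action_type: str
    target: str
    input_hash: str
    output_hash: str
    status: str
    error_if_any: Optional[str] = None


class ExecutionLog:
    def __init__(self):
        self._entries: list[LogEntry] = []

    def append(self, agent_id, action_type, target, inputs, outputs,
               status="SUCCESS", error=None) -> LogEntry:
        entry = LogEntry(
            timestamp=datetime.now(timezone.utc).isoformat(),
            agent_id=agent_id, action_type=action_type, target=target,
            input_hash=digest(inputs), output_hash=digest(outputs),
            status=status, error_if_any=error,
        )
        self._entries.append(entry)
        return entry

    def entries(self) -> tuple[LogEntry, ...]:
        """Read-only view: callers cannot reorder or drop history."""
        return tuple(self._entries)
```

Storing hashes rather than full payloads keeps the log compact; the payloads themselves would need to live elsewhere if exact replay is required.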
4.3.4 Agent Working Memory
- Short-term context that agents persist across steps (e.g., research notes, partial solutions)
- Distinct from execution log: this is the agent's "scratchpad," not the system's audit trail
4.3.5 SSOT Access Control & Concurrency
- Read: Any agent may query any domain (encourages transparency)
- Write: Agents acquire locks on the specific sub-tree or contract they intend to modify
- Conflict Detection: If two agents write to overlapping contracts within the same planning epoch, the second write is rejected and the agent is notified to reconcile
- Versioning: All domains support atomic snapshots, enabling rollback to any prior state
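The write-lock and same-epoch conflict rule above might look roughly like this. The sketch assumes that "overlapping" means writes to the same key (for example `contract:C1`), that locks are per key, and that a lock timeout stands in for deadlock detection; `SsotAccessControl`, `WriteConflict`, `write`, and `next_epoch` are hypothetical names.

```python
import threading


class WriteConflict(Exception):
    """Raised when a second agent writes a key already written this epoch."""


class SsotAccessControl:
    def __init__(self, lock_timeout_s: float = 5.0):
        self._locks: dict[str, threading.Lock] = {}
        self._guard = threading.Lock()       # protects the two dicts below
        self._written: dict[str, str] = {}   # key -> first writer this epoch
        self.lock_timeout_s = lock_timeout_s

    def _lock_for(self, key: str) -> threading.Lock:
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def write(self, agent_id: str, key: str, apply_change) -> None:
        lock = self._lock_for(key)
        if not lock.acquire(timeout=self.lock_timeout_s):
            raise TimeoutError(f"{agent_id} timed out waiting for {key}")
        try:
            with self._guard:
                first_writer = self._written.get(key)
            if first_writer is not None and first_writer != agent_id:
                raise WriteConflict(f"{key} already written by {first_writer} this epoch")
            apply_change()
            with self._guard:
                self._written[key] = agent_id
        finally:
            lock.release()

    def next_epoch(self) -> None:
        """Forget this epoch's writers so the next planning epoch starts clean."""
        with self._guard:
            self._written.clear()
```

A rejected writer receives the `WriteConflict` message, which names the first writer, so it knows which agent to reconcile with.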
4.4 Agent Harness: Detailed Design
The Harness wraps every agent action. It is not merely a Docker sandbox—it is a policy-aware action interceptor.
4.4.1 Sandbox Layer
- Agent code execution runs in isolated containers (Docker or e2b.dev cloud sandboxes)
- Network egress restricted to allowlisted domains (e.g., PyPI, npm registry)
- File system access limited to the project workspace + temporary directories
4.4.2 Permission Layer (ACL)
- Per-agent file access rules: backend_agent may write to src/api/ and tests/api/, but not src/ui/
- Per-agent tool allowlists: frontend_agent may call npm, vite, jest, but not docker or rm -rf
- Per-agent API rules: restrictions on external service calls (e.g., no production database connections)
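A minimal sketch of how the file and tool rules above could be checked, assuming glob-style patterns matched with the standard-library `fnmatch` module. The `POLICY` table and the functions `may_write` and `may_run` are hypothetical; the paths and tools mirror the examples in the list above, and deny rules are assumed to take precedence over allow rules.

```python
import posixpath
import shlex
from fnmatch import fnmatchcase

# Hypothetical in-memory policy; a real harness would load this from a policy file.
POLICY = {
    "backend_agent": {
        "files_allow": ["src/api/**", "tests/api/**"],
        "files_deny": ["src/ui/**"],
        "tools_allow": ["python", "pytest", "pip", "git"],
    },
    "frontend_agent": {
        "files_allow": ["src/ui/**", "tests/ui/**"],
        "files_deny": ["src/api/**"],
        "tools_allow": ["npm", "vite", "jest"],
    },
}


def may_write(agent_id: str, path: str) -> bool:
    """Deny wins over allow; anything not explicitly allowed is refused."""
    rules = POLICY.get(agent_id, {})
    path = posixpath.normpath(path)  # collapse "a/../b" before matching
    if any(fnmatchcase(path, pat) for pat in rules.get("files_deny", [])):
        return False
    return any(fnmatchcase(path, pat) for pat in rules.get("files_allow", []))


def may_run(agent_id: str, command: str) -> bool:
    """Allow a command only if its executable is on the agent's allowlist."""
    parts = shlex.split(command)
    allowed = POLICY.get(agent_id, {}).get("tools_allow", [])
    return bool(parts) and parts[0] in allowed
```

Because tools are allowlisted rather than denylisted, `docker` and `rm -rf` are refused for the frontend agent without being named explicitly.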
4.4.3 Budget Layer
- Token Budget: Max cumulative tokens per agent per task (prevents infinite LLM loops)
- Step Budget: Max number of tool calls per agent per epoch (prevents runaway execution)
- Time Budget: Max wall-clock time per subtask (prevents hanging processes)
- When a budget is exhausted, the agent is paused and the Auto-Loop either retries with a different strategy or escalates to the human
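The three budgets above could be tracked with a single counter object, sketched below. For brevity the sketch keeps one budget per agent and does not distinguish per-task from per-epoch scopes; `Budget`, `BudgetExceeded`, and `charge` are hypothetical names, and the caller is assumed to pause the agent when the exception is raised.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Carries the name of the exhausted budget: 'token', 'step', or 'time'."""


@dataclass
class Budget:
    max_tokens: int
    max_steps: int
    max_wall_s: float
    tokens: int = 0
    steps: int = 0
    started: float = field(default_factory=time.monotonic)

    def charge(self, tokens: int = 0, steps: int = 0) -> None:
        """Record usage, then raise if any cap is now exceeded."""
        self.tokens += tokens
        self.steps += steps
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("token")
        if self.steps > self.max_steps:
            raise BudgetExceeded("step")
        if time.monotonic() - self.started > self.max_wall_s:
            raise BudgetExceeded("time")
```

The harness would call `charge` after every LLM call and tool call, so the exception surfaces at the first action that crosses a cap.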
4.4.4 Rollback Layer
- Before any agent writes to the codebase, the Harness creates a snapshot (Git commit or overlayfs checkpoint)
- If subsequent validation (tests, contract checks) fails, the system automatically reverts the snapshot
- The failed attempt is logged in the Execution Log for learning/avoidance
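The snapshot-then-revert behavior above can be illustrated with a directory copy standing in for the Git commit or overlayfs checkpoint. This is a simplified stand-in, not the proposed mechanism: `snapshot`, `run_epoch`, and `ValidationFailed` are hypothetical names, and copying the whole workspace would be too slow for large repositories.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


class ValidationFailed(Exception):
    pass


@contextmanager
def snapshot(workspace: Path):
    """Copy the workspace aside; restore it if the body raises."""
    backup_root = Path(tempfile.mkdtemp(prefix="harness-snap-"))
    backup = backup_root / "ws"
    shutil.copytree(workspace, backup)
    try:
        yield
    except Exception:
        shutil.rmtree(workspace)
        shutil.copytree(backup, workspace)
        raise
    finally:
        shutil.rmtree(backup_root, ignore_errors=True)


def run_epoch(workspace: Path, write_code, validate) -> bool:
    """Apply agent writes, validate, and revert everything on failure."""
    try:
        with snapshot(workspace):
            write_code()
            if not validate():
                raise ValidationFailed("tests or contract checks failed")
        return True
    except ValidationFailed:
        # A real harness would also record the failed attempt in the Execution Log.
        return False
```

Only `ValidationFailed` is swallowed by `run_epoch`; any other exception still restores the workspace and then propagates to the caller.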
4.4.5 Action Validator (Pre-Execution)
Before an action reaches the sandbox, it passes through a fast rule-based validator:
- Does this action violate any ACL rule?
- Does this action modify a file outside the agent's assigned scope?
- Does this action delete more than N lines without explicit confirmation?
- Does this action match a known dangerous pattern (e.g., eval(), os.system() with user input)?
If validation fails, the action is rejected immediately with an explanatory error returned to the agent.
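A rough sketch of such a rule-based pre-filter follows. The `Action` fields, the `validate` function, and the regular expressions are assumptions made for illustration; the threshold of 50 deleted lines mirrors the value used in the policy appendix. The patterns are deliberately conservative: they flag any call to `eval(` or `os.system(`, since deciding whether user input reaches the call would need data-flow analysis.

```python
import re
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional

MAX_DELETED_LINES = 50  # the "N" in the rule above
DANGEROUS = [re.compile(p) for p in (r"\beval\s*\(", r"\bos\.system\s*\(", r"\brm\s+-rf\b")]


@dataclass
class Action:
    agent_id: str
    path: str
    new_text: str
    deleted_lines: int = 0
    confirmed: bool = False


def validate(action: Action, scope: list[str]) -> Optional[str]:
    """Return None if the action may proceed, else an explanatory error."""
    if not any(fnmatchcase(action.path, pat) for pat in scope):
        return f"{action.path} is outside the scope assigned to {action.agent_id}"
    if action.deleted_lines > MAX_DELETED_LINES and not action.confirmed:
        return f"deletes {action.deleted_lines} lines without confirmation"
    for pattern in DANGEROUS:
        if pattern.search(action.new_text):
            return f"matches dangerous pattern {pattern.pattern!r}"
    return None
```

Returning the reason as a string lets the harness hand it straight back to the agent as the explanatory error.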
4.5 Auto-Loop Controller: Detailed Design
The Auto-Loop is the orchestration brain. It is not an LLM itself; it is a deterministic state machine with LLM-powered decision points.
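from enum import Enum, auto


class State(Enum):
    PLAN = auto()
    ASSIGN = auto()
    EXEC = auto()
    VALID = auto()
    COMMIT = auto()
    CHECK = auto()
    DONE = auto()
    FAIL = auto()


class AutoLoop:
    """Deterministic controller; only the calls into agents are LLM-driven."""

    def __init__(self, ssot, harness, scheduler, planning_agent, reviewers,
                 contract_validator, test_runner, requirement_satisfier):
        self.ssot, self.harness, self.scheduler = ssot, harness, scheduler
        self.planning_agent, self.reviewers = planning_agent, reviewers
        self.contract_validator, self.test_runner = contract_validator, test_runner
        self.requirement_satisfier = requirement_satisfier
        self.state = State.PLAN

    def _retry_or_fail(self) -> State:
        # Both failure paths check the retry budget, so the loop always terminates.
        return State.FAIL if self.ssot.retry_budget_exhausted() else State.PLAN

    def run(self, requirement: str) -> State:
        self.ssot.initialize(requirement)
        while self.state not in (State.DONE, State.FAIL):
            match self.state:  # structural pattern matching, Python 3.10+
                case State.PLAN:
                    plan = self.planning_agent.decompose(self.ssot.requirements_tree)
                    self.ssot.requirements_tree.commit(plan)
                    self.state = State.ASSIGN
                case State.ASSIGN:
                    for task in self.ssot.requirements_tree.pending():
                        agent = self.scheduler.select_best_agent(task)
                        agent.claim(task)
                    self.state = State.EXEC
                case State.EXEC:
                    for agent in self.ssot.active_agents():
                        result = self.harness.run(agent, agent.next_action())
                        self.ssot.execution_log.append(result)
                    self.state = State.VALID
                case State.VALID:
                    # Build a list so every reviewer runs and can leave feedback.
                    reviews_ok = all([r.check() for r in self.reviewers])
                    contracts_ok = self.contract_validator.check_all()
                    tests_ok = self.test_runner.run_all()
                    if reviews_ok and contracts_ok and tests_ok:
                        self.state = State.COMMIT
                    else:
                        self.harness.rollback_all()  # epoch-level revert
                        self.state = self._retry_or_fail()  # replan with feedback
                case State.COMMIT:
                    self.ssot.snapshot()  # tag a clean state
                    self.state = State.CHECK
                case State.CHECK:
                    if self.requirement_satisfier.verify(self.ssot, requirement):
                        self.state = State.DONE
                    else:
                        self.state = self._retry_or_fail()  # refine remaining work
        return self.state

Key Design Decisions: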
- The loop is deterministic except for LLM calls inside agents; this makes debugging and reproducibility tractable
- Rollback is epoch-level, not single-action: if any validation fails, the entire epoch is reverted, ensuring the SSOT and codebase remain consistent
- Retry budget: The system is allowed N full-cycle retries before declaring failure and notifying the human
4.6 Agent Roles
We propose a minimal but sufficient set of agent roles:
| Role | Responsibility | SSOT Read Access | SSOT Write Access |
|---|---|---|---|
| Planning Agent | Decompose requirements, resolve ambiguities, detect contradictions | All | Requirements Tree |
| Backend Agent | Implement server-side logic, APIs, database schemas | All | Contract Registry (API contracts), Codebase (src/api/) |
| Frontend Agent | Implement UI components, client-side logic | All | Codebase (src/ui/), Contract Registry (consumes API contracts) |
| Test Agent | Write and run unit/integration tests, report coverage | All | Codebase (tests/), Execution Log |
| Review Agent | Check contract compliance, code quality, security issues | All | Execution Log (annotations) |
| Contract Validator | (Non-LLM rule engine) Verify that all consumers match their declared contract versions | Contract Registry | Contract Registry (status updates) |
5. Technical Implementation Strategy
5.1 Build vs. Reuse
We adopt a "thin skeleton + core novelty" strategy. We do not rebuild general-purpose agent infrastructure; we build the SSOT, Harness, and Auto-Loop layers atop proven foundations.
| Component | Reuse Decision | Rationale |
|---|---|---|
| LLM API abstraction | litellm | Unified interface for OpenAI, Anthropic, local models; 1-line provider switching |
| Agent base abstraction | PydanticAI (or AutoGen v0.4 Core) | Lightweight structured agent framework; supports typed tool calls and structured outputs |
| Code sandbox | e2b.dev (or Docker SDK) | Production-grade isolated execution without maintaining our own container infra |
| Git operations | GitPython | Battle-tested library for programmatic repo manipulation |
| AST parsing / code analysis | tree-sitter + Jedi | Extract API signatures for contract registry without writing custom parsers |
| Test execution | pytest / jest CLI wrappers | No reinvention needed |
5.2 Self-Built Components (Novelty Layer)
These are the components we implement from scratch and constitute the paper's core contribution:
- ssot/: The shared state layer (~800–1200 LOC)
  - Domain models: RequirementsTree, ContractRegistry, ExecutionLog, AgentMemory
  - Access control: LockManager, VersionManager, ConflictDetector
  - Persistence: JSON/YAML snapshotting with Git-backed versioning
- harness/: The guard layer (~600–1000 LOC)
  - Sandbox wrapper with ACL enforcement
  - Budget tracker (token, step, time)
  - Rollback orchestrator (Git snapshots or overlayfs)
  - Action validator (rule-based pre-filter)
- loop/: The autonomous controller (~400–600 LOC)
  - Deterministic state machine
  - Scheduler for agent-task assignment
  - Retry budget manager
  - Human escalation gateway
- agents/: Role-specific agents (~600–1000 LOC)
  - Prompt templates and tool bindings for each role
  - SSOT read/write integration
Total estimated codebase for the paper: roughly 2,400–3,800 lines of Python (the sum of the four component estimates above).
6. Experimental Design
6.1 Datasets & Benchmarks
We design experiments around two complementary evaluation sets:
6.1.1 Contract-Bench (Self-Constructed, ~80 tasks)
A novel benchmark specifically designed to test multi-module collaboration. Each task requires at least two agents to produce code that integrates correctly.
Example tasks:
- "Implement a REST API with JWT authentication and a React frontend that uses it. Include login/logout flows and protected route handling."
- "Build a Python CLI tool that reads from a SQLite database and exports CSV. The database schema must be created by a separate module and shared via a typed interface."
- "Implement a real-time chat WebSocket server and a simple HTML client. Messages must be persisted to Redis and broadcast to all connected clients."
Evaluation criteria per task:
- Compilation / dependency resolution success
- All integration tests pass
- Contract consistency verified (no caller invokes a non-existent endpoint)
- Code quality (linting, type checking)
6.1.2 SWE-bench Multi (Filtered Subset, ~50 tasks)
We filter SWE-bench Pro / SWE-EVO to retain only issues that:
- Touch ≥3 source files
- Involve cross-module changes (e.g., API change + consumer update)
- Have clear test criteria
This subset specifically rewards systems that can coordinate changes across modules.
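The three filter criteria could be applied with a predicate along these lines. The task fields `changed_files` and `fail_to_pass` are hypothetical, pre-extracted inputs (for example, derived from each task's gold patch and test list), and treating "cross-module" as "source files in at least two different directories" is only a heuristic assumption.

```python
from pathlib import PurePosixPath


def is_test_file(path: str) -> bool:
    p = PurePosixPath(path)
    return "tests" in p.parts or p.name.startswith("test_")


def module_of(path: str) -> str:
    """Heuristic: a file's module is its parent directory."""
    return str(PurePosixPath(path).parent)


def qualifies(task: dict) -> bool:
    sources = [f for f in task["changed_files"] if not is_test_file(f)]
    return (
        len(sources) >= 3                                  # touches >= 3 source files
        and len({module_of(f) for f in sources}) >= 2      # spans more than one module
        and len(task["fail_to_pass"]) > 0                  # has clear test criteria
    )
```

A stricter notion of cross-module change, such as an exported signature changing together with one of its callers, would require AST-level analysis of the patch.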
6.2 Baselines
| Baseline | Description | Why Compare |
|---|---|---|
| Single-Agent | SWE-agent style: one agent handles the entire task | Establishes whether multi-agent is beneficial at all |
| Agyn | State-of-the-art multi-agent SE (role-based pipeline, message passing) | Direct competitor; shows value of SSOT vs. message passing |
| AutoGen GroupChat | Default multi-agent conversation | Shows value of explicit harness and loop design |
| MAESTRO-SSOT (full) | Our complete system | Primary contribution |
| Ablations | Full system minus Harness / minus SSOT / minus auto-loop | Isolates contribution of each component |
6.3 Evaluation Metrics
6.3.1 Functional Correctness
- Pass@1: Task passes all tests on first completion
- Integration Success Rate: Percentage of multi-module tasks where all modules compile and interfaces align
- Contract Violation Rate: Fraction of cross-module interface calls in which Agent A's code calls an interface that does not match Agent B's implementation
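To make the three metrics concrete, the sketch below computes them from per-task records. The record fields (`first_attempt_passed`, `builds`, `declared`, `calls`) are hypothetical, and normalizing the Contract Violation Rate by the total number of cross-module calls is an assumption chosen so that the result is a percentage, as in the expected-results table.

```python
def violating_calls(declared: set, calls: list) -> list:
    """Calls whose (method, path) pair is not declared by the provider."""
    return [c for c in calls if c not in declared]


def summarize(tasks: list[dict]) -> dict:
    n = len(tasks)
    total_calls = sum(len(t["calls"]) for t in tasks)
    bad_calls = sum(len(violating_calls(t["declared"], t["calls"])) for t in tasks)
    integrated = sum(
        bool(t["builds"]) and not violating_calls(t["declared"], t["calls"])
        for t in tasks
    )
    return {
        "pass_at_1": sum(bool(t["first_attempt_passed"]) for t in tasks) / n,
        "integration_success_rate": integrated / n,
        "contract_violation_rate": bad_calls / total_calls if total_calls else 0.0,
    }
```

Here a task counts as an integration success only if every module builds and none of its cross-module calls violates a declared contract.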
6.3.2 Efficiency
- Total Tokens Consumed: Cumulative LLM tokens across all agents
- Communication Rounds: Number of inter-agent message exchanges (for message-passing baselines) vs. SSOT read/write operations
- Wall-Clock Time: End-to-end execution time
- Steps to Completion: Number of tool calls / planning iterations
6.3.3 Safety & Autonomy
- Unsupervised Completion Rate: Percentage of tasks completed without human intervention
- Rollback Frequency: How often the harness triggers automatic reversion
- Budget Exceeded Rate: How often agents hit token/step/time limits
- Dangerous Action Blocked Rate: Harness validator rejections
6.3.4 Observability (User Study, Optional)
- Time for a human to diagnose a failure using SSOT vs. message logs
- Subjective confidence in system status
6.4 Expected Results
| Metric | Single-Agent | Agyn | MAESTRO-SSOT |
|---|---|---|---|
| Pass@1 (Contract-Bench) | 30–40% | 50–60% | 65–75% |
| Integration Success Rate | 40–50% | 60–70% | 80–90% |
| Contract Violation Rate | N/A (single module) | 15–25% | <5% |
| Total Tokens (per task) | Baseline | 1.5–2× baseline | 1.0–1.2× baseline |
| Unsupervised Completion | 80%* | 60% (fragile) | 85% |
* Single-agent unsupervised completion is high because the task scope is narrower; it often fails silently on integration rather than explicitly requesting help.
7. Innovation & Contributions
7.1 Primary Contributions
Single Source of Truth for Multi-Agent SE
- To our knowledge, we are the first to propose and evaluate an explicit, structured, concurrently accessible shared state as the primary coordination mechanism for LLM-based software engineering agents. This moves beyond message passing to a "shared workspace" paradigm inspired by human team practices.
Agent Harness: Policy-Based Guarded Execution
- We introduce a unified guard layer that combines sandboxing, ACLs, budgets, and automatic rollback into a cohesive policy enforcement framework. Unlike existing code sandboxes (which only isolate execution), our Harness mediates what agents are allowed to do to each other and the shared state.
Autonomous Control Loop with Managed Failure
- We design a deterministic loop that enables continuous autonomous execution while defining clear boundaries for when the system should retry, rollback, or escalate. This provides a principled answer to "how much autonomy is safe?"
7.2 Secondary Contributions
Contract-Bench: A Benchmark for Multi-Module Agent Collaboration
- We release a dataset of software engineering tasks that explicitly require cross-module coordination, filling a gap in existing benchmarks that focus on isolated bug fixes.
Open-Source Reference Implementation
- We will release our system as a modular extension to existing frameworks (initially standalone with clear adapter interfaces), enabling the community to adopt SSOT and Harness principles in their own agent systems.
8. Related Work & Positioning
8.1 Single-Agent Coding Systems
SWE-agent, Devin, OpenHands, and CodeAct represent the state of the art in single-agent software engineering. They excel at sequential reasoning and tool use but are not designed for parallel, multi-module development. Our work is complementary: any of these agents could serve as an Execution Agent within our framework.
8.2 Multi-Agent SE Systems
Agyn (2026) is the closest competitor. It demonstrates that role-based multi-agent teams outperform solo agents on SWE-bench. However, Agyn uses asynchronous message passing and does not explicitly model shared interfaces or provide runtime guardrails. Our SSOT and Harness layers address precisely these omissions.
FlowGen (2025) simulates classic software engineering methodologies (Scrum, Waterfall) with multi-agent workflows. While process-aware, FlowGen uses fixed, pre-defined workflows. Our Auto-Loop is adaptive: it replans based on validation failures and contract changes.
AutoGen and CrewAI provide general-purpose multi-agent orchestration but are not optimized for software engineering. Their group-chat models lack the structured state and safety guarantees we propose.
8.3 Shared State in Multi-Agent Systems
In classical multi-agent systems (pre-LLM), shared blackboards and tuple spaces have been studied extensively. However, these systems used symbolic reasoning, not LLMs, and did not face the unique challenges of LLM-based code generation (hallucinated interfaces, unsafe tool use, token budgets). Our SSOT is specifically designed for the generative, probabilistic, and tool-using nature of LLM agents.
8.4 Safety and Guardrails for Agents
Recent work on agent safety focuses on prompt injection detection and output filtering. Our Harness complements this by enforcing constraints at the action layer: what files can be touched, what commands can be run, and how much budget can be spent. This is closer to operating system security than input sanitization.
9. Timeline & Milestones
Phase 1: Foundation (Weeks 1–4)
- [ ] Finalize system architecture and SSOT schema
- [ ] Set up development environment (litellm, e2b, PydanticAI)
- [ ] Implement SSOT Hub core (Requirements Tree + Contract Registry)
- [ ] Build minimal Agent Harness (sandbox + basic ACL)
- Deliverable: Two agents can read/write to SSOT and execute code in sandbox
Phase 2: Core System (Weeks 5–10)
- [ ] Implement Auto-Loop Controller with all six working states (PLAN, ASSIGN, EXEC, VALID, COMMIT, CHECK) plus the DONE and FAIL terminal states
- [ ] Build all agent roles (Planning, Backend, Frontend, Test, Review)
- [ ] Integrate Contract Validator (rule-based consistency checker)
- [ ] Implement rollback mechanism and retry budget logic
- [ ] Construct Contract-Bench v1 (20 tasks)
- Deliverable: End-to-end autonomous execution on simple tasks
Phase 3: Evaluation & Hardening (Weeks 11–18)
- [ ] Expand Contract-Bench to 80 tasks
- [ ] Implement all baselines (Single-Agent, Agyn if code available, AutoGen)
- [ ] Run full benchmark suite and collect metrics
- [ ] Implement ablation studies (no SSOT, no Harness, no auto-loop)
- [ ] Debug stability issues; harden error recovery
- Deliverable: Complete experimental results with statistical significance testing
Phase 4: Writing & Submission (Weeks 19–26)
- [ ] Write paper draft (Introduction, Related Work, Method, Experiments)
- [ ] Prepare artifact package (code + benchmark + documentation)
- [ ] Internal review and revision
- [ ] Submit to target conference (ICSE/FSE/ASE cycle)
- Deliverable: Submitted paper + open-source repository
Phase 5: Buffer & Revision (Weeks 27–36)
- [ ] Address reviewer feedback (if applicable)
- [ ] Extend system based on experimental insights
- [ ] Prepare camera-ready version and artifact evaluation
- Deliverable: Published paper
10. Risk Analysis & Mitigation
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Agyn code is not open-sourced | Medium | Medium | Pivot to AutoGen v0.4 as the base; the Agyn comparison becomes conceptual rather than empirical |
| SSOT concurrency causes deadlocks | Medium | High | Start with coarse-grained locks (per-domain); optimize later; deadlock detection with timeout |
| LLM inconsistency makes contracts unreliable | High | Medium | Hybrid extraction: AST parsing for signatures + LLM for semantic summaries; report extraction accuracy |
| Contract-Bench tasks too easy/too hard | Medium | Medium | Iterate task design with pilot runs; include difficulty grading; report per-difficulty results |
| Experiments show no improvement over Agyn | Low | High | Ensure benchmark emphasizes multi-module integration; if null result, pivot paper to diagnostic analysis of why SSOT doesn't help |
| e2b/litellm API changes | Low | Medium | Pin dependency versions; keep Docker-based fallback for sandbox |
11. Expected Outcomes
11.1 Academic
- One full-length research paper submitted to a top-tier SE or AI conference
- One benchmark dataset (Contract-Bench) released to the community
- Citable evidence on the efficacy of shared-state vs. message-passing coordination in LLM-based SE
11.2 Engineering
- Open-source reference implementation (~3000–4000 LOC) with clear documentation
- Reproducible experimental pipeline (one-command benchmark execution)
- Modular design enabling adoption in other agent frameworks
11.3 Broader Impact
- A principled framework for safe autonomous software development
- Design patterns (SSOT, Harness, Auto-Loop) applicable beyond code generation (e.g., data engineering, scientific computing, hardware design)
12. Appendix: Preliminary SSOT Schema
# requirements_tree.yaml
requirements:
  - id: R0
    description: "Implement a user authentication system"
    status: IN_PROGRESS
    children:
      - id: R0.1
        description: "Backend: JWT token generation and validation"
        status: DONE
        assignee: backend_agent
        acceptance_criteria:
          - "POST /auth/login returns valid JWT"
          - "JWT contains user_id and exp claims"
        dependencies: []
      - id: R0.2
        description: "Frontend: Login form and protected routes"
        status: IN_PROGRESS
        assignee: frontend_agent
        acceptance_criteria:
          - "Login form calls POST /auth/login"
          - "401 redirects to /login"
        dependencies: [R0.1]

# contract_registry.yaml
contracts:
  - id: C1
    name: "AuthAPI"
    owner: backend_agent
    consumers: [frontend_agent]
    version: 2
    status: STABLE
    schema:
      endpoints:
        - path: "/auth/login"
          method: POST
          request: { email: string, password: string }
          response: { token: string, expires_in: int }
        - path: "/auth/verify"
          method: GET
          headers: { Authorization: "Bearer <token>" }
          response: { user_id: string, valid: bool }
    changelog:
      - version: 2
        change: "Added /auth/verify endpoint"
        timestamp: "2025-04-23T10:00:00Z"

# execution_log.yaml
log:
  - timestamp: "2025-04-23T10:05:00Z"
    agent_id: backend_agent
    action_type: CODE_WRITE
    target: "src/api/auth.py"
    status: SUCCESS
    ssot_version: "abc123"
  - timestamp: "2025-04-23T10:06:00Z"
    agent_id: frontend_agent
    action_type: CONTRACT_READ
    target: "C1"
    status: SUCCESS
    note: "Consumed version 2 of AuthAPI"
13. Appendix: Agent Harness Policy DSL
# harness_policy.yaml
agents:
  backend_agent:
    filesystem:
      allow:
        - "src/api/**"
        - "tests/api/**"
        - "migrations/**"
      deny:
        - "src/ui/**"
        - "*.env"
    commands:
      allow: ["python", "pytest", "pip", "git"]
      deny: ["rm -rf", "curl", "wget", "docker"]
    budget:
      max_tokens_per_task: 50000
      max_steps_per_epoch: 30
      max_wall_time_minutes: 10
    rollback: true
    human_gate:
      - action_pattern: "DELETE > 50 lines"
        requires_approval: true
      - action_pattern: "MODIFY .env*"
        requires_approval: true
  frontend_agent:
    filesystem:
      allow:
        - "src/ui/**"
        - "tests/ui/**"
        - "public/**"
      deny:
        - "src/api/**"
    commands:
      allow: ["npm", "npx", "vite", "jest", "git"]
      deny: ["curl", "wget", "docker", "sudo"]
    budget:
      max_tokens_per_task: 40000
      max_steps_per_epoch: 25
      max_wall_time_minutes: 8
    rollback: true

Document prepared for the MAESTRO-SSOT project. Last updated: 2025-04-23.