MAESTRO-SSOT: Managed Autonomy in Multi-Agent Software Engineering via Single Source of Truth

Research Proposal (v1.0)
Research Direction: AI Agent / Multi-Agent Systems (Application Layer)
Target Venues: ICSE, FSE (formerly ESEC/FSE), ASE, AAMAS, or other top-tier SE/AI conferences
Estimated Timeline: about 6 months (26 weeks) to first submission; 9 months (36 weeks) including the revision buffer

1. Project Overview

1.1 Title

MAESTRO-SSOT: Managed Autonomy in Multi-Agent Software Engineering via Single Source of Truth and Guarded Execution

1.2 One-Sentence Summary

We propose a multi-agent software engineering system where specialized LLM-based agents collaborate around a shared, structured state (Single Source of Truth) and execute within a restrictive harness, enabling continuous autonomous development loops without human intervention while maintaining strict safety and observability guarantees.

1.3 Core Hypothesis

Explicit shared-state mediation (SSOT) combined with runtime guardrails (Agent Harness) enables multi-agent software engineering systems to achieve higher integration success rates and lower communication overhead than message-passing or pipeline-based alternatives, while remaining safe enough for unsupervised autonomous execution.

2. Background & Motivation

2.1 The Rise of Agentic Software Engineering

Recent advances in LLM-based coding agents have demonstrated impressive capabilities on individual software engineering tasks. Systems such as SWE-agent, Devin, and OpenHands decompose issue resolution into iterative planning, tool use, and execution cycles, and the strongest agent configurations are reported to resolve over 70% of SWE-bench Verified tasks.

However, these systems largely preserve a single-agent or loosely delegated execution model. Even recent multi-agent proposals (e.g., Agyn, AutoGen-based code teams) treat collaboration as message passing between isolated agents: tasks are dispatched, executed independently, and results are merged post-hoc.

2.2 The Hidden Cost of "Divide and Conquer"

Real-world software development is not merely task decomposition—it is a continuous process of negotiating shared understanding:

A backend developer changes an API schema; the frontend developer must adapt
A requirement engineer clarifies a constraint; architects and implementers must reconcile their mental models
A test failure reveals an integration mismatch that spans multiple modules

Current multi-agent systems lack an explicit mechanism for maintaining consensus on evolving shared artifacts. Agents operate on private contexts, leading to:

Integration failures: Agent A generates code assuming interface X; Agent B generates code assuming interface Y
Redundant communication: Agents must repeatedly query each other to synchronize state
Opacity: Humans cannot easily observe what each agent believes the current specification to be
Unsafe autonomy: Without guardrails, an autonomous agent loop may waste tokens, corrupt files, or execute dangerous operations

2.3 Opportunity: From Message Passing to Shared State

We observe that human software teams solve these problems through shared artifacts: Git repositories, API specifications, project management boards, and design documents. These act as a single source of truth (SSOT) that all participants read from and write to.

We hypothesize that endowing multi-agent systems with an analogous explicit SSOT layer—combined with a restrictive execution harness—will yield substantial improvements in integration reliability, communication efficiency, and autonomous execution safety.

3. Problem Definition

3.1 Formal Problem Statement

Given a natural-language software requirement $R$, a set of specialized agents $\mathcal{A} = \{A_1, A_2, \ldots, A_n\}$, and a codebase $C$, construct a system that:

Maintains a shared structured state $S$ (the SSOT) representing requirements, design contracts, execution plans, and current progress
Allows agents to atomically read from and write to $S$ under concurrency control
Executes agent actions within a restricted harness $H$ that enforces safety policies
Runs an autonomous control loop that dispatches agents, validates outcomes, and terminates only when $R$ is satisfied or an unrecoverable failure occurs

3.2 Key Research Questions

RQ1 (State Mediation): Does an explicit SSOT reduce integration failures and communication overhead compared to message-passing multi-agent architectures?

RQ2 (Safety): Can an Agent Harness provide sufficient isolation and guardrails to enable safe unsupervised autonomous execution without sacrificing task completion rates?

RQ3 (Autonomy Spectrum): What is the optimal trade-off between agent autonomy (continuous execution without human approval) and safety (restrictive guardrails)?

RQ4 (Scalability): How does the SSOT architecture scale with the number of agents and the complexity of the shared state?

4. Proposed System: MAESTRO-SSOT

4.1 Design Principles

SSOT as a First-Class Citizen: The shared state is not a log or message bus—it is a structured, queryable, versioned artifact that agents actively manipulate
Agent Harness as Policy Enforcement: Safety is not an afterthought; it is enforced at the action layer via explicit allowlists, budgets, and rollback triggers
Set-and-Forget Autonomy: Once a human provides a requirement, the system runs until completion, failure, or an unambiguous escalation condition
Observability by Design: The SSOT serves as a live dashboard for human supervisors

4.2 System Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                         Human Supervisor                            │
│                    (injects requirements, monitors)                 │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                         SSOT Hub (Shared State)                     │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌───────────┐  │
│  │ Requirements │ │   Contract   │ │   Execution  │ │  Agent    │  │
│  │    Tree      │ │   Registry   │ │     Log      │ │  Memory   │  │
│  └──────────────┘ └──────────────┘ └──────────────┘ └───────────┘  │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │                    SSOT Access Control                        │   │
│  │   (read/write locks, versioning, conflict detection)          │   │
│  └──────────────────────────────────────────────────────────────┘   │
└──────────────────────────────┬──────────────────────────────────────┘
                               │ publish / subscribe / query
        ┌──────────────────────┼──────────────────────┐
        ▼                      ▼                      ▼
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│   Planning   │      │  Execution   │      │   Review     │
│    Agent     │◄────►│    Agents    │◄────►│   Agents     │
│              │      │ (Backend,    │      │              │
│  decomposes  │      │  Frontend,   │      │  validates   │
│  requirements│      │  Test, etc.) │      │  contracts   │
└──────────────┘      └──────┬───────┘      └──────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                      Agent Harness (Guard Layer)                    │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐   │
│  │   Sandbox   │ │  Permission │ │   Budget    │ │   Rollback  │   │
│  │  (isolated  │ │  (ACL for   │ │  (token &   │ │  (undo on   │   │
│  │   execution)│ │  files/APIs)│ │  step caps) │ │   failure)  │   │
│  └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘   │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │              Action Validator (pre-execution)                 │   │
│  └──────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                     Auto-Loop Controller                            │
│   while not done:                                                   │
│     1. PLAN   ──► decompose requirement via Planning Agent          │
│     2. ASSIGN ──► dispatch subtasks to Execution Agents             │
│     3. EXEC   ──► run actions through Harness                       │
│     4. VALID  ──► Review Agents check contracts & tests             │
│     5. COMMIT ──► update SSOT; on failure, rollback & retry         │
│     6. CHECK  ──► verify requirement satisfaction; loop or exit     │
└─────────────────────────────────────────────────────────────────────┘

4.3 The SSOT Hub: Detailed Design

The SSOT is not a key-value store. It is a semantically structured workspace with four primary domains:

4.3.1 Requirements Tree

Hierarchical decomposition of the top-level requirement $R$
Each node: (id, description, status, assignee, acceptance_criteria, dependencies)
Status enum: PENDING | IN_PROGRESS | BLOCKED | REVIEW | DONE | FAILED
Agents claim nodes, update status, and report completion
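
The node tuple and the claim step described above can be sketched as follows. This is a minimal illustration, not the project's implementation: `RequirementNode`, `claimable`, and `claim` are hypothetical names, and the rule that a node is claimable only when every dependency is DONE is an assumption about how `dependencies` would be used.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Status(Enum):
    PENDING = "PENDING"
    IN_PROGRESS = "IN_PROGRESS"
    BLOCKED = "BLOCKED"
    REVIEW = "REVIEW"
    DONE = "DONE"
    FAILED = "FAILED"


@dataclass
class RequirementNode:
    id: str
    description: str
    status: Status = Status.PENDING
    assignee: Optional[str] = None
    acceptance_criteria: list[str] = field(default_factory=list)
    dependencies: list[str] = field(default_factory=list)


def claimable(nodes: dict[str, RequirementNode]) -> list[str]:
    """Return ids of PENDING nodes whose dependencies are all DONE (assumed rule)."""
    return [
        n.id for n in nodes.values()
        if n.status is Status.PENDING
        and all(nodes[d].status is Status.DONE for d in n.dependencies)
    ]


def claim(nodes: dict[str, RequirementNode], node_id: str, agent_id: str) -> None:
    """Assign a claimable node to an agent and mark it IN_PROGRESS."""
    if node_id not in claimable(nodes):
        raise ValueError(f"{node_id} is not claimable")
    nodes[node_id].assignee = agent_id
    nodes[node_id].status = Status.IN_PROGRESS


# Toy tree: R0.2 waits on R0.1, R0.3 waits on R0.2.
tree = {
    "R0.1": RequirementNode("R0.1", "Backend: JWT token generation", Status.DONE),
    "R0.2": RequirementNode("R0.2", "Frontend: login form", dependencies=["R0.1"]),
    "R0.3": RequirementNode("R0.3", "End-to-end login test", dependencies=["R0.2"]),
}
```

In this toy tree only R0.2 can be claimed at first; R0.3 stays unavailable until R0.2 reaches DONE.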

4.3.2 Contract Registry

Explicit interface contracts between agents/modules
Each contract: (contract_id, owner_module, consumer_modules, schema, version, status)
Status enum: DRAFT | STABLE | DEPRECATED | BROKEN
Critical invariant: Before an agent modifies an exported interface, it must update the contract. Consumers automatically detect version changes.
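
One way to make the invariant concrete is to let only the owner change a contract's schema, bump the version on every change, and track which version each consumer last adapted to. The sketch below assumes exactly that; `consumer_versions`, `acknowledge`, and `stale_consumers` are hypothetical names that stand in for the `consumer_modules` field and the "automatic detection" described above.

```python
from dataclasses import dataclass, field


@dataclass
class Contract:
    contract_id: str
    owner_module: str
    schema: dict
    version: int = 1
    status: str = "DRAFT"
    # Hypothetical bookkeeping: consumer module -> version it last adapted to.
    consumer_versions: dict[str, int] = field(default_factory=dict)

    def update_schema(self, writer: str, new_schema: dict) -> int:
        """Only the owner may change the exported interface; a change bumps the version."""
        if writer != self.owner_module:
            raise PermissionError(f"{writer} does not own {self.contract_id}")
        if new_schema != self.schema:
            self.schema = new_schema
            self.version += 1
        return self.version

    def acknowledge(self, consumer: str) -> None:
        """A consumer records that it has adapted to the current version."""
        self.consumer_versions[consumer] = self.version

    def stale_consumers(self) -> list[str]:
        """Consumers still pinned to a version older than the current one."""
        return [c for c, v in self.consumer_versions.items() if v < self.version]


auth = Contract("C1", "backend_agent", {"POST /auth/login": {"token": "string"}})
auth.acknowledge("frontend_agent")
auth.update_schema("backend_agent", {
    "POST /auth/login": {"token": "string"},
    "GET /auth/verify": {"valid": "bool"},
})
```

After the owner adds an endpoint, `stale_consumers()` names the frontend agent until it acknowledges the new version, which is the signal a scheduler could use to reopen the consumer's task.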

4.3.3 Execution Log

Append-only record of all agent actions and their outcomes
Entry: (timestamp, agent_id, action_type, target, input_hash, output_hash, status, error_if_any)
Enables full reproducibility and forensic analysis
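
A minimal sketch of the entry tuple and the append-only property, assuming SHA-256 over canonical JSON for `input_hash` and `output_hash` (the proposal does not fix a hash function). `digest`, `LogEntry`, and `ExecutionLog` are illustrative names.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


def digest(payload: object) -> str:
    """Stable SHA-256 over a JSON-serializable payload (key order does not matter)."""
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


@dataclass(frozen=True)  # frozen: an entry cannot be mutated after creation
class LogEntry:
    timestamp: str
    agent_id: str
    action_type: str
    target: str
    input_hash: str
    output_hash: str
    status: str
    error_if_any: Optional[str] = None


class ExecutionLog:
    """Append-only: entries can be added and read, never edited or removed."""

    def __init__(self) -> None:
        self._entries: list[LogEntry] = []

    def append(self, entry: LogEntry) -> int:
        self._entries.append(entry)
        return len(self._entries) - 1  # index doubles as a sequence number

    def entries(self) -> tuple[LogEntry, ...]:
        return tuple(self._entries)  # immutable view for readers


log = ExecutionLog()
seq = log.append(LogEntry(
    timestamp="2025-04-23T10:05:00Z",
    agent_id="backend_agent",
    action_type="CODE_WRITE",
    target="src/api/auth.py",
    input_hash=digest({"prompt": "implement login"}),
    output_hash=digest({"diff": "+def login(): ..."}),
    status="SUCCESS",
))
```

Hashing inputs and outputs rather than storing them keeps entries small while still letting a replay detect whether an agent saw the same input or produced the same output.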

4.3.4 Agent Working Memory

Short-term context that agents persist across steps (e.g., research notes, partial solutions)
Distinct from execution log: this is the agent's "scratchpad," not the system's audit trail

4.3.5 SSOT Access Control & Concurrency

Read: Any agent may query any domain (encourages transparency)
Write: Agents acquire locks on the specific sub-tree or contract they intend to modify
Conflict Detection: If two agents write to overlapping contracts within the same planning epoch, the second write is rejected and the agent is notified to reconcile
Versioning: All domains support atomic snapshots, enabling rollback to any prior state
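
The write, conflict, and snapshot rules above might be combined as in the sketch below. It makes two simplifying assumptions: a single coarse-grained lock guards the whole store (rather than per-sub-tree locks), and a "conflict" means a second agent writing a key that another agent already wrote in the current epoch. `SSOTStore`, `ConflictError`, and `end_epoch` are hypothetical names.

```python
import copy
import threading


class ConflictError(Exception):
    """Raised when a second agent writes a key already written this epoch."""


class SSOTStore:
    def __init__(self) -> None:
        self._data: dict[str, object] = {}
        self._written_by: dict[str, str] = {}      # key -> writer in this epoch
        self._snapshots: list[dict[str, object]] = []
        self._lock = threading.Lock()              # coarse-grained: one lock for all domains

    def read(self, key: str) -> object:
        with self._lock:
            return copy.deepcopy(self._data.get(key))

    def write(self, agent_id: str, key: str, value: object) -> None:
        with self._lock:
            owner = self._written_by.get(key)
            if owner is not None and owner != agent_id:
                # Second writer in the same epoch is rejected and must reconcile.
                raise ConflictError(f"{key} already written by {owner} this epoch")
            self._data[key] = copy.deepcopy(value)
            self._written_by[key] = agent_id

    def end_epoch(self) -> int:
        """Take an atomic snapshot, clear write ownership, return the snapshot version."""
        with self._lock:
            self._snapshots.append(copy.deepcopy(self._data))
            self._written_by.clear()
            return len(self._snapshots) - 1

    def rollback(self, version: int) -> None:
        """Restore the state captured by a prior snapshot."""
        with self._lock:
            self._data = copy.deepcopy(self._snapshots[version])
            self._written_by.clear()


store = SSOTStore()
store.write("backend_agent", "contracts/C1", {"version": 1})
v0 = store.end_epoch()
store.write("backend_agent", "contracts/C1", {"version": 2})
```

Deep copies keep readers from mutating shared state by accident; a real implementation would likely trade them for persistent data structures or Git-backed snapshots once the state grows.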

4.4 Agent Harness: Detailed Design

The Harness wraps every agent action. It is not merely a Docker sandbox—it is a policy-aware action interceptor.

4.4.1 Sandbox Layer

Agent code execution runs in isolated containers (Docker or e2b.dev cloud sandboxes)
Network egress restricted to allowlisted domains (e.g., PyPI, npm registry)
File system access limited to the project workspace + temporary directories

4.4.2 Permission Layer (ACL)

Per-agent file access rules: backend_agent may write to src/api/ and tests/api/, but not src/ui/
Per-agent tool allowlists: frontend_agent may call npm, vite, jest, but not docker or rm -rf
Per-agent API rules: restrictions on external service calls (e.g., no production database connections)
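
The file and tool rules above could be checked with glob patterns and an executable allowlist, as in this sketch. The `POLICY` dictionary layout and the `may_write` / `may_run` helpers are assumptions made for illustration; paths are normalized first so that `src/api/../ui/...` cannot slip past a prefix pattern.

```python
import posixpath
import shlex
from fnmatch import fnmatchcase

# Hypothetical in-memory policy mirroring the examples in the text.
POLICY = {
    "backend_agent": {
        "fs_allow": ["src/api/*", "tests/api/*"],
        "fs_deny": ["src/ui/*"],
        "tools": {"python", "pytest", "pip", "git"},
    },
    "frontend_agent": {
        "fs_allow": ["src/ui/*", "tests/ui/*"],
        "fs_deny": ["src/api/*"],
        "tools": {"npm", "vite", "jest"},
    },
}


def may_write(agent: str, path: str) -> bool:
    """Deny wins over allow; anything not explicitly allowed is denied."""
    rules = POLICY[agent]
    norm = posixpath.normpath(path)  # collapses "a/../b" before matching
    if norm.startswith(("/", "..")):
        return False  # absolute paths and escapes from the workspace
    if any(fnmatchcase(norm, pat) for pat in rules["fs_deny"]):
        return False
    return any(fnmatchcase(norm, pat) for pat in rules["fs_allow"])


def may_run(agent: str, command_line: str) -> bool:
    """Check only the executable name (first token) against the allowlist."""
    tokens = shlex.split(command_line)
    return bool(tokens) and tokens[0] in POLICY[agent]["tools"]
```

Checking only the first token is deliberately simple: an allowlisted interpreter such as `python` can still run arbitrary code, so this layer is meant to work together with the sandbox rather than replace it.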

4.4.3 Budget Layer

Token Budget: Max cumulative tokens per agent per task (prevents infinite LLM loops)
Step Budget: Max number of tool calls per agent per epoch (prevents runaway execution)
Time Budget: Max wall-clock time per subtask (prevents hanging processes)
When a budget is exhausted, the agent is paused and the Auto-Loop either retries with a different strategy or escalates to the human
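
A minimal sketch of a tracker for the three budgets, assuming the harness calls `charge` after every LLM call or tool call. `Budget` and `BudgetExceeded` are hypothetical names, and the numeric caps in the usage lines are illustrative only.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Carries the name of the budget that ran out: 'token', 'step', or 'time'."""


@dataclass
class Budget:
    max_tokens: int
    max_steps: int
    max_seconds: float
    tokens: int = 0
    steps: int = 0
    started: float = field(default_factory=time.monotonic)

    def charge(self, tokens: int = 0, steps: int = 0) -> None:
        """Record usage, then raise if any cap is now exceeded."""
        self.tokens += tokens
        self.steps += steps
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("token")
        if self.steps > self.max_steps:
            raise BudgetExceeded("step")
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded("time")


budget = Budget(max_tokens=50_000, max_steps=30, max_seconds=600)
budget.charge(tokens=1_200, steps=1)
```

The controller would catch `BudgetExceeded`, pause the agent, and decide between a retry with a different strategy and escalation, using the exception message to tell which cap was hit.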

4.4.4 Rollback Layer

Before any agent writes to the codebase, the Harness creates a snapshot (Git commit or overlayfs checkpoint)
If subsequent validation (tests, contract checks) fails, the system automatically reverts the snapshot
The failed attempt is logged in the Execution Log for learning/avoidance
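
The snapshot-then-validate flow can be sketched with a plain directory copy standing in for the Git commit or overlayfs checkpoint named above. `WorkspaceSnapshot` and `guarded_write` are hypothetical helpers, and the "validation" in the usage lines is just a syntax check chosen to keep the example self-contained.

```python
import shutil
import tempfile
from pathlib import Path


class WorkspaceSnapshot:
    """Copy-based checkpoint of a workspace directory (stand-in for a Git commit)."""

    def __init__(self, workspace: Path) -> None:
        self.workspace = workspace
        self._backup = Path(tempfile.mkdtemp(prefix="harness-snap-")) / "copy"
        shutil.copytree(workspace, self._backup)

    def rollback(self) -> None:
        """Discard every change made since the snapshot was taken."""
        shutil.rmtree(self.workspace)
        shutil.copytree(self._backup, self.workspace)

    def discard(self) -> None:
        """Drop the checkpoint once it is no longer needed."""
        shutil.rmtree(self._backup.parent)


def guarded_write(workspace: Path, apply_change, validate) -> bool:
    """Apply a change; keep it only if validate() passes, otherwise revert."""
    snap = WorkspaceSnapshot(workspace)
    try:
        apply_change(workspace)
        ok = bool(validate(workspace))
    except Exception:
        ok = False  # a crashing change or validator counts as a failed attempt
    if not ok:
        snap.rollback()
    snap.discard()
    return ok


def bad_change(root: Path) -> None:
    (root / "auth.py").write_text("def login(: broken\n")


def compiles(root: Path) -> bool:
    try:
        compile((root / "auth.py").read_text(), "auth.py", "exec")
        return True
    except SyntaxError:
        return False


ws = Path(tempfile.mkdtemp(prefix="harness-ws-")) / "project"
ws.mkdir()
(ws / "auth.py").write_text("def login(): return 'ok'\n")
kept = guarded_write(ws, bad_change, compiles)
```

Here the broken edit fails validation, so the file is restored to its pre-change content and `kept` is false; the caller would then record the failed attempt in the Execution Log.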

4.4.5 Action Validator (Pre-Execution)

Before an action reaches the sandbox, it passes through a fast rule-based validator:

Does this action violate any ACL rule?
Does this action modify a file outside the agent's assigned scope?
Does this action delete more than N lines without explicit confirmation?
Does this action match a known dangerous pattern (e.g., eval(), os.system() with user input)?

If validation fails, the action is rejected immediately with an explanatory error returned to the agent.
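
The checks listed above might be implemented as a single function that returns either nothing or a reason string for the agent. In this sketch, `Action`, `validate`, and the threshold value are illustrative; "deleted lines" is approximated by the net drop in line count, and dangerous patterns are matched with simple regular expressions rather than real data-flow analysis.

```python
import re
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional

MAX_DELETED_LINES = 50  # the "N" threshold; the value is illustrative
DANGEROUS = [re.compile(r"\beval\s*\("), re.compile(r"\bos\.system\s*\(")]


@dataclass
class Action:
    agent_id: str
    path: str
    old_text: str
    new_text: str


def validate(action: Action, scope: list[str]) -> Optional[str]:
    """Return None if the action may proceed, else a reason the agent can act on."""
    if not any(fnmatchcase(action.path, pat) for pat in scope):
        return f"{action.path} is outside the scope assigned to {action.agent_id}"
    # Approximation: net decrease in line count, not a true diff.
    deleted = len(action.old_text.splitlines()) - len(action.new_text.splitlines())
    if deleted > MAX_DELETED_LINES:
        return f"deletes {deleted} lines (limit {MAX_DELETED_LINES}); needs confirmation"
    for pattern in DANGEROUS:
        # Flag only patterns the action introduces, not ones already in the file.
        if pattern.search(action.new_text) and not pattern.search(action.old_text):
            return f"introduces dangerous pattern: {pattern.pattern}"
    return None
```

Returning the reason as text, rather than a bare boolean, gives the agent something concrete to correct on its next step.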

4.5 Auto-Loop Controller: Detailed Design

The Auto-Loop is the orchestration brain. It is not an LLM itself; it is a deterministic state machine with LLM-powered decision points.

python

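from enum import Enum, auto


class State(Enum):
    PLAN = auto()
    ASSIGN = auto()
    EXEC = auto()
    VALID = auto()
    COMMIT = auto()
    CHECK = auto()
    DONE = auto()
    FAIL = auto()


class AutoLoop:
    """Deterministic state machine; only the injected agents call an LLM."""

    def __init__(self, ssot, harness, planning_agent, scheduler, reviewers,
                 contract_validator, test_runner, requirement_satisfier):
        # Collaborators are injected so the loop can be tested with stubs.
        self.ssot, self.harness = ssot, harness
        self.planning_agent, self.scheduler = planning_agent, scheduler
        self.reviewers = reviewers
        self.contract_validator = contract_validator
        self.test_runner = test_runner
        self.requirement_satisfier = requirement_satisfier
        self.state = State.PLAN

    def run(self, requirement: str) -> State:
        self.ssot.initialize(requirement)
        self.state = State.PLAN
        while self.state not in (State.DONE, State.FAIL):
            # Patterns must be dotted names (State.PLAN); a bare name would capture.
            match self.state:
                case State.PLAN:
                    plan = self.planning_agent.decompose(self.ssot.requirements)
                    self.ssot.requirements_tree.commit(plan)
                    self.state = State.ASSIGN
                case State.ASSIGN:
                    for task in self.ssot.requirements_tree.pending():
                        agent = self.scheduler.select_best_agent(task)
                        agent.claim(task)
                    self.state = State.EXEC
                case State.EXEC:
                    for agent in self.ssot.active_agents():
                        result = self.harness.run(agent, agent.next_action())
                        self.ssot.execution_log.append(result)
                    self.state = State.VALID
                case State.VALID:
                    review_results = [r.check() for r in self.reviewers]
                    contracts_valid = self.contract_validator.check_all()
                    tests_pass = self.test_runner.run_all()
                    if all(review_results) and contracts_valid and tests_pass:
                        self.state = State.COMMIT
                    else:
                        self.harness.rollback_all()  # epoch-level revert
                        # Bound the retry loop so a failing epoch cannot replan forever.
                        exhausted = self.ssot.retry_budget_exhausted()
                        self.state = State.FAIL if exhausted else State.PLAN
                case State.COMMIT:
                    self.ssot.snapshot()  # tag a clean state
                    self.state = State.CHECK
                case State.CHECK:
                    if self.requirement_satisfier.verify(self.ssot, requirement):
                        self.state = State.DONE
                    elif self.ssot.retry_budget_exhausted():
                        self.state = State.FAIL
                    else:
                        self.state = State.PLAN  # refine remaining work
        return self.state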

Key Design Decisions:

The loop is deterministic except for LLM calls inside agents; this makes debugging and reproducibility tractable
Rollback is epoch-level, not single-action: if any validation fails, the entire epoch is reverted, ensuring the SSOT and codebase remain consistent
Retry budget: The system is allowed N full-cycle retries before declaring failure and notifying the human

4.6 Agent Roles

We propose a minimal but sufficient set of agent roles:

Role	Responsibility	SSOT Read Access	SSOT Write Access
Planning Agent	Decompose requirements, resolve ambiguities, detect contradictions	All	Requirements Tree
Backend Agent	Implement server-side logic, APIs, database schemas	All	Contract Registry (API contracts), Codebase (`src/api/`)
Frontend Agent	Implement UI components, client-side logic	All	Codebase (`src/ui/`); consumes API contracts from the Contract Registry read-only
Test Agent	Write and run unit/integration tests, report coverage	All	Codebase (`tests/`), Execution Log
Review Agent	Check contract compliance, code quality, security issues	All	Execution Log (annotations)
Contract Validator	(Non-LLM rule engine) Verify that all consumers match their declared contract versions	Contract Registry	Contract Registry (status updates)

5. Technical Implementation Strategy

5.1 Build vs. Reuse

We adopt a "thin skeleton + core novelty" strategy. We do not rebuild general-purpose agent infrastructure; we build the SSOT, Harness, and Auto-Loop layers atop proven foundations.

Component	Reuse Decision	Rationale
LLM API abstraction	litellm	Unified interface for OpenAI, Anthropic, local models; 1-line provider switching
Agent base abstraction	PydanticAI (or AutoGen v0.4 Core)	Lightweight structured agent framework; supports typed tool calls and structured outputs
Code sandbox	e2b.dev (or Docker SDK)	Production-grade isolated execution without maintaining our own container infra
Git operations	GitPython	Battle-tested library for programmatic repo manipulation
AST parsing / code analysis	tree-sitter + Jedi	Extract API signatures for contract registry without writing custom parsers
Test execution	pytest / jest CLI wrappers	No reinvention needed

5.2 Self-Built Components (Novelty Layer)

These are the components we implement from scratch and constitute the paper's core contribution:

ssot/ — The shared state layer (~800–1200 LOC)
- Domain models: RequirementsTree, ContractRegistry, ExecutionLog, AgentMemory
- Access control: LockManager, VersionManager, ConflictDetector
- Persistence: JSON/YAML snapshotting with Git-backed versioning
harness/ — The guard layer (~600–1000 LOC)
- Sandbox wrapper with ACL enforcement
- Budget tracker (token, step, time)
- Rollback orchestrator (Git snapshots or overlayfs)
- Action validator (rule-based pre-filter)
loop/ — The autonomous controller (~400–600 LOC)
- Deterministic state machine
- Scheduler for agent-task assignment
- Retry budget manager
- Human escalation gateway
agents/ — Role-specific agents (~600–1000 LOC)
- Prompt templates and tool bindings for each role
- SSOT read/write integration

Total estimated codebase for paper: roughly 2400–3800 lines of Python (the sum of the four component estimates above).

6. Experimental Design

6.1 Datasets & Benchmarks

We design experiments around two complementary evaluation sets:

6.1.1 Contract-Bench (Self-Constructed, ~80 tasks)

A novel benchmark specifically designed to test multi-module collaboration. Each task requires at least two agents to produce code that integrates correctly.

Example tasks:

"Implement a REST API with JWT authentication and a React frontend that uses it. Include login/logout flows and protected route handling."
"Build a Python CLI tool that reads from a SQLite database and exports CSV. The database schema must be created by a separate module and shared via a typed interface."
"Implement a real-time chat WebSocket server and a simple HTML client. Messages must be persisted to Redis and broadcast to all connected clients."

Evaluation criteria per task:

Compilation / dependency resolution success
All integration tests pass
Contract consistency verified (no caller invokes a non-existent endpoint)
Code quality (linting, type checking)

6.1.2 SWE-bench Multi (Filtered Subset, ~50 tasks)

We filter SWE-bench Pro / SWE-EVO to retain only issues that:

Touch ≥3 source files
Involve cross-module changes (e.g., API change + consumer update)
Have clear test criteria

This subset specifically rewards systems that can coordinate changes across modules.
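
The three filter criteria could be applied mechanically, as in the sketch below. The `Issue` record and its field names are hypothetical (they are not the schema of any existing benchmark), the toy records are invented for illustration, and "cross-module" is approximated as "source files under at least two top-level directories", which is only one possible operationalization.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class Issue:
    issue_id: str
    changed_files: list[str]
    fail_to_pass_tests: list[str]  # tests that define success for the issue


def top_level_module(path: str) -> str:
    """First directory component, or '' for files at the repository root."""
    parts = PurePosixPath(path).parts
    return parts[0] if len(parts) > 1 else ""


def keep(issue: Issue, min_files: int = 3) -> bool:
    source = [f for f in issue.changed_files if not f.startswith("tests/")]
    modules = {top_level_module(f) for f in source}
    return (
        len(source) >= min_files               # touches >= 3 source files
        and len(modules) >= 2                  # spans more than one module (assumed proxy)
        and len(issue.fail_to_pass_tests) > 0  # has clear test criteria
    )


# Toy records, invented for illustration only.
candidates = [
    Issue("toy-1", ["api/schema.py", "api/views.py", "client/sdk.py",
                    "tests/test_api.py"], ["tests/test_api.py::test_schema"]),
    Issue("toy-2", ["api/schema.py", "api/views.py", "api/urls.py"], ["t"]),
    Issue("toy-3", ["api/a.py", "client/b.py"], ["t"]),
]
selected = [i.issue_id for i in candidates if keep(i)]
```

Only the first toy record survives: the second stays inside one module and the third changes too few source files.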

6.2 Baselines

Baseline	Description	Why Compare
Single-Agent	SWE-agent style: one agent handles the entire task	Establishes whether multi-agent is beneficial at all
Agyn	State-of-the-art multi-agent SE (role-based pipeline, message passing)	Direct competitor; shows value of SSOT vs. message passing
AutoGen GroupChat	Default multi-agent conversation	Shows value of explicit harness and loop design
MAESTRO-SSOT (full)	Our complete system	Primary contribution
Ablations	Full system minus Harness / minus SSOT / minus auto-loop	Isolates contribution of each component

6.3 Evaluation Metrics

6.3.1 Functional Correctness

Pass@1: Task passes all tests on first completion
Integration Success Rate: Percentage of multi-module tasks where all modules compile and interfaces align
Contract Violation Rate: Instances where Agent A's code calls an interface that does not match Agent B's implementation, normalized and reported as a percentage
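
To make the three definitions concrete, the sketch below aggregates per-task records into percentages. `TaskRun` and its fields are hypothetical, the records are toy values rather than results, and the Contract Violation Rate is computed here as the share of tasks with at least one violation, which is an assumption about the denominator.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    task_id: str
    first_attempt_passed: bool  # all tests pass on the first completed attempt
    modules_integrate: bool     # every module builds and interfaces align
    contract_violations: int    # mismatched cross-agent interface calls


def rate(flags) -> float:
    """Percentage of true values; 0.0 for an empty input."""
    flags = list(flags)
    return 100.0 * sum(flags) / len(flags) if flags else 0.0


def summarize(runs: list[TaskRun]) -> dict[str, float]:
    return {
        "pass_at_1": rate(r.first_attempt_passed for r in runs),
        "integration_success_rate": rate(r.modules_integrate for r in runs),
        # Assumed denominator: tasks with at least one violation / all tasks.
        "contract_violation_rate": rate(r.contract_violations > 0 for r in runs),
    }


# Toy records, not experimental data.
runs = [
    TaskRun("toy-1", True, True, 0),
    TaskRun("toy-2", False, True, 0),
    TaskRun("toy-3", False, False, 2),
    TaskRun("toy-4", True, True, 0),
]
summary = summarize(runs)
```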

6.3.2 Efficiency

Total Tokens Consumed: Cumulative LLM tokens across all agents
Communication Rounds: Number of inter-agent message exchanges (for message-passing baselines) vs. SSOT read/write operations
Wall-Clock Time: End-to-end execution time
Steps to Completion: Number of tool calls / planning iterations

6.3.3 Safety & Autonomy

Unsupervised Completion Rate: Percentage of tasks completed without human intervention
Rollback Frequency: How often the harness triggers automatic reversion
Budget Exceeded Rate: How often agents hit token/step/time limits
Dangerous Action Blocked Rate: Harness validator rejections

6.3.4 Observability (User Study, Optional)

Time for a human to diagnose a failure using SSOT vs. message logs
Subjective confidence in system status

6.4 Expected Results

Metric	Single-Agent	Agyn	MAESTRO-SSOT
Pass@1 (Contract-Bench)	30–40%	50–60%	65–75%
Integration Success Rate	40–50%	60–70%	80–90%
Contract Violation Rate	N/A (single module)	15–25%	<5%
Total Tokens (per task)	Baseline	1.5–2× baseline	1.0–1.2× baseline
Unsupervised Completion	80%*	60% (fragile)	85%

* Single-agent unsupervised completion is high because the task scope is narrower; it often fails silently on integration rather than explicitly requesting help.

7. Innovation & Contributions

7.1 Primary Contributions

Single Source of Truth for Multi-Agent SE
- To our knowledge, this is the first work to propose and evaluate an explicit, structured, concurrently accessible shared state as the primary coordination mechanism for LLM-based software engineering agents. This moves beyond message passing to a "shared workspace" paradigm inspired by human team practices.
Agent Harness: Policy-Based Guarded Execution
- We introduce a unified guard layer that combines sandboxing, ACLs, budgets, and automatic rollback into a cohesive policy enforcement framework. Unlike existing code sandboxes (which only isolate execution), our Harness mediates what agents are allowed to do to each other and the shared state.
Autonomous Control Loop with Managed Failure
- We design a deterministic loop that enables continuous autonomous execution while defining clear boundaries for when the system should retry, rollback, or escalate. This provides a principled answer to "how much autonomy is safe?"

7.2 Secondary Contributions

Contract-Bench: A Benchmark for Multi-Module Agent Collaboration
- We release a dataset of software engineering tasks that explicitly require cross-module coordination, filling a gap in existing benchmarks that focus on isolated bug fixes.
Open-Source Reference Implementation
- We will release our system as a modular extension to existing frameworks (initially standalone with clear adapter interfaces), enabling the community to adopt SSOT and Harness principles in their own agent systems.

8. Related Work & Positioning

8.1 Single-Agent Coding Systems

SWE-agent, Devin, OpenHands, and CodeAct represent the state of the art in single-agent software engineering. They excel at sequential reasoning and tool use but are not designed for parallel, multi-module development. Our work is complementary: any of these agents could serve as an Execution Agent within our framework.

8.2 Multi-Agent SE Systems

Agyn (2026) is the closest competitor. It demonstrates that role-based multi-agent teams outperform solo agents on SWE-bench. However, Agyn uses asynchronous message passing and does not explicitly model shared interfaces or provide runtime guardrails. Our SSOT and Harness layers address precisely these omissions.

FlowGen (2025) simulates classic software engineering methodologies (Scrum, Waterfall) with multi-agent workflows. While process-aware, FlowGen uses fixed, pre-defined workflows. Our Auto-Loop is adaptive: it replans based on validation failures and contract changes.

AutoGen and CrewAI provide general-purpose multi-agent orchestration but are not optimized for software engineering. Their group-chat models lack the structured state and safety guarantees we propose.

8.3 Shared State in Multi-Agent Systems

In classical multi-agent systems (pre-LLM), shared blackboards and tuple spaces have been studied extensively. However, these systems used symbolic reasoning, not LLMs, and did not face the unique challenges of LLM-based code generation (hallucinated interfaces, unsafe tool use, token budgets). Our SSOT is specifically designed for the generative, probabilistic, and tool-using nature of LLM agents.

8.4 Safety and Guardrails for Agents

Recent work on agent safety focuses on prompt injection detection and output filtering. Our Harness complements this by enforcing constraints at the action layer: what files can be touched, what commands can be run, and how much budget can be spent. This is closer to operating system security than input sanitization.

9. Timeline & Milestones

Phase 1: Foundation (Weeks 1–4)

[ ] Finalize system architecture and SSOT schema
[ ] Set up development environment (litellm, e2b, PydanticAI)
[ ] Implement SSOT Hub core (Requirements Tree + Contract Registry)
[ ] Build minimal Agent Harness (sandbox + basic ACL)
Deliverable: Two agents can read/write to SSOT and execute code in sandbox

Phase 2: Core System (Weeks 5–10)

[ ] Implement Auto-Loop Controller with all six working states (PLAN, ASSIGN, EXEC, VALID, COMMIT, CHECK) plus the DONE and FAIL terminal states
[ ] Build all agent roles (Planning, Backend, Frontend, Test, Review)
[ ] Integrate Contract Validator (rule-based consistency checker)
[ ] Implement rollback mechanism and retry budget logic
[ ] Construct Contract-Bench v1 (20 tasks)
Deliverable: End-to-end autonomous execution on simple tasks

Phase 3: Evaluation & Hardening (Weeks 11–18)

[ ] Expand Contract-Bench to 80 tasks
[ ] Implement all baselines (Single-Agent, Agyn if code available, AutoGen)
[ ] Run full benchmark suite and collect metrics
[ ] Implement ablation studies (no SSOT, no Harness, no auto-loop)
[ ] Debug stability issues; harden error recovery
Deliverable: Complete experimental results with statistical significance testing

Phase 4: Writing & Submission (Weeks 19–26)

[ ] Write paper draft (Introduction, Related Work, Method, Experiments)
[ ] Prepare artifact package (code + benchmark + documentation)
[ ] Internal review and revision
[ ] Submit to target conference (ICSE/FSE/ASE cycle)
Deliverable: Submitted paper + open-source repository

Phase 5: Buffer & Revision (Weeks 27–36)

[ ] Address reviewer feedback (if applicable)
[ ] Extend system based on experimental insights
[ ] Prepare camera-ready version and artifact evaluation
Deliverable: Published paper

10. Risk Analysis & Mitigation

Risk	Likelihood	Impact	Mitigation
Agyn code is not open-sourced	Medium	Medium	Pivot to AutoGen v0.4 as base; Agyn comparison becomes conceptual rather than empirical
SSOT concurrency causes deadlocks	Medium	High	Start with coarse-grained locks (per-domain); optimize later; deadlock detection with timeout
LLM inconsistency makes contracts unreliable	High	Medium	Hybrid extraction: AST parsing for signatures + LLM for semantic summaries; report extraction accuracy
Contract-Bench tasks too easy/too hard	Medium	Medium	Iterate task design with pilot runs; include difficulty grading; report per-difficulty results
Experiments show no improvement over Agyn	Low	High	Ensure benchmark emphasizes multi-module integration; if null result, pivot paper to diagnostic analysis of why SSOT doesn't help
e2b/litellm API changes	Low	Medium	Pin dependency versions; keep Docker-based fallback for sandbox

11. Expected Outcomes

11.1 Academic

One full-length research paper submitted to a top-tier SE or AI conference
One benchmark dataset (Contract-Bench) released to the community
Citable evidence on the efficacy of shared-state vs. message-passing coordination in LLM-based SE

11.2 Engineering

Open-source reference implementation (~2400–3800 LOC, the sum of the component estimates in Section 5.2) with clear documentation
Reproducible experimental pipeline (one-command benchmark execution)
Modular design enabling adoption in other agent frameworks

11.3 Broader Impact

A principled framework for safe autonomous software development
Design patterns (SSOT, Harness, Auto-Loop) applicable beyond code generation (e.g., data engineering, scientific computing, hardware design)

12. Appendix: Preliminary SSOT Schema

yaml

# requirements_tree.yaml
requirements:
  - id: R0
    description: "Implement a user authentication system"
    status: IN_PROGRESS
    children:
      - id: R0.1
        description: "Backend: JWT token generation and validation"
        status: DONE
        assignee: backend_agent
        acceptance_criteria:
          - "POST /auth/login returns valid JWT"
          - "JWT contains user_id and exp claims"
        dependencies: []
      - id: R0.2
        description: "Frontend: Login form and protected routes"
        status: IN_PROGRESS
        assignee: frontend_agent
        acceptance_criteria:
          - "Login form calls POST /auth/login"
          - "401 redirects to /login"
        dependencies: [R0.1]

# contract_registry.yaml
contracts:
  - id: C1
    name: "AuthAPI"
    owner: backend_agent
    consumers: [frontend_agent]
    version: 2
    status: STABLE
    schema:
      endpoints:
        - path: "/auth/login"
          method: POST
          request: { email: string, password: string }
          response: { token: string, expires_in: int }
        - path: "/auth/verify"
          method: GET
          headers: { Authorization: "Bearer <token>" }
          response: { user_id: string, valid: bool }
    changelog:
      - version: 2
        change: "Added /auth/verify endpoint"
        timestamp: "2025-04-23T10:00:00Z"

# execution_log.yaml
log:
  - timestamp: "2025-04-23T10:05:00Z"
    agent_id: backend_agent
    action_type: CODE_WRITE
    target: "src/api/auth.py"
    status: SUCCESS
    ssot_version: "abc123"
  - timestamp: "2025-04-23T10:06:00Z"
    agent_id: frontend_agent
    action_type: CONTRACT_READ
    target: "C1"
    status: SUCCESS
    note: "Consumed version 2 of AuthAPI"

13. Appendix: Agent Harness Policy DSL

yaml

# harness_policy.yaml
agents:
  backend_agent:
    filesystem:
      allow:
        - "src/api/**"
        - "tests/api/**"
        - "migrations/**"
      deny:
        - "src/ui/**"
        - "*.env"
    commands:
      allow: ["python", "pytest", "pip", "git"]
      deny: ["rm -rf", "curl", "wget", "docker"]
    budget:
      max_tokens_per_task: 50000
      max_steps_per_epoch: 30
      max_wall_time_minutes: 10
    rollback: true
    human_gate:
      - action_pattern: "DELETE > 50 lines"
        requires_approval: true
      - action_pattern: "MODIFY .env*"
        requires_approval: true

  frontend_agent:
    filesystem:
      allow:
        - "src/ui/**"
        - "tests/ui/**"
        - "public/**"
      deny:
        - "src/api/**"
    commands:
      allow: ["npm", "npx", "vite", "jest", "git"]
      deny: ["curl", "wget", "docker", "sudo"]
    budget:
      max_tokens_per_task: 40000
      max_steps_per_epoch: 25
      max_wall_time_minutes: 8
    rollback: true

Document prepared for the MAESTRO-SSOT project. Last updated: 2025-04-23.
