MAESTRO-SSOT Development Plan

Context

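Based on the research proposal (docs/rp.md), this project develops a multi-agent software engineering framework. Core innovations: SSOT shared state in place of message passing, an Agent Harness guard layer, and an autonomous Auto-Loop. The goal is to publish a paper and open-source a reference implementation.

Key decisions:

Agent framework: PydanticAI (type safety, pydantic-graph state machine, TestModel testing support)
Persistence: SQLite runtime storage + YAML snapshot export
Package management: uv workspaces monorepo
Total size: 2500-4000 lines of Python

Monorepo Directory Structure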

maestro-ssot/
├── pyproject.toml              # workspace root, CLI entry point
├── uv.lock
├── README.md
├── CLAUDE.md
├── LICENSE
├── .github/workflows/
│   └── ci.yml                  # lint + type-check + test
├── docs/
│   ├── rp.md                   # existing: research proposal
│   ├── rp-zh.md                # existing: Chinese translation
│   ├── architecture.md         # system architecture document
│   └── api.md                  # API reference
├── configs/
│   ├── harness_policy.yaml     # Agent Harness policy configuration
│   ├── ssot_schema.yaml        # SSOT domain schema definitions
│   └── agent_roles.yaml        # agent role declarations
├── benchmarks/
│   └── contract_bench/         # Contract-Bench evaluation tasks
│       ├── tasks/
│       └── evaluator.py
├── libs/
│   ├── maestro-ssot/           # SSOT Hub shared-state layer
│   │   ├── pyproject.toml
│   │   ├── tests/
│   │   │   ├── conftest.py
│   │   │   ├── test_models.py
│   │   │   ├── test_requirements_tree.py
│   │   │   ├── test_contract_registry.py
│   │   │   ├── test_execution_log.py
│   │   │   ├── test_agent_memory.py
│   │   │   ├── test_access_control.py
│   │   │   └── test_versioning.py
│   │   └── src/maestro_ssot/
│   │       ├── __init__.py
│   │       ├── hub.py          # SSOTHub facade class
│   │       ├── models.py       # Pydantic domain models
│   │       ├── requirements.py # requirements tree
│   │       ├── contracts.py    # contract registry
│   │       ├── execution_log.py# execution log
│   │       ├── memory.py       # agent working memory
│   │       ├── access.py       # LockManager, ConflictDetector
│   │       ├── versioning.py   # VersionManager
│   │       ├── persistence.py  # SQLite persistence
│   │       └── snapshot.py     # YAML/JSON import and export
│   ├── maestro-harness/        # Agent Harness guard layer
│   │   ├── pyproject.toml
│   │   ├── tests/
│   │   │   ├── conftest.py
│   │   │   ├── test_acl.py
│   │   │   ├── test_budget.py
│   │   │   ├── test_validator.py
│   │   │   ├── test_sandbox.py
│   │   │   └── test_harness.py
│   │   └── src/maestro_harness/
│   │       ├── __init__.py
│   │       ├── harness.py      # Harness facade
│   │       ├── acl.py          # file/command ACL
│   │       ├── budget.py       # token/step/time budgets
│   │       ├── validator.py    # action validator
│   │       ├── sandbox.py      # sandbox abstraction (Local/Docker/e2b)
│   │       └── policy.py       # YAML policy loading
│   ├── maestro-loop/           # Auto-Loop autonomous controller
│   │   ├── pyproject.toml
│   │   ├── tests/
│   │   │   ├── conftest.py
│   │   │   ├── test_state_machine.py
│   │   │   ├── test_scheduler.py
│   │   │   └── test_controller.py
│   │   └── src/maestro_loop/
│   │       ├── __init__.py
│   │       ├── controller.py   # AutoLoop state machine (pydantic-graph)
│   │       ├── states.py       # state enum and transition logic
│   │       ├── scheduler.py    # agent-to-task assignment
│   │       └── retry.py        # retry budget management
│   └── maestro-agents/         # role agents
│       ├── pyproject.toml
│       ├── tests/
│       │   ├── conftest.py
│       │   ├── test_agent_base.py
│       │   ├── test_planning_agent.py
│       │   ├── test_backend_agent.py
│       │   ├── test_frontend_agent.py
│       │   ├── test_test_agent.py
│       │   ├── test_review_agent.py
│       │   └── test_contract_validator.py
│       └── src/maestro_agents/
│           ├── __init__.py
│           ├── base.py         # MaestroAgent base class
│           ├── tools.py        # SSOT read/write tools
│           ├── planning.py     # PlanningAgent
│           ├── backend.py      # BackendAgent
│           ├── frontend.py     # FrontendAgent
│           ├── test_agent.py   # TestAgent
│           ├── review.py       # ReviewAgent
│           └── contract_validator.py  # non-LLM contract validator
├── src/maestro/                # root application package
│   ├── __init__.py
│   ├── cli.py                  # CLI entry point (typer)
│   └── demo.py                 # demo script
└── tests/
    ├── conftest.py
    ├── test_integration.py     # end-to-end integration tests
    └── test_demo.py

Package Dependencies (DAG)

maestro-ssot      (standalone, no internal dependencies)
maestro-harness   (standalone, no internal dependencies)
maestro-agents    --> maestro-ssot, maestro-harness
maestro-loop      --> maestro-ssot, maestro-harness, maestro-agents
root application  --> all four packages
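
As a quick sanity check on the dependency graph above, the following sketch derives a valid build order with the standard library's graphlib. The dictionary literal and the name `maestro` for the root application are illustrative assumptions, not part of the project's tooling.

```python
from graphlib import TopologicalSorter

# Each package maps to the internal packages it depends on (copied from the DAG).
# "maestro" is a stand-in name for the root application.
DEPENDENCIES = {
    "maestro-ssot": set(),
    "maestro-harness": set(),
    "maestro-agents": {"maestro-ssot", "maestro-harness"},
    "maestro-loop": {"maestro-ssot", "maestro-harness", "maestro-agents"},
    "maestro": {"maestro-ssot", "maestro-harness", "maestro-agents", "maestro-loop"},
}


def build_order(deps):
    """Return the packages so that every dependency precedes its dependents.

    Raises graphlib.CycleError if the graph is not a DAG.
    """
    return list(TopologicalSorter(deps).static_order())


order = build_order(DEPENDENCIES)
```

Because the two leaf packages have no internal dependencies, either may come first; the root application always comes last.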

Phase 1: Foundation (Weeks 1-4)

Goal: project scaffolding + SSOT core + minimal Harness + two agents that can read and write the SSOT

Week 1: Repository Initialization + SSOT Models

[x] Initialize the uv workspace monorepo structure
- Root pyproject.toml defines the workspace and dev dependencies (pytest, ruff, mypy, pyright)
- A separate pyproject.toml for each of the four lib packages
- uv sync verifies that all packages resolve
[x] Implement SSOT domain models in libs/maestro-ssot/src/maestro_ssot/models.py
- RequirementNode: id, description, status (enum), assignee, acceptance_criteria, dependencies, children
- Contract: contract_id, name, owner, consumers, version, status (enum), schema, changelog
- ExecutionLogEntry: timestamp, agent_id, action_type, target, input_hash, output_hash, status, error
- AgentMemoryEntry: agent_id, key, value, timestamp
- SSOTVersion: version_id, timestamp, snapshot_hash, description
[x] Implement SQLite persistence in libs/maestro-ssot/src/maestro_ssot/persistence.py
- Five tables (requirements, contracts, execution_log, agent_memory, ssot_versions)
- WAL mode, CRUD operations, sequential migrations
[x] Unit tests: model validation, persistence CRUD (:memory: database)
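
The persistence bullets above can be sketched with the standard sqlite3 module. This is a minimal illustration for one of the five tables only; the column types, the composite primary key, and the helper names `open_db`, `write_memory`, and `read_memory` are assumptions rather than the project's actual schema or API.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_db(path):
    """Open the database, request WAL mode, and create the agent_memory table."""
    conn = sqlite3.connect(str(path))
    # WAL applies to file-backed databases; a ":memory:" database reports "memory".
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        """CREATE TABLE IF NOT EXISTS agent_memory (
               agent_id  TEXT NOT NULL,
               key       TEXT NOT NULL,
               value     TEXT NOT NULL,
               timestamp TEXT NOT NULL,
               PRIMARY KEY (agent_id, key)  -- assumed: one value per agent and key
           )"""
    )
    return conn


def write_memory(conn, agent_id, key, value, timestamp):
    # "with conn" commits on success and rolls back if the statement fails.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO agent_memory VALUES (?, ?, ?, ?)",
            (agent_id, key, value, timestamp),
        )


def read_memory(conn, agent_id, key):
    row = conn.execute(
        "SELECT value FROM agent_memory WHERE agent_id = ? AND key = ?",
        (agent_id, key),
    ).fetchone()
    return row[0] if row else None


with tempfile.TemporaryDirectory() as tmp:
    conn = open_db(Path(tmp) / "ssot.db")
    write_memory(conn, "backend", "plan", "v1", "2025-01-01T00:00:00Z")
    write_memory(conn, "backend", "plan", "v2", "2025-01-01T00:01:00Z")
    stored = read_memory(conn, "backend", "plan")
    missing = read_memory(conn, "backend", "absent")
    conn.close()
```

The second write overwrites the first, so a read returns the latest value for the key.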

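Week 2: SSOT Domain Logic + Access Control

[x] Implement requirements.py: add_node, update_status, claim_node, get_pending, dependency-blocking logic
[x] Implement contracts.py: register, update_schema, version bump, status transitions
[x] Implement execution_log.py: append-only writes; queries by agent, time, and type
[x] Implement memory.py: read/write/delete, key listing
[x] Implement access.py: LockManager (acquire/release/auto-expire), ConflictDetector
[x] Implement versioning.py + snapshot.py: snapshot create/restore, YAML export/import
[x] Implement hub.py: SSOTHub facade that composes all domains and enforces access control before every write
[x] Unit tests: concurrent locks, dependency blocking, version round-trip, YAML export/import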
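
To make the acquire/release/auto-expire behaviour of LockManager concrete, here is a small single-threaded sketch. The constructor parameters, the boolean return values, and the injectable clock are assumptions made for illustration; the real access.py may differ.

```python
import time


class LockManager:
    """Per-resource locks that expire after a time-to-live (illustrative sketch)."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._locks = {}  # resource -> (owner agent_id, expiry time)

    def acquire(self, resource, agent_id):
        """Take the lock, or refresh it if the caller already holds it."""
        now = self._clock()
        holder = self._locks.get(resource)
        if holder is not None and holder[0] != agent_id and holder[1] > now:
            return False  # held by another agent and not yet expired
        self._locks[resource] = (agent_id, now + self._ttl)
        return True

    def release(self, resource, agent_id):
        """Release the lock; only the current owner may do so."""
        holder = self._locks.get(resource)
        if holder is None or holder[0] != agent_id:
            return False
        del self._locks[resource]
        return True


now = [0.0]
manager = LockManager(ttl_seconds=10, clock=lambda: now[0])
first = manager.acquire("req-1", "backend")
blocked = manager.acquire("req-1", "frontend")
now[0] = 11.0  # the backend lock has expired
taken_over = manager.acquire("req-1", "frontend")
```

An expired lock is simply treated as free on the next acquire, so no background cleanup thread is needed in this sketch.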

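Week 3: Minimal Harness + Agent Base Class

[x] Implement policy.py: parse harness_policy.yaml into Pydantic models
[x] Implement acl.py: glob path matching, command prefix matching, deny takes precedence
[x] Implement budget.py: token/step/time tracking; raises BudgetExhaustedError when exhausted
[x] Implement validator.py: ACL checks, dangerous-pattern detection (eval, os.system, rm -rf), deletion threshold
[x] Implement sandbox.py: SandboxProvider protocol + LocalSandbox (subprocess, Phase 1)
[x] Implement harness.py: Harness facade with a validate → ACL → budget → execute → log pipeline
[x] Implement base.py: MaestroAgent base class (wraps a PydanticAI Agent, injects SSOT + Harness)
[x] Implement tools.py: PydanticAI function tools (ssot_read_requirements, ssot_write_contract, etc.)
[x] Implement contract_validator.py: non-LLM rule engine that checks that consumers match contract versions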
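
The ACL rules listed above (glob path matching, command prefix matching, deny first) might look roughly like this. The function names are hypothetical, and fnmatch-style globbing is an assumption; in fnmatch, `*` also matches path separators, which may be looser than the project's real matcher.

```python
from fnmatch import fnmatchcase


def path_allowed(path, allow, deny):
    """Deny patterns win; otherwise the path must match an allow pattern."""
    if any(fnmatchcase(path, pattern) for pattern in deny):
        return False
    return any(fnmatchcase(path, pattern) for pattern in allow)


def _has_prefix(command, prefix):
    # Match on word boundaries so "python" does not also allow "pythonx".
    return command == prefix or command.startswith(prefix + " ")


def command_allowed(command, allow, deny):
    """Deny prefixes win; otherwise the command must start with an allowed prefix."""
    command = command.strip()
    if any(_has_prefix(command, prefix) for prefix in deny):
        return False
    return any(_has_prefix(command, prefix) for prefix in allow)
```

For example, with `allow=["src/**"]` and `deny=["*.env"]`, the path `src/app/main.py` is allowed while `src/.env` is rejected, because the deny list is consulted first.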

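Week 4: Integration + Demo + CI/CD

[x] Implement PlanningAgent and BackendAgent (testable with PydanticAI TestModel)
[x] Build the Phase 1 demo in src/maestro/demo.py
- Scenario: the input is "Implement a JWT user authentication system"
- PlanningAgent decomposes it into a requirements tree → BackendAgent reads the contracts and implements them → the SSOT state is printed
[x] Integration test: SSOTHub + Harness + two agents end to end (TestModel)
[x] Set up GitHub Actions CI: lint (ruff) + type-check (mypy) + test (pytest --cov)
[ ] Documentation: architecture.md, README.md

Phase 1 deliverable: two agents can read and write the SSOT and execute code in a sandbox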

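Phase 1.5: CLI Productization (Weeks 5-10, in parallel with Phase 2)

Goal: turn the framework into a CLI application similar to Claude Code / Codex, with Rich TUI output

Design decisions:

CLI framework: Typer
TUI output: Rich
REPL input: prompt_toolkit
Configuration: layered TOML, with ~/.maestro/config.toml (global) + .maestro/config.toml (project-level); the project level takes precedence
State persistence: per-project .maestro/ssot.db (SQLite)
Initialization: lazy and automatic (.maestro/ is created on the first command; no manual init needed)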

CLI Commands

maestro                          # interactive REPL (default entry point)
maestro run <task-description>   # run a task non-interactively
maestro status                   # show project status
maestro config [key] [value]     # view/edit configuration
maestro demo                     # Phase 1 demo

Week 5-6: CLI Basics + Status Output (in parallel with Auto-Loop)

[x] Add typer and prompt-toolkit to the root pyproject.toml dependencies
[x] Implement src/maestro/config.py: Pydantic config models, layered loading (global/project/CLI overrides)
[x] Implement src/maestro/project.py: project init, .maestro/ directory management, project root resolution
[x] Implement src/maestro/session.py: session state management (SSOTHub + Harness singletons)
[x] Implement src/maestro/output.py: Rich rendering functions (requirements tree, contract table, log table, banner)
[x] Replace the src/maestro/cli.py stub with a Typer app (init, status, config, demo, --version)
[x] Implement src/maestro/commands/status.py: the maestro status command
[x] Implement src/maestro/commands/config_cmd.py: the maestro config command
[x] Add the Harness.from_policy class method (libs/maestro-harness/)
[x] Embed the default harness_policy.yaml (importlib.resources)
[x] Update .gitignore: ignore .maestro/*.db

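Week 7-8: Run Command + Streaming Output (in parallel with agent roles)

[x] Implement src/maestro/commands/run.py: non-interactive task execution
[ ] Implement Rich Live streaming output (real-time display of agent progress)
[x] Auto-Loop integration: use AutoLoop when available, otherwise fall back to sequential execution
[x] Async bridge: asyncio.run() drives the async agent.run()
[x] Integration test: maestro run --task "..." using TestModel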

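Week 9-10: REPL + Polish (in parallel with rollback + Contract-Bench)

[x] Implement src/maestro/repl.py: prompt_toolkit REPL, command parsing, history
[x] REPL chat mode: plain input submits a requirement, /command runs a command
[x] Built-in REPL slash commands: /task, /status, /contracts, /log, /config, /help, /quit
[x] Tab completion (slash command names + requirement IDs)
[ ] Error UX: Rich panel showing the error plus suggested actions
[x] maestro (no arguments) starts the REPL by default
[x] Update CI to include CLI tests
[x] Update README and architecture.md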

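Week 9-10b: .env Support + Interactive Configuration Wizard

[x] Modify config.py: auto-initialize the global config directory ~/.maestro/ (config.toml)
[x] Modify session.py: simplify the config loading flow
[x] Modify project.py:
- init_project() explicitly creates the full project config (config.toml, fully commented template)
- find_project_root() looks for a .maestro/ directory (excluding the home directory)
[x] Rewrite config_cmd.py: maestro config with no arguments starts an interactive wizard; API keys are written directly to config.toml
[x] API keys live in config.toml only (.env is no longer used); .gitignore already ignores .maestro/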

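Week 9-10c: Configuration Unification Refactor

[x] Remove the maestro init command: the first run of any command creates the full .maestro/config.toml (fully commented template)
[x] Separate provider and model: the model field no longer carries a provider: prefix; provider is its own field
[x] Agent merge mode: 5 default agents + config overrides/additions/removals (disabled = true)
[x] Agent permission control: AgentPermissions controls which SSOT tools are registered (read/write requirements/contracts/memory/log)
[x] Remove policy.yaml: Harness policy is merged into the [[agents]] blocks of config.toml
- filesystem_allow/deny, commands_allow/deny, max_tokens/steps/wall_time, rollback
[x] Configuration is the source of truth: the system creates exactly the agents listed in config.toml, with no hidden fallback

Phase 1.5 deliverable: an installable CLI application supporting maestro run/status/config and an interactive REPL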

New File Structure

src/maestro/
    cli.py              # Typer app + subcommand entry points
    config.py           # config models, layered loading, merging
    project.py          # project initialization, .maestro/ management
    session.py          # session state (hub, harness, agents)
    repl.py             # interactive REPL (prompt_toolkit)
    output.py           # Rich rendering functions
    demo.py             # existing: Phase 1 demo
    commands/
        __init__.py
        run.py           # maestro run logic
        status.py        # maestro status logic
        config_cmd.py    # maestro config logic

Configuration File Schema

~/.maestro/config.toml (global, active by default):

toml

[llm]
provider = "anthropic"
model = "claude-sonnet-4-20250514"

[[agents]]
role = "planner"
provider = "anthropic"
model = "claude-sonnet-4-20250514"
keywords = ["plan", "decompose", "design", "architecture"]

[[agents]]
role = "backend"
provider = "anthropic"
model = "claude-sonnet-4-20250514"
filesystem_allow = [
    "src/**", "tests/**", "migrations/**",
    "*.py", "*.toml", "*.yaml", "*.json", "*.md",
]
filesystem_deny = ["*.env"]
commands_allow = ["python", "pytest", "pip", "git", "uv"]
commands_deny = ["rm -rf", "curl", "wget", "docker", "sudo"]
max_tokens = 50000
max_steps = 30
max_wall_time = 10
rollback = true

# ... frontend, test, and review follow the same pattern

.maestro/config.toml (project-level, fully commented by default, inherits from global):

toml

# [project]
# name = "my-project"

# [llm]
# provider = "openai"
# model = "gpt-4o"

# Agent configuration is treated as a WHOLE:
# defining any [[agents]] here replaces the ENTIRE global agent list.
# [[agents]]
# role = "backend"
# filesystem_allow = ["src/**", "tests/**"]
# filesystem_deny = ["*.env"]
# commands_allow = ["python", "pytest", "git"]
# commands_deny = ["rm -rf", "curl", "wget", "docker", "sudo"]
#
# [agents.permissions]
# read_requirements = true
# write_requirements = true
# read_contracts = true
# write_contracts = true
# read_memory = true
# write_memory = true
# read_log = true

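Phase 2: Core System (Weeks 5-10)

Goal: complete Auto-Loop + all agent roles + rollback mechanism + Contract-Bench v1

Week 5-6: Auto-Loop Controller

[x] Implement states.py: LoopState enum (PLAN|ASSIGN|EXEC|VALID|COMMIT|CHECK|DONE|FAIL)
[x] Implement controller.py: deterministic state machine built on while + match (pydantic-graph not usable for now)
- PLAN: call PlanningAgent to decompose the requirement → write to the RequirementsTree
- ASSIGN: the Scheduler assigns subtasks to agents
- EXEC: agent actions run wrapped by the Harness
- VALID: ReviewAgent + ContractValidator + test run
- COMMIT: SSOT snapshot
- CHECK: verify that requirements are satisfied → DONE or retry
[x] ~~Implement scheduler.py: keyword-matching assignment~~ → replaced by LLMScheduler: PydanticAI structured output evaluates task semantics and picks the best agent
[x] Implement retry.py: global retry budget N; escalate to a human once exhausted
[x] Tests: state transition correctness, retry logic, fallback to PLAN on failure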

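Week 7-8: All Agent Roles

[x] FrontendAgent: consumes API contracts, implements UI components, invokes the frontend toolchain
[x] TestAgent: generates pytest/jest tests, reports coverage
[x] ReviewAgent: checks contract compliance, code quality, and security issues
[ ] Tune PlanningAgent prompts: handle ambiguity, detect contradictions, support replanning (refined in Phase 3)
[x] All agents are unit-tested with PydanticAI TestModel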

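Week 7-8b: Multi-Provider LLM Integration

[x] Extend config.py: support llm.providers configuration (base_url, api_key)
[x] Add llm.py: model resolution layer supporting openai/anthropic/deepseek/ollama/vllm and others
[x] Modify MaestroAgent.__init__: accept a model parameter and pass it to PydanticAI
[x] Modify Session: resolve model configuration per role (get_provider_for + get_model_for)
[x] Modify run.py / cli.py: --model command-line override (accepts provider:model or a bare model name)
[x] Modify AutoLoop: pass role-specific models inside the state machine
[x] Tests: config parsing, model object creation, backward compatibility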

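Week 7-8c: Dynamic Agent Configuration

[x] Extend AgentConfig: support provider, system_prompt, keywords, disabled, permissions
[x] Add agent_factory.py: create Agent subclasses dynamically from config, injecting permissions and harness policy
[x] Rewrite run.py: start from the 5 default agents; config entries with the same role override, new roles are appended, disabled entries are removed
[x] Modify AutoLoop: accept a scheduler_model parameter and pass it to LLMScheduler
[x] Remove keyword-matching logic: drop _DEFAULT_ROLE_KEYWORDS and the role_keywords parameter
[x] Update the default config templates: the global template writes out the 5 default agents (role + harness policy only, inheriting provider/model from [llm]); the project template is fully commented
[x] Tests: configuration is the source of truth (error when no agents are configured; no default agents are created silently)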
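
The override/append/disable behaviour described for run.py could be expressed as a merge keyed by role. This sketch works on plain dictionaries; the function name `merge_agents` and the field-level merge for same-role entries are assumptions, since the plan does not say whether an override replaces the whole entry or only the fields it sets.

```python
def merge_agents(defaults, overrides):
    """Merge agent configs by role.

    - same role: override fields are layered on top of the default entry
    - new role: appended after the defaults
    - disabled = true: the role is removed from the result
    """
    merged = {agent["role"]: dict(agent) for agent in defaults}
    for cfg in overrides:
        role = cfg["role"]
        if cfg.get("disabled"):
            merged.pop(role, None)
            continue
        merged[role] = {**merged.get(role, {}), **cfg}
    return list(merged.values())


defaults = [
    {"role": "planner", "max_steps": 30},
    {"role": "backend", "max_steps": 30},
    {"role": "review", "max_steps": 30},
]
overrides = [
    {"role": "backend", "max_steps": 50},   # override an existing role
    {"role": "docs", "max_steps": 10},      # add a new role
    {"role": "review", "disabled": True},   # remove a default role
]
agents = merge_agents(defaults, overrides)
```

Python dictionaries preserve insertion order, so default roles keep their position and new roles land at the end.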

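Week 9-10: Rollback Mechanism + Contract-Bench v1

[x] Implement Git snapshot rollback (GitPython): tag before execution, reset on failure
[x] Implement epoch-level rollback: revert all agent operations when an entire epoch fails
[x] Build the Contract-Bench v1 framework (3 example tasks, extensible to 20)
- REST API + React frontend
- Python CLI + SQLite
- WebSocket server + HTML client
- Each task touches ≥3 files and requires cross-module coordination
[x] End-to-end test: the full Auto-Loop runs on a simulated task

Phase 2 deliverable: end-to-end autonomous execution of simple tasks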

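Phase 2.5: Controller-Mode Refactor (architecture upgrade)

Goal: upgrade AutoLoop from a hard-coded state machine to an LLM-driven Controller, giving the LLM control over the flow.

Problem: the deterministic state machine (PLAN→ASSIGN→EXEC→VALID→COMMIT→CHECK) cannot handle non-standard input such as greetings or simple questions. The Planner is forced to decompose every input, which wastes resources.

Solution: introduce LLMController. On each iteration the LLM observes the full SSOT state and decides the next action.

Core Changes

1. `states.py`: add `ControllerAction`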

python

class ControllerAction(BaseModel):
    action: Literal["DECOMPOSE", "ASSIGN", "EXECUTE", "VERIFY", "COMMIT", "ANSWER", "DONE"]
    target_requirement_id: str | None = None
    agent_role: str | None = None
    answer: str | None = None       # reply sent directly to the user when action="ANSWER"
    reasoning: str

2. `controller.py`: refactor `AutoLoop`

New LLMController: a PydanticAI Agent that outputs a ControllerAction

AutoLoop.run() becomes an LLM-driven loop:

python

while True:
    prompt = self._build_controller_prompt(requirement)
    action = await self.controller.decide(prompt)
    match action.action:
        case "DECOMPOSE": await self._do_plan()
        case "ASSIGN":    await self._do_assign()
        case "EXECUTE":   await self._do_exec()
        case "VERIFY":    await self._do_valid()
        case "COMMIT":    self._do_commit()
        case "ANSWER":    return LoopResult(DONE, message=action.answer)
        case "DONE":      return LoopResult(DONE, message="All done")

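_build_controller_prompt(): builds a rich state snapshot (pending/in-progress/done counts, task list, action history, available agents)
The _do_* methods are kept: they only do the work and no longer set state; they return a bool/result for the Controller to use in its next decision
TestModel fallback: LLMController._test_decide() provides deterministic rules so that CI tests pass

3. `run.py`: simplify the call flow

Remove the external Planner call (run_task used to call the Planner first and then enter AutoLoop)
Create AutoLoop directly and call loop.run(task); the Controller makes all decisions
Demo mode keeps its original logic

Behavior Comparison

Input	Old architecture (state machine)	New architecture (Controller)
"Hello"	Planner is forced to decompose it into 5 subtasks	Controller chooses ANSWER and replies directly
"Build a todo API"	PLAN→ASSIGN→EXEC→VALID→...	DECOMPOSE→ASSIGN→EXECUTE→VERIFY→DONE
"Show status"	Treated as a requirement and decomposed	Controller chooses ANSWER

Design Philosophy

No over-engineered intent recognition: there is no separate intent classifier; the decision is left to the LLM Controller
Agents keep a single responsibility: the Planner only decomposes, the Backend only implements; flow control is decided by the Controller
Backward compatibility: the TestModel fallback keeps all existing tests passing
[x] Refactor states.py: add ControllerAction
[x] Refactor controller.py: LLMController + LLM-driven AutoLoop
[x] Refactor run.py: remove the external Planner; AutoLoop takes over entirely
[x] Update integration tests: adapt test_autoloop_end_to_end to the new Controller
[x] Verification: 29 tests pass, ruff + mypy clean

Phase 2.5 deliverable: an LLM-driven Controller that decides whether to decompose a request or answer it directly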

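Phase 2.5b: TUI Streaming Output + Start Screen Polish

Goal: remove the blank-screen wait while the LLM is running and improve the visual experience of the REPL.

Implemented

[x] ThinkingAnimation: a Rich Spinner (dots) shows an animated indicator during LLM calls
[x] Typewriter effect: results are printed character by character at 8 ms per character, speeding up automatically for long text
[x] Token statistics: AgentResult gains input_tokens/output_tokens; each step shows 📥 / 📤
[x] Controller animation: a "controller deciding" animation is shown before decide() is called
[x] ASCII lizard logo: a green ASCII lizard (the MAESTRO mascot) on the left of the start screen
[x] Agents list: a Panel on the right of the start screen lists each available agent's role + model
[x] 29 tests pass, ruff + mypy clean

Design Notes

ThinkingAnimation lives in the maestro-agents package and is injected into MaestroAgent.run() through the animation parameter
Controller, Planner, Backend, Frontend, and Review each have their own animation label and color
The ANSWER path goes straight to typewriter output to avoid printing twice

Phase 2.5b deliverable: a TUI with live animation and a typewriter effect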

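Phase 2.5c: Streaming Thinking + Tool-Call Visualization

Goal: turn execution from a black box into a white box, so that users see the agent's reasoning and tool calls in real time.

Implemented

src/maestro/streaming.py: AgentStreamLogger consumes PydanticAI run_stream_events()
- PartDeltaEvent(ThinkingPartDelta) → prints 🧠 thinking content in real time
- FunctionToolCallEvent → prints "→ agent calling tool(args)"
- FunctionToolResultEvent → prints "← tool returned result" (truncated)
MaestroAgent.run_stream(): returns an AgentEventStream; TestModel falls back to run() automatically
config.toml sets thinking = true globally by default and supports per-agent overrides
AutoLoop's _do_plan / _do_exec / _do_valid all switch to AgentStreamLogger
ModelSettings(thinking=...) is bridged to the PydanticAI Agent constructor

Design Decisions

Streaming vs. animation: streaming output replaces the ThinkingAnimation spinner, so users see real content instead of a spinner
TestModel fallback: when run_stream() returns None, the code falls back to run(), which keeps CI tests passing
Truncation policy: tool results are truncated to 200 characters so that long output does not flood the screen
Token statistics: the 📥 input / 📤 output tokens line is kept

Phase 2.5c deliverable: a streaming execution experience that shows thinking content and tool calls in real time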

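Phase 3: Evaluation and Hardening (Weeks 11-18)

Goal: 80 Contract-Bench tasks + all baseline experiments + ablation studies

Week 11-13: Extending the Benchmark

[x] Extend Contract-Bench to 10 tasks covering several difficulty levels
- Easy (3): CB-01 JWT Auth, CB-04 User CRUD, CB-05 File Upload
- Medium (4): CB-02 Todo API, CB-06 Payment Gateway, CB-07 Notifications, CB-08 Search Engine
- Hard (3): CB-03 Chat WebSocket, CB-09 Microservices Order, CB-10 Collaborative Editor
[x] Enhance the evaluator:
- check_endpoints(): extracts FastAPI endpoints via AST and compares them with the contract definitions
- check_code_quality(): ruff + mypy checks
- run_pytest_with_coverage(): test execution + coverage extraction
- default_score(): four components weighted 25% each + coverage bonus (up to +0.1)
[ ] Extend to 20+ tasks (later iteration)
[x] SWE-bench-style task validation: created a realistic bug-fix task (validate_email rejects addresses with a + tag); the backend agent fixed it and passed all tests
[ ] Select a SWE-bench Multi subset (about 50 cross-module tasks)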
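
The default_score() rule in the evaluator list (four equally weighted components plus a coverage bonus of at most 0.1) can be written out as below. The plan does not name the four components or say how the bonus scales, so this sketch assumes four pass/fail checks and a bonus that grows linearly with coverage; the signature is hypothetical.

```python
def default_score(checks, coverage):
    """Score = 0.25 per passed check + up to 0.1 bonus for test coverage.

    checks:   four booleans, one per weighted component (names not specified
              in the plan, so they are treated as anonymous here)
    coverage: fraction of lines covered, clamped to the range 0.0 to 1.0
    """
    if len(checks) != 4:
        raise ValueError("expected exactly four weighted checks")
    base = 0.25 * sum(1 for passed in checks if passed)
    bonus = 0.1 * min(max(coverage, 0.0), 1.0)  # assumed linear in coverage
    return base + bonus
```

Under these assumptions a run that passes two of the four checks with 50% coverage scores 0.5 + 0.05, and a perfect run with full coverage reaches the maximum of 1.1.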

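Week 14-16: Baseline Implementations

[x] Single-agent baseline: SingleAgentBaseline, where one backend agent handles the whole task
[x] Message-passing multi-agent baseline (replaces AutoGen): MessagePassingBaseline, where multiple agents exchange messages through an orchestrator with no SSOT
[x] Eval CLI: maestro eval <task> runs MAESTRO-SSOT plus the two baselines in one command and outputs a comparison table + JSON
[x] Data collection pipeline: maestro benchmark runs Contract-Bench tasks in batch and collects success/score/time/tokens/iterations into CSV/JSON
[ ] Agyn baseline (if the code is available; otherwise a conceptual comparison)
[x] Eval --method filter: run only a specific method (e.g. --method maestro-ssot)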

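Week 17-18: Ablation Studies + Stability Hardening

[x] Stability hardening (implemented ahead of schedule)
- src/maestro/resilience.py: timeout wrapper + exponential-backoff retry
- MaestroAgent.run(timeout=...): 120s timeout per agent call
- AgentStreamLogger.run(timeout=...): 180s timeout for streaming
- LLMController.decide(timeout=..., max_retries=2): retries for Controller decisions
- AutoLoop(timeout_seconds=600): global wall-clock timeout of 600s
- is_retryable_error(): distinguishes transient errors (rate limit/timeout) from non-retryable ones
[x] Ablation 1: remove SSOT, using MessagePassingBaseline (message-passing multi-agent)
[x] Ablation 2: remove Harness, using NoHarness (ACL/budget/validation all bypassed)
[x] Ablation 3: remove Auto-Loop, using NoLoopBaseline (a single plan→assign→execute pass, no iteration)
[x] Statistical significance tests: scripts/stats_analysis.py
[x] Controller ANSWER removed: stops the Controller from taking a shortcut on simple tasks; every task must go through agent execution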
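
The timeout wrapper and exponential-backoff retry from the hardening list can be sketched with asyncio alone. The `RateLimitError` class, the `call_with_retry` name, and the delay schedule are illustrative assumptions; real provider SDKs raise their own exception types, which is_retryable_error() would need to recognise.

```python
import asyncio


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's rate-limit exception."""


def is_retryable_error(exc):
    # Transient failures are worth retrying; everything else is raised at once.
    return isinstance(exc, (asyncio.TimeoutError, TimeoutError, RateLimitError))


async def call_with_retry(make_call, timeout=120.0, max_retries=2, base_delay=0.5):
    """Await make_call() under a timeout, retrying transient errors with backoff.

    make_call is a zero-argument function returning a fresh coroutine, because
    a coroutine object cannot be awaited twice.
    """
    for attempt in range(max_retries + 1):
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout)
        except Exception as exc:
            if attempt == max_retries or not is_retryable_error(exc):
                raise
            # Delays grow as base_delay * 1, 2, 4, ...
            await asyncio.sleep(base_delay * 2**attempt)
```

With `max_retries=2` the call is attempted at most three times, which matches the retry count listed for LLMController.decide().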

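Phase 3.6: Exposing File/Shell Tools to Agents (critical patch)

Goal: let agents actually read and write files and run commands, moving from "simulated" to "real".

Completed

[x] AgentPermissions extension: add read_files, write_files, execute_shell
[x] New tools in tools.py: file_read, file_write, shell_exec (via Harness.execute_action)
[x] Harness FILE_READ support: execute_action() handles FILE_READ → sandbox.read_file()
[x] Validator ACL extension: FILE_READ is included in the path allowlist check
[x] Harness path normalization: absolute paths are converted to paths relative to sandbox_root so that ACL patterns match correctly
[x] Config template update: test/review agents get complete filesystem/command permissions
[x] Agent system prompt update: backend/frontend/test/review are all told about the available file/command tools
[x] Test coverage: Harness FILE_READ/roundtrip/shell_exec + agent permission control + ACL blocking
[x] Mock agent signature fix: run() accepts **kwargs, compatible with the controller passing timeout
[x] Harness policy fallback: an unknown agent ID (e.g. backend-test) falls back to its role prefix (backend_agent)
[x] Automatic policy injection into the system prompt: base.py injects allowed/denied paths and commands into the agent system prompt
[x] Root requirement status fix: system/controller agents may update the root requirement status
[x] Dynamic dependency update API: update_requirement_dependencies() + the ssot_update_requirement_dependencies agent tool
[x] Requirement ID injected into the agent prompt: _do_exec tells the agent explicitly which task's status to update
[x] ruff + mypy all pass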

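Key design decisions:

The tool layer calls harness.execute_action(), reusing the existing ACL/Budget/Validation pipeline
All Action model fields have defaults, allowing flexible construction (shell commands need no target/action_type)
Path normalization happens in the Harness layer, so agents may pass absolute or relative paths
LocalSandbox resolves relative paths against sandbox_root, so file operations run in the correct directory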
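
The path normalization described above can be illustrated with pathlib. The function name `normalize_path` is hypothetical, and rejecting paths that resolve outside sandbox_root is an added assumption; the plan only states that absolute paths are rewritten relative to the root.

```python
import tempfile
from pathlib import Path


def normalize_path(path, sandbox_root):
    """Return path relative to sandbox_root, in POSIX form, for ACL matching.

    Accepts absolute or relative input. Raises PermissionError if the
    resolved path lies outside the sandbox (an assumption of this sketch).
    """
    root = Path(sandbox_root).resolve()
    candidate = Path(path)
    if not candidate.is_absolute():
        candidate = root / candidate
    resolved = candidate.resolve()  # collapses ".." and follows symlinks
    try:
        return resolved.relative_to(root).as_posix()
    except ValueError:
        raise PermissionError(f"{path!r} is outside the sandbox root") from None


root = tempfile.mkdtemp()
from_absolute = normalize_path(Path(root) / "src" / "app.py", root)
from_relative = normalize_path("src/../tests/test_app.py", root)
```

Both forms end up as a root-relative string, so an ACL pattern such as `src/**` sees the same path regardless of how the agent spelled it.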

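Phase 3 deliverable: complete experimental results with statistical significance; agents can actually read and write files and execute commands

Phase 3.5: System Polish (artifact reproducibility)

Implemented

Dockerfile: reproducible environment based on python:3.12-slim + uv
docker-compose.yml: runs the benchmark and exports results in one step
README improvements:
- Docker usage guide
- maestro eval / maestro benchmark command documentation
- Contract-Bench task list and usage
- Configuration notes (thinking mode, Harness Policy)
examples/todo-app/: end-to-end example project configuration

Phase 3.5 deliverable: a Dockerized, well-documented, reproducible experiment environment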

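Phase 4: Paper Writing (Weeks 19-26)

Goal: complete the paper draft + open-source artifact package

Week 19-22: First Draft

[x] Introduction: problem motivation, core contributions (draft)
[x] Related Work: single-agent/multi-agent SE, shared state, safety guards (draft)
[x] Method: system architecture, SSOT design, Harness design, Auto-Loop design (draft)
[ ] Experiments: full experimental setup, results (tables + figures), discussion

Week 23-24: Artifact Package Preparation

[ ] Code cleanup and documentation
[ ] One-command benchmark execution script
[ ] Dockerized experiment environment (reproducibility)
[ ] Detailed usage guide in the README

Week 25-26: Internal Review and Submission

[ ] Internal review and revision
[ ] Formatting for the target venue
[ ] Submit to ICSE/FSE/ASE

Phase 4 deliverable: paper draft complete and ready for submission

Phase 5: Buffer and Revision (Weeks 27-36)

[ ] Revise according to reviewer feedback
[ ] Additional experiments (if needed)
[ ] Prepare the camera-ready version
[ ] Prepare for artifact evaluation

Phase 5 deliverable: published paper

Key Interface Design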

SSOTHub API

python

class SSOTHub:
    def __init__(self, db_path: str) -> None: ...
    def initialize(self, requirement: str) -> RequirementNode: ...
    # Requirements
    def add_requirement(self, parent_id: str | None, desc: str) -> RequirementNode: ...
    def get_requirement(self, node_id: str) -> RequirementNode | None: ...
    def update_requirement_status(self, node_id: str, status: ReqStatus, agent_id: str) -> None: ...
    def claim_requirement(self, node_id: str, agent_id: str) -> None: ...
    def list_pending(self) -> list[RequirementNode]: ...
    # Contracts
    def register_contract(self, name: str, owner: str, schema: dict) -> Contract: ...
    def get_contract(self, contract_id: str) -> Contract | None: ...
    def update_contract_schema(self, contract_id: str, new_schema: dict, agent_id: str) -> Contract: ...
    # Execution Log
    def log_action(self, entry: ExecutionLogEntry) -> None: ...
    def query_log(self, agent_id: str | None = None) -> list[ExecutionLogEntry]: ...
    # Memory
    def write_memory(self, agent_id: str, key: str, value: str) -> None: ...
    def read_memory(self, agent_id: str, key: str) -> str | None: ...
    # Versioning
    def create_snapshot(self, description: str) -> SSOTVersion: ...
    def restore_snapshot(self, version_id: str) -> None: ...
    def export_yaml(self, directory: str) -> None: ...

Harness API

python

class Harness:
    def __init__(self, policy_path: str) -> None: ...           # file-path form (retained)
    @classmethod
    def from_policy(cls, policy: HarnessPolicy, sandbox_root: str = ".") -> Harness: ...
    def validate_action(self, agent_id: str, action: Action) -> ValidationResult: ...
    def execute_action(self, agent_id: str, action: Action) -> ActionResult: ...
    def rollback(self, agent_id: str) -> None: ...
    def check_budget(self, agent_id: str) -> BudgetStatus: ...

Agent Base API

python

class MaestroAgent:
    def __init__(self, agent_id: str, role: str, hub: SSOTHub, harness: Harness) -> None: ...
    async def run(self, prompt: str) -> AgentResult: ...

AutoLoop API

python

class AutoLoop:
    def __init__(self, hub: SSOTHub, harness: Harness, agents: list[MaestroAgent]) -> None: ...
    async def run(self, requirement: str) -> LoopResult: ...
    @property
    def state(self) -> LoopState: ...

CI/CD Configuration

.github/workflows/ci.yml: triggered on push/PR

lint: uv run ruff check . + uv run ruff format --check .
type-check: uv run mypy libs/
test: uv run pytest --cov --cov-report=xml -v

Tool configuration (root pyproject.toml):

ruff: target py312, line-length 100, select E/F/W/I/N/UP/B/A/SIM/TCH
mypy: strict mode, python 3.12
pytest: asyncio mode, coverage >= 80%

Verification Plan

Phase 1 Verification

bash

uv sync                                    # install all dependencies
uv run ruff check .                        # lint passes
uv run mypy libs/                          # type check passes
uv run pytest --cov                        # all tests pass, coverage >= 80%
uv run maestro demo                        # demo: two agents read and write the SSOT

Phase 2 Verification

bash

uv run maestro run --task "Implement a REST API with JWT auth"  # full Auto-Loop run
uv run maestro status                    # show SSOT state (auto-initializes the project)
uv run maestro                           # interactive REPL
uv run pytest tests/test_integration.py    # integration tests pass

# Multi-provider LLM verification
export ANTHROPIC_API_KEY="..."
uv run maestro run --task "Build a todo API" --model "claude-sonnet-4-20250514"
uv run maestro run --task "Hello world" --model "llama3.2"          # local Ollama
uv run pytest tests/test_llm_integration.py                           # LLM configuration tests

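Phase 3 Verification

bash

uv run maestro bench --all-baselines       # run the full baseline comparison
uv run maestro bench --ablation            # ablation experiments

Risks and Mitigations

Risk	Mitigation
Unstable PydanticAI API	Pin the version; keep state machine logic encapsulated inside the controller
SQLite concurrent-write bottleneck	Acceptable in Phase 1; LockManager prevents semantic conflicts at the application layer; migration to PostgreSQL is possible in Phase 3
TestModel cannot verify semantics	CI runs TestModel; the demo is run manually against a real LLM
Configuration complexity	All configuration lives in config.toml, grouped by agent, with no hidden files
Docker sandbox scope	Phase 1 uses LocalSandbox; the SandboxProvider protocol allows a seamless switch
Agyn code unavailable	Use AutoGen as the main baseline and compare with Agyn conceptually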
