Files
memabra/docs/ALPHA_ITERATION_1_PLAN.md
2026-04-15 11:06:05 +08:00

12 KiB
Raw Permalink Blame History

memabra Alpha Iteration 1 Plan

For Hermes: continue this plan autonomously in small TDD-driven increments. Each run should complete one or more concrete tasks, update this file's progress section, run targeted tests first, then run the full memabra test suite.

Goal: turn memabra from a showable prototype into a safe self-improving alpha by adding an online learning loop with automatic training, evaluation, gated promotion, and rollback-safe router deployment.

Architecture:

  • Keep the current layered design.
  • Do not replace existing routers; add an orchestration layer around them.
  • Promotion must be benchmark-gated: no automatic router switch without passing evaluation thresholds.
  • Persist every training/promotion attempt as an auditable artifact.

Tech stack:

  • Existing memabra Python package under src/memabra/
  • Existing pytest suite under tests/memabra/
  • Existing persistence via JSON artifacts; keep it simple for alpha

Acceptance criteria

Alpha Iteration 1 is complete when memabra can:

  1. detect newly accumulated trajectories
  2. build a training dataset from eligible trajectories
  3. train a challenger router automatically
  4. run challenger vs baseline on a fixed benchmark set
  5. promote challenger only if thresholds are met
  6. save a versioned promoted router
  7. keep an auditable training/promotion report
  8. leave the currently active router unchanged when challenger loses

Implementation phases

Phase A — Benchmark-gated online learning loop

Task A1: Add a promotion policy object

Objective: define explicit acceptance rules for promoting a challenger router.

Files:

  • Create: src/memabra/promotion.py
  • Create: tests/memabra/test_promotion.py

Required behavior:

  • Define a PromotionPolicy dataclass
  • Inputs should include at least:
    • min_reward_delta
    • max_error_rate_increase
    • max_latency_increase_ms
    • required_task_count
  • Provide evaluate(baseline, challenger) -> PromotionDecision
  • PromotionDecision should include:
    • accepted: bool
    • reasons: list[str]
    • metrics: dict

TDD steps:

  1. Write failing tests for accepted and rejected cases.
  2. Run targeted tests and verify failure.
  3. Implement minimal policy logic.
  4. Re-run targeted tests.
  5. Re-run full memabra suite.

Task A2: Add benchmark suite persistence

Objective: store and load a fixed benchmark task set for repeatable evaluations.

Files:

  • Create: src/memabra/benchmarks.py
  • Create: tests/memabra/test_benchmarks.py

Required behavior:

  • Define a serializable benchmark suite format
  • Load/save benchmark tasks from JSON
  • Provide a default benchmark seed for memory/tool/skill/composite coverage

TDD steps:

  1. Write failing benchmark round-trip tests.
  2. Verify RED.
  3. Implement load/save helpers.
  4. Verify GREEN.
  5. Run full suite.

Task A3: Add online training coordinator

Objective: orchestrate dataset selection, training, evaluation, and promotion.

Files:

  • Create: src/memabra/online_learning.py
  • Create: tests/memabra/test_online_learning.py

Required behavior:

  • Define OnlineLearningCoordinator
  • It should:
    • query trajectories from ArtifactIndex
    • enforce minimum new trajectory count
    • train a challenger with DatasetBuilder
    • evaluate challenger with Evaluator
    • apply PromotionPolicy
    • save promoted routers via RouterVersionStore
    • emit a structured report whether accepted or rejected

TDD steps:

  1. Write failing tests for:
  • skip when too few new trajectories
  • reject when policy fails
  • accept and save version when policy passes
  1. Verify failure.
  2. Implement minimal coordinator.
  3. Verify targeted tests.
  4. Run full suite.

Phase B — Auditability and safe deployment

Task B1: Add training run reports

Objective: persist every online-learning attempt, not just successful promotions.

Files:

  • Extend: src/memabra/persistence.py or create src/memabra/training_reports.py
  • Create: tests/memabra/test_training_reports.py

Required behavior:

  • Save a JSON report per training run
  • Include:
    • timestamp
    • source trajectory ids
    • sample count
    • baseline metrics
    • challenger metrics
    • promotion decision
    • promoted version id if any

Task B2: Add active router metadata tracking

Objective: make it obvious which router is active and why.

Files:

  • Extend: src/memabra/router_versioning.py
  • Extend: tests/memabra/test_router_versioning.py

Required behavior:

  • Track metadata for current active router
  • Record promotion source, benchmark result summary, and prior version
  • Make rollback preserve audit trail

Phase C — Product surface and automation

Task C1: Add app-level online learning entrypoint

Objective: expose one-call retrain/evaluate/promote behavior from MemabraApp.

Files:

  • Extend: src/memabra/app.py
  • Extend: tests/memabra/test_app.py

Required behavior:

  • Add a method like run_online_learning_cycle(...)
  • Return a structured result dict/report

Task C2: Add CLI entrypoint for the alpha loop

Objective: make the safe online-learning loop runnable from the command line.

Files:

  • Extend: src/memabra/cli.py
  • Extend: tests/memabra/test_cli_workflow.py
  • Update: docs/projects/memabra/DEMO.md

Required behavior:

  • Add a callable workflow that:
    • seeds or uses existing artifacts
    • runs one online-learning cycle
    • prints the report JSON

Task C3: Update docs and wrap-up materials

Objective: document the alpha loop clearly.

Files:

  • Update: docs/projects/memabra/PROGRESS.md
  • Update: docs/projects/memabra/ROADMAP.md
  • Update: docs/projects/memabra/DEMO.md
  • Optional: create docs/projects/memabra/ONLINE_LEARNING.md

Required behavior:

  • Explain promotion gates
  • Explain how to run one cycle manually
  • Explain where reports and versions are stored

Suggested run order for autonomous 20-minute cycles

Cycle group 1:

  • A1 promotion policy
  • A2 benchmark suite persistence

Cycle group 2:

  • A3 online training coordinator

Cycle group 3:

  • B1 training run reports
  • B2 active router metadata tracking

Cycle group 4:

  • C1 app-level entrypoint
  • C2 CLI workflow
  • C3 docs cleanup

Estimated autonomous runs

Recommended initial budget: 18 runs at every 20 minutes.

Reasoning:

  • 3 to 4 runs for Phase A
  • 3 to 4 runs for Phase B
  • 2 to 3 runs for Phase C
  • remaining runs as slack for regression fixes, docs cleanup, and one or two extra quality passes

At 20 minutes per run, 18 runs gives about 6 hours of autonomous iteration, which is a reasonable overnight alpha push.


Progress tracker

  • Task A1 — promotion policy
  • Task A2 — benchmark suite persistence
  • Task A3 — online training coordinator
  • Task B1 — training run reports
  • Task B2 — active router metadata tracking
  • Task C1 — app-level online learning entrypoint
  • Task C2 — CLI online learning workflow
  • Task C3 — docs cleanup and operator guidance
  • Task D1 — baseline version selection for online learning
  • Task E1 — task case index for episodic retrieval

Run log

  • 2026-04-14: Plan created. Ready for autonomous overnight execution.
  • 2026-04-14 22:52 UTC: Completed Tasks A1A3. Promotion policy, benchmark persistence, and online training coordinator implemented with tests. Full suite: 71 passed.
  • 2026-04-14 23:22 UTC: Completed Tasks B1C3. Training reports, active router metadata tracking, app/CLI entrypoints, and docs implemented with tests. Full suite: 78 passed.
  • 2026-04-14 23:24 UTC: Quality pass — CLI main() now defaults to online-learning workflow, fixed schema test resource warning, added missing alpha module exports to package init.py. Full suite: 82 passed.
  • 2026-04-14 23:50 UTC: Docs and repo hygiene pass — updated DEMO.md and ONLINE_LEARNING.md to reflect that python -m src.memabra.cli runs the online-learning workflow; added docs/projects/memabra/demo-artifacts/ to .gitignore; verified CLI end-to-end (promoted=true, version saved, report emitted). Full suite: 82 passed.
  • 2026-04-15 00:49 UTC: Safety and usability pass — added exception handling in OnlineLearningCoordinator so training/evaluation failures emit error reports instead of crashing; added CLI argument parsing (--base-dir, --min-new-trajectories); fixed python -m src.memabra.cli RuntimeWarning via lazy cli import; added TrainingReportStore.get_report() for by-id lookup; exported BenchmarkTask from package __init__.py; updated DEMO.md and ONLINE_LEARNING.md. Full suite: 88 passed.
  • 2026-04-15 01:15 UTC: Repo hygiene and commit pass — verified end-to-end CLI workflow produced a promoted router, version, and report; updated .gitignore to exclude runtime artifact directories (router-versions/, training-reports/); committed entire memabra alpha codebase (67 files, 6,818 insertions). Full suite: 88 passed.
  • 2026-04-15 02:00 UTC: Persistence pass — OnlineLearningCoordinator now supports seen_trajectory_store to persist seen trajectory IDs across restarts, preventing duplicate retraining in cron jobs. Added test_coordinator_persists_seen_trajectory_ids_across_restarts. Fixed evaluation leakage by refreshing the artifact index after benchmarking and marking post-evaluation trajectories as seen. Wired seen_trajectory_store through app.py and cli.py; CLI now defaults to <base-dir>/seen-trajectories.json. Added corresponding tests. Full suite: 91 passed.
  • 2026-04-15 02:27 UTC: Dry-run pass — committed pending persistence-pass changes, then added --dry-run CLI flag and dry_run parameter through the full stack (OnlineLearningCoordinator, app.py, cli.py). In dry-run mode training and evaluation execute but promotion and version saving are skipped; an audit report is still emitted with dry_run: true. Added test_coordinator_dry_run_does_not_promote_or_save_version and test_main_entrypoint_passes_dry_run_flag. Updated ONLINE_LEARNING.md. Full suite: 93 passed.
  • 2026-04-15 02:51 UTC: Baseline-version pass — added baseline_version_id parameter to OnlineLearningCoordinator.run_cycle(), MemabraApp.run_online_learning_cycle(), and CLI --baseline-version flag. This lets operators evaluate a challenger against a specific saved router version rather than the currently active one. Added tests for coordinator, app, and CLI. Updated ONLINE_LEARNING.md. Full suite: 96 passed.
  • 2026-04-15 03:18 UTC: Verification pass — confirmed all tasks A1D1 are complete and stable. Ran full memabra suite (96 passed) and end-to-end CLI workflow (promoted=true, version saved, report emitted). No code changes required; repo is clean and ready for operator review.
  • 2026-04-15 04:02 UTC: Started Phase E — added CaseIndex (src/memabra/case_index.py) for task-level episodic retrieval. Maps normalized task inputs to the highest-reward trajectory ID, with JSON save/load. Added tests/memabra/test_case_index.py (4 tests). Full suite: 100 passed.
  • 2026-04-15 04:27 UTC: Integrated CaseIndex into MemabraApp and MemabraRunner for episodic retrieval. Added app-level methods (build_case_index, save_case_index, load_case_index, best_trajectory_for). Runner now injects an episodic memory candidate when a case index hit occurs. Added CLI flags --case-index and --rebuild-case-index. Updated docs. Full suite: 107 passed.
  • 2026-04-15 04:54 UTC: Added case_index_path support to OnlineLearningCoordinator so the case index is automatically rebuilt after each online-learning cycle (including benchmark-generated trajectories). Wired parameter through app.py and cli.py. Added tests for coordinator, app, and CLI. Full suite: 110 passed.
  • 2026-04-15 05:18 UTC: Added TrajectorySummarizer (src/memabra/trajectory_summary.py) for generating human-readable trajectory summaries. Integrated summarizer into MemabraRunner so episodic memory candidates contain rich summaries when a persistence_store is available. Added tests/memabra/test_trajectory_summary.py (4 tests) and updated runner test. Full suite: 114 passed.
  • 2026-04-15 05:42 UTC: Added CLI --status flag (src/memabra/cli.py) to print current system state (active router version, version count, trajectory count, report count, latest report summary) without running a learning cycle. Added tests/memabra/test_cli_workflow.py::test_main_status_flag_prints_status_and_skips_workflow. Full suite: 115 passed.
  • 2026-04-15 06:05 UTC: Added CLI --rollback and --list-versions flags for operator-safe router version management. Added error handling for missing rollback targets (exits 1 with clean message). Added corresponding tests. Full suite: 118 passed. Updated ONLINE_LEARNING.md and DEMO.md documentation.