memabra Alpha Iteration 1 Plan
For Hermes: continue this plan autonomously in small TDD-driven increments. Each run should complete one or more concrete tasks, update this file's progress section, run targeted tests first, then run the full memabra test suite.
Goal: turn memabra from a showable prototype into a safe self-improving alpha by adding an online learning loop with automatic training, evaluation, gated promotion, and rollback-safe router deployment.
Architecture:
- Keep the current layered design.
- Do not replace existing routers; add an orchestration layer around them.
- Promotion must be benchmark-gated: no automatic router switch without passing evaluation thresholds.
- Persist every training/promotion attempt as an auditable artifact.
Tech stack:
- Existing memabra Python package under `src/memabra/`
- Existing pytest suite under `tests/memabra/`
- Existing persistence via JSON artifacts; keep it simple for alpha
Acceptance criteria
Alpha Iteration 1 is complete when memabra can:
- detect newly accumulated trajectories
- build a training dataset from eligible trajectories
- train a challenger router automatically
- run challenger vs baseline on a fixed benchmark set
- promote challenger only if thresholds are met
- save a versioned promoted router
- keep an auditable training/promotion report
- leave the currently active router unchanged when challenger loses
Implementation phases
Phase A — Benchmark-gated online learning loop
Task A1: Add a promotion policy object
Objective: define explicit acceptance rules for promoting a challenger router.
Files:
- Create: `src/memabra/promotion.py`
- Create: `tests/memabra/test_promotion.py`
Required behavior:
- Define a `PromotionPolicy` dataclass
- Inputs should include at least: `min_reward_delta`, `max_error_rate_increase`, `max_latency_increase_ms`, `required_task_count`
- Provide `evaluate(baseline, challenger) -> PromotionDecision`
- `PromotionDecision` should include: `accepted: bool`, `reasons: list[str]`, `metrics: dict`
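The required behavior above can be sketched as a pair of dataclasses. This is a hypothetical shape, not memabra's actual code; in particular, the metric dict keys (`reward`, `error_rate`, `latency_ms`, `task_count`) are illustrative assumptions about what the evaluator reports.

```python
from dataclasses import dataclass, field


@dataclass
class PromotionDecision:
    accepted: bool
    reasons: list[str]
    metrics: dict


@dataclass
class PromotionPolicy:
    min_reward_delta: float = 0.0
    max_error_rate_increase: float = 0.0
    max_latency_increase_ms: float = 50.0
    required_task_count: int = 10

    def evaluate(self, baseline: dict, challenger: dict) -> PromotionDecision:
        """Gate promotion on every threshold; any violation rejects."""
        reasons = []
        reward_delta = challenger["reward"] - baseline["reward"]
        if reward_delta < self.min_reward_delta:
            reasons.append(f"reward delta {reward_delta:.3f} below {self.min_reward_delta}")
        error_increase = challenger["error_rate"] - baseline["error_rate"]
        if error_increase > self.max_error_rate_increase:
            reasons.append(f"error rate increased by {error_increase:.3f}")
        latency_increase = challenger["latency_ms"] - baseline["latency_ms"]
        if latency_increase > self.max_latency_increase_ms:
            reasons.append(f"latency increased by {latency_increase:.1f} ms")
        if challenger["task_count"] < self.required_task_count:
            reasons.append(f"only {challenger['task_count']} tasks evaluated")
        metrics = {
            "reward_delta": reward_delta,
            "error_rate_increase": error_increase,
            "latency_increase_ms": latency_increase,
        }
        return PromotionDecision(accepted=not reasons, reasons=reasons, metrics=metrics)
```

Note that an empty `reasons` list doubles as the acceptance signal, which keeps every rejection self-documenting for the audit report.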
TDD steps:
- Write failing tests for accepted and rejected cases.
- Run targeted tests and verify failure.
- Implement minimal policy logic.
- Re-run targeted tests.
- Re-run full memabra suite.
Task A2: Add benchmark suite persistence
Objective: store and load a fixed benchmark task set for repeatable evaluations.
Files:
- Create: `src/memabra/benchmarks.py`
- Create: `tests/memabra/test_benchmarks.py`
Required behavior:
- Define a serializable benchmark suite format
- Load/save benchmark tasks from JSON
- Provide a default benchmark seed for memory/tool/skill/composite coverage
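A minimal round-trip sketch for the benchmark suite format, assuming a flat JSON list of task records; the `BenchmarkTask` fields here (`task_id`, `kind`, `prompt`, `expected`) are placeholder assumptions, not the real schema.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class BenchmarkTask:
    task_id: str
    kind: str      # e.g. "memory", "tool", "skill", "composite"
    prompt: str
    expected: str


def save_suite(tasks: list[BenchmarkTask], path: Path) -> None:
    """Serialize the suite as a JSON array of task dicts."""
    path.write_text(json.dumps([asdict(t) for t in tasks], indent=2))


def load_suite(path: Path) -> list[BenchmarkTask]:
    """Rebuild dataclass instances from the JSON array."""
    return [BenchmarkTask(**item) for item in json.loads(path.read_text())]
```

Because dataclasses compare by value, the round-trip test reduces to `load_suite(p) == tasks`.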
TDD steps:
- Write failing benchmark round-trip tests.
- Verify RED.
- Implement load/save helpers.
- Verify GREEN.
- Run full suite.
Task A3: Add online training coordinator
Objective: orchestrate dataset selection, training, evaluation, and promotion.
Files:
- Create: `src/memabra/online_learning.py`
- Create: `tests/memabra/test_online_learning.py`
Required behavior:
- Define `OnlineLearningCoordinator`
- It should:
  - query trajectories from `ArtifactIndex`
  - enforce a minimum new-trajectory count
  - train a challenger with `DatasetBuilder`
  - evaluate the challenger with `Evaluator`
  - apply `PromotionPolicy`
  - save promoted routers via `RouterVersionStore`
  - emit a structured report whether accepted or rejected
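The orchestration above can be sketched as a coordinator with injected collaborators. The collaborator interfaces used here (`new_trajectories()`, `build()`, `train()`, `compare()`, `save()`) are assumptions chosen to keep the sketch testable with fakes, not memabra's real APIs.

```python
class OnlineLearningCoordinator:
    """Runs one train/evaluate/promote cycle; all dependencies injected."""

    def __init__(self, index, dataset_builder, trainer, evaluator,
                 policy, version_store, min_new_trajectories=20):
        self.index = index
        self.dataset_builder = dataset_builder
        self.trainer = trainer
        self.evaluator = evaluator
        self.policy = policy
        self.version_store = version_store
        self.min_new_trajectories = min_new_trajectories

    def run_cycle(self) -> dict:
        # Skip early if too few new trajectories have accumulated.
        trajectories = self.index.new_trajectories()
        if len(trajectories) < self.min_new_trajectories:
            return {"status": "skipped",
                    "reason": f"only {len(trajectories)} new trajectories"}
        dataset = self.dataset_builder.build(trajectories)
        challenger = self.trainer.train(dataset)
        baseline_metrics, challenger_metrics = self.evaluator.compare(challenger)
        decision = self.policy.evaluate(baseline_metrics, challenger_metrics)
        report = {"status": "accepted" if decision.accepted else "rejected",
                  "reasons": decision.reasons}
        # Only an accepted challenger touches the version store; a rejected
        # one leaves the active router untouched, per the acceptance criteria.
        if decision.accepted:
            report["version_id"] = self.version_store.save(challenger)
        return report
```

Dependency injection keeps the three TDD cases below (skip, reject, accept) testable with simple stubs.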
TDD steps:
- Write failing tests for:
- skip when too few new trajectories
- reject when policy fails
- accept and save version when policy passes
- Verify failure.
- Implement minimal coordinator.
- Verify targeted tests.
- Run full suite.
Phase B — Auditability and safe deployment
Task B1: Add training run reports
Objective: persist every online-learning attempt, not just successful promotions.
Files:
- Extend: `src/memabra/persistence.py` or create `src/memabra/training_reports.py`
- Create: `tests/memabra/test_training_reports.py`
Required behavior:
- Save a JSON report per training run
- Include:
- timestamp
- source trajectory ids
- sample count
- baseline metrics
- challenger metrics
- promotion decision
- promoted version id if any
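A sketch of a report writer covering the fields listed above. The function name and file-naming scheme are hypothetical; only the required report fields come from the plan.

```python
import json
import time
import uuid
from pathlib import Path


def write_training_report(report_dir: Path, *, trajectory_ids, sample_count,
                          baseline_metrics, challenger_metrics,
                          decision, promoted_version_id=None) -> Path:
    """Persist one training run as a JSON report; returns the report path."""
    report = {
        "report_id": uuid.uuid4().hex,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "source_trajectory_ids": trajectory_ids,
        "sample_count": sample_count,
        "baseline_metrics": baseline_metrics,
        "challenger_metrics": challenger_metrics,
        "promotion_decision": decision,
        "promoted_version_id": promoted_version_id,  # None when rejected
    }
    report_dir.mkdir(parents=True, exist_ok=True)
    path = report_dir / f"report-{report['report_id']}.json"
    path.write_text(json.dumps(report, indent=2))
    return path
```

Writing a report unconditionally, with `promoted_version_id` left as `None` on rejection, is what makes every attempt auditable rather than only successful promotions.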
Task B2: Add active router metadata tracking
Objective: make it obvious which router is active and why.
Files:
- Extend: `src/memabra/router_versioning.py`
- Extend: `tests/memabra/test_router_versioning.py`
Required behavior:
- Track metadata for current active router
- Record promotion source, benchmark result summary, and prior version
- Make rollback preserve audit trail
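One way to satisfy all three behaviors with a single JSON metadata file: each activation pushes the previous active record onto a history list, and rollback is just another activation with `source="rollback"`. The file layout and function names are illustrative assumptions.

```python
import json
from pathlib import Path


def set_active(meta_path, version_id, *, source, benchmark_summary,
               prior_version=None):
    """Record the active router and append the old record to the audit history."""
    history = []
    if Path(meta_path).exists():
        old = json.loads(Path(meta_path).read_text())
        history = old.get("history", []) + [old["active"]]
    meta = {
        "active": {
            "version_id": version_id,
            "source": source,                      # "promotion" or "rollback"
            "benchmark_summary": benchmark_summary,
            "prior_version": prior_version,
        },
        "history": history,
    }
    Path(meta_path).write_text(json.dumps(meta, indent=2))


def rollback(meta_path):
    """Reactivate the prior version; the audit trail grows rather than resets."""
    meta = json.loads(Path(meta_path).read_text())
    prior = meta["active"]["prior_version"]
    set_active(meta_path, prior, source="rollback", benchmark_summary={},
               prior_version=meta["active"]["version_id"])
    return prior
```

Because rollback reuses `set_active`, the rolled-back-from version stays in `history` instead of being overwritten.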
Phase C — Product surface and automation
Task C1: Add app-level online learning entrypoint
Objective: expose one-call retrain/evaluate/promote behavior from MemabraApp.
Files:
- Extend: `src/memabra/app.py`
- Extend: `tests/memabra/test_app.py`
Required behavior:
- Add a method like `run_online_learning_cycle(...)`
- Return a structured result dict/report
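At the app level this can be a thin delegating wrapper; the real `MemabraApp` has far more surface, so treat this as a shape sketch only.

```python
class MemabraApp:
    """Illustrative fragment: only the online-learning entrypoint is shown."""

    def __init__(self, coordinator):
        self.coordinator = coordinator

    def run_online_learning_cycle(self, **overrides) -> dict:
        # Delegate to the coordinator and always return a plain dict so
        # callers (CLI, cron jobs) can serialize the result directly.
        return dict(self.coordinator.run_cycle(**overrides))
```

Keeping the return type a plain dict is deliberate: the CLI in Task C2 can then `json.dumps` it without any extra conversion layer.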
Task C2: Add CLI entrypoint for the alpha loop
Objective: make the safe online-learning loop runnable from the command line.
Files:
- Extend: `src/memabra/cli.py`
- Extend: `tests/memabra/test_cli_workflow.py`
- Update: `docs/projects/memabra/DEMO.md`
Required behavior:
- Add a callable workflow that:
- seeds or uses existing artifacts
- runs one online-learning cycle
- prints the report JSON
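A sketch of the CLI shell around the workflow, using `argparse`. The flag names mirror ones mentioned later in this plan, but the wiring here is hypothetical; the real entrypoint would construct `MemabraApp` and run a cycle where the placeholder comment sits.

```python
import argparse
import json


def main(argv=None) -> int:
    """Parse flags, run one online-learning cycle, print the report JSON."""
    parser = argparse.ArgumentParser(prog="memabra")
    parser.add_argument("--base-dir", default=".")
    parser.add_argument("--min-new-trajectories", type=int, default=20)
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args(argv)
    # Placeholder: in the real CLI this would seed or reuse artifacts under
    # args.base_dir and call app.run_online_learning_cycle(...); here we
    # just echo the parsed configuration as the report skeleton.
    report = {
        "base_dir": args.base_dir,
        "min_new_trajectories": args.min_new_trajectories,
        "dry_run": args.dry_run,
    }
    print(json.dumps(report, indent=2))
    return 0
```

Accepting `argv` as a parameter (rather than reading `sys.argv` directly) is what makes the workflow testable from `tests/memabra/test_cli_workflow.py` without subprocess calls.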
Task C3: Update docs and wrap-up materials
Objective: document the alpha loop clearly.
Files:
- Update: `docs/projects/memabra/PROGRESS.md`
- Update: `docs/projects/memabra/ROADMAP.md`
- Update: `docs/projects/memabra/DEMO.md`
- Optional: create `docs/projects/memabra/ONLINE_LEARNING.md`
Required behavior:
- Explain promotion gates
- Explain how to run one cycle manually
- Explain where reports and versions are stored
Suggested run order for autonomous 20-minute cycles
Cycle group 1:
- A1 promotion policy
- A2 benchmark suite persistence
Cycle group 2:
- A3 online training coordinator
Cycle group 3:
- B1 training run reports
- B2 active router metadata tracking
Cycle group 4:
- C1 app-level entrypoint
- C2 CLI workflow
- C3 docs cleanup
Estimated autonomous runs
Recommended initial budget: 18 runs of 20 minutes each.
Reasoning:
- 3 to 4 runs for Phase A
- 3 to 4 runs for Phase B
- 2 to 3 runs for Phase C
- remaining runs as slack for regression fixes, docs cleanup, and one or two extra quality passes
At 20 minutes per run, 18 runs gives about 6 hours of autonomous iteration, which is a reasonable overnight alpha push.
Progress tracker
- Task A1 — promotion policy
- Task A2 — benchmark suite persistence
- Task A3 — online training coordinator
- Task B1 — training run reports
- Task B2 — active router metadata tracking
- Task C1 — app-level online learning entrypoint
- Task C2 — CLI online learning workflow
- Task C3 — docs cleanup and operator guidance
- Task D1 — baseline version selection for online learning
- Task E1 — task case index for episodic retrieval
Run log
- 2026-04-14: Plan created. Ready for autonomous overnight execution.
- 2026-04-14 22:52 UTC: Completed Tasks A1–A3. Promotion policy, benchmark persistence, and online training coordinator implemented with tests. Full suite: 71 passed.
- 2026-04-14 23:22 UTC: Completed Tasks B1–C3. Training reports, active router metadata tracking, app/CLI entrypoints, and docs implemented with tests. Full suite: 78 passed.
- 2026-04-14 23:24 UTC: Quality pass — CLI main() now defaults to the online-learning workflow, fixed a schema test resource warning, added missing alpha module exports to the package `__init__.py`. Full suite: 82 passed.
- 2026-04-14 23:50 UTC: Docs and repo hygiene pass — updated DEMO.md and ONLINE_LEARNING.md to reflect that `python -m src.memabra.cli` runs the online-learning workflow; added `docs/projects/memabra/demo-artifacts/` to `.gitignore`; verified CLI end-to-end (promoted=true, version saved, report emitted). Full suite: 82 passed.
- 2026-04-15 00:49 UTC: Safety and usability pass — added exception handling in `OnlineLearningCoordinator` so training/evaluation failures emit error reports instead of crashing; added CLI argument parsing (`--base-dir`, `--min-new-trajectories`); fixed a `python -m src.memabra.cli` RuntimeWarning via lazy `cli` import; added `TrainingReportStore.get_report()` for by-id lookup; exported `BenchmarkTask` from the package `__init__.py`; updated DEMO.md and ONLINE_LEARNING.md. Full suite: 88 passed.
- 2026-04-15 01:15 UTC: Repo hygiene and commit pass — verified the end-to-end CLI workflow produced a promoted router, version, and report; updated `.gitignore` to exclude runtime artifact directories (`router-versions/`, `training-reports/`); committed the entire memabra alpha codebase (67 files, 6,818 insertions). Full suite: 88 passed.
- 2026-04-15 02:00 UTC: Persistence pass — `OnlineLearningCoordinator` now supports `seen_trajectory_store` to persist seen trajectory IDs across restarts, preventing duplicate retraining in cron jobs. Added `test_coordinator_persists_seen_trajectory_ids_across_restarts`. Fixed evaluation leakage by refreshing the artifact index after benchmarking and marking post-evaluation trajectories as seen. Wired `seen_trajectory_store` through `app.py` and `cli.py`; the CLI now defaults to `<base-dir>/seen-trajectories.json`. Added corresponding tests. Full suite: 91 passed.
- 2026-04-15 02:27 UTC: Dry-run pass — committed pending persistence-pass changes, then added a `--dry-run` CLI flag and `dry_run` parameter through the full stack (`OnlineLearningCoordinator`, `app.py`, `cli.py`). In dry-run mode training and evaluation execute but promotion and version saving are skipped; an audit report is still emitted with `dry_run: true`. Added `test_coordinator_dry_run_does_not_promote_or_save_version` and `test_main_entrypoint_passes_dry_run_flag`. Updated ONLINE_LEARNING.md. Full suite: 93 passed.
- 2026-04-15 02:51 UTC: Baseline-version pass — added a `baseline_version_id` parameter to `OnlineLearningCoordinator.run_cycle()` and `MemabraApp.run_online_learning_cycle()`, plus a CLI `--baseline-version` flag. This lets operators evaluate a challenger against a specific saved router version rather than the currently active one. Added tests for coordinator, app, and CLI. Updated ONLINE_LEARNING.md. Full suite: 96 passed.
- 2026-04-15 03:18 UTC: Verification pass — confirmed all tasks A1–D1 are complete and stable. Ran the full memabra suite (96 passed) and the end-to-end CLI workflow (promoted=true, version saved, report emitted). No code changes required; the repo is clean and ready for operator review.
- 2026-04-15 04:02 UTC: Started Phase E — added `CaseIndex` (`src/memabra/case_index.py`) for task-level episodic retrieval. Maps normalized task inputs to the highest-reward trajectory ID, with JSON save/load. Added `tests/memabra/test_case_index.py` (4 tests). Full suite: 100 passed.
- 2026-04-15 04:27 UTC: Integrated `CaseIndex` into `MemabraApp` and `MemabraRunner` for episodic retrieval. Added app-level methods (`build_case_index`, `save_case_index`, `load_case_index`, `best_trajectory_for`). The runner now injects an episodic memory candidate when a case index hit occurs. Added CLI flags `--case-index` and `--rebuild-case-index`. Updated docs. Full suite: 107 passed.
- 2026-04-15 04:54 UTC: Added `case_index_path` support to `OnlineLearningCoordinator` so the case index is automatically rebuilt after each online-learning cycle (including benchmark-generated trajectories). Wired the parameter through `app.py` and `cli.py`. Added tests for coordinator, app, and CLI. Full suite: 110 passed.
- 2026-04-15 05:18 UTC: Added `TrajectorySummarizer` (`src/memabra/trajectory_summary.py`) for generating human-readable trajectory summaries. Integrated the summarizer into `MemabraRunner` so episodic memory candidates contain rich summaries when a `persistence_store` is available. Added `tests/memabra/test_trajectory_summary.py` (4 tests) and updated the runner test. Full suite: 114 passed.
- 2026-04-15 05:42 UTC: Added a CLI `--status` flag (`src/memabra/cli.py`) to print current system state (active router version, version count, trajectory count, report count, latest report summary) without running a learning cycle. Added `tests/memabra/test_cli_workflow.py::test_main_status_flag_prints_status_and_skips_workflow`. Full suite: 115 passed.
- 2026-04-15 06:05 UTC: Added CLI `--rollback` and `--list-versions` flags for operator-safe router version management. Added error handling for missing rollback targets (exits 1 with a clean message). Added corresponding tests. Updated ONLINE_LEARNING.md and DEMO.md. Full suite: 118 passed.