# memabra Alpha Iteration 1 Plan
> For Hermes: continue this plan autonomously in small TDD-driven increments. Each run should complete one or more concrete tasks, update this file's progress section, run targeted tests first, then run the full memabra test suite.

Goal: turn memabra from a showable prototype into a safe self-improving alpha by adding an online learning loop with automatic training, evaluation, gated promotion, and rollback-safe router deployment.

Architecture:

- Keep the current layered design.
- Do not replace existing routers; add an orchestration layer around them.
- Promotion must be benchmark-gated: no automatic router switch without passing evaluation thresholds.
- Persist every training/promotion attempt as an auditable artifact.

Tech stack:

- Existing memabra Python package under `src/memabra/`
- Existing pytest suite under `tests/memabra/`
- Existing persistence via JSON artifacts; keep it simple for alpha
---

## Acceptance criteria

Alpha Iteration 1 is complete when memabra can:

1. detect newly accumulated trajectories
2. build a training dataset from eligible trajectories
3. train a challenger router automatically
4. run challenger vs baseline on a fixed benchmark set
5. promote the challenger only if thresholds are met
6. save a versioned promoted router
7. keep an auditable training/promotion report
8. leave the currently active router unchanged when the challenger loses

---

## Implementation phases

### Phase A — Benchmark-gated online learning loop

#### Task A1: Add a promotion policy object

Objective: define explicit acceptance rules for promoting a challenger router.

Files:

- Create: `src/memabra/promotion.py`
- Create: `tests/memabra/test_promotion.py`

Required behavior:

- Define a `PromotionPolicy` dataclass
- Inputs should include at least:
  - `min_reward_delta`
  - `max_error_rate_increase`
  - `max_latency_increase_ms`
  - `required_task_count`
- Provide `evaluate(baseline, challenger) -> PromotionDecision`
- `PromotionDecision` should include:
  - `accepted: bool`
  - `reasons: list[str]`
  - `metrics: dict`
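The policy above can be sketched as a small dataclass. The metric dict keys used here (`reward`, `error_rate`, `latency_ms`, `task_count`) are illustrative assumptions, not the real evaluator schema:

```python
from dataclasses import dataclass


@dataclass
class PromotionDecision:
    accepted: bool
    reasons: list[str]
    metrics: dict


@dataclass
class PromotionPolicy:
    min_reward_delta: float = 0.0
    max_error_rate_increase: float = 0.0
    max_latency_increase_ms: float = 50.0
    required_task_count: int = 10

    def evaluate(self, baseline: dict, challenger: dict) -> PromotionDecision:
        # Collect one reason per violated gate; accept only if every gate passes.
        reasons: list[str] = []
        reward_delta = challenger["reward"] - baseline["reward"]
        if reward_delta < self.min_reward_delta:
            reasons.append(f"reward delta {reward_delta:.3f} below minimum")
        if challenger["error_rate"] - baseline["error_rate"] > self.max_error_rate_increase:
            reasons.append("error rate increased beyond allowed margin")
        if challenger["latency_ms"] - baseline["latency_ms"] > self.max_latency_increase_ms:
            reasons.append("latency increased beyond allowed margin")
        if challenger["task_count"] < self.required_task_count:
            reasons.append("too few benchmark tasks for a trustworthy comparison")
        return PromotionDecision(
            accepted=not reasons,
            reasons=reasons,
            metrics={"reward_delta": reward_delta},
        )
```

Returning every failed gate (rather than the first) keeps rejection reports actionable for the operator.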

TDD steps:

1. Write failing tests for accepted and rejected cases.
2. Run targeted tests and verify failure.
3. Implement minimal policy logic.
4. Re-run targeted tests.
5. Re-run full memabra suite.

#### Task A2: Add benchmark suite persistence

Objective: store and load a fixed benchmark task set for repeatable evaluations.

Files:

- Create: `src/memabra/benchmarks.py`
- Create: `tests/memabra/test_benchmarks.py`

Required behavior:

- Define a serializable benchmark suite format
- Load/save benchmark tasks from JSON
- Provide a default benchmark seed for memory/tool/skill/composite coverage
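One possible round-trip shape, assuming the suite is a flat JSON list of task records. The field names here are illustrative; the real `BenchmarkTask` schema belongs in `benchmarks.py`:

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class BenchmarkTask:
    # Hypothetical fields for illustration only.
    task_id: str
    kind: str    # e.g. "memory", "tool", "skill", "composite"
    prompt: str


def save_suite(tasks: list[BenchmarkTask], path: Path) -> None:
    # Serialize each dataclass to a plain dict so the file stays human-readable.
    path.write_text(json.dumps([asdict(t) for t in tasks], indent=2))


def load_suite(path: Path) -> list[BenchmarkTask]:
    return [BenchmarkTask(**item) for item in json.loads(path.read_text())]
```

Because dataclasses compare by field values, a save/load round trip can be asserted for equality directly in the tests.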

TDD steps:

1. Write failing benchmark round-trip tests.
2. Verify RED.
3. Implement load/save helpers.
4. Verify GREEN.
5. Run full suite.

#### Task A3: Add online training coordinator

Objective: orchestrate dataset selection, training, evaluation, and promotion.

Files:

- Create: `src/memabra/online_learning.py`
- Create: `tests/memabra/test_online_learning.py`

Required behavior:

- Define `OnlineLearningCoordinator`
- It should:
  - query trajectories from `ArtifactIndex`
  - enforce a minimum new-trajectory count
  - train a challenger with `DatasetBuilder`
  - evaluate the challenger with `Evaluator`
  - apply `PromotionPolicy`
  - save promoted routers via `RouterVersionStore`
  - emit a structured report whether the challenger is accepted or rejected
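The control flow above can be sketched with injected collaborators. The callable interfaces below (`fetch_new`, `train`, `evaluate`, `decide`, `save_version`, `baseline_metrics`) are stand-ins for the real `ArtifactIndex`, `DatasetBuilder`, `Evaluator`, `PromotionPolicy`, and `RouterVersionStore` APIs, which this plan does not specify:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CycleResult:
    status: str   # "skipped", "rejected", or "promoted"
    report: dict


class OnlineLearningCoordinator:
    def __init__(self, min_new_trajectories: int,
                 fetch_new: Callable[[], list],
                 train: Callable[[list], object],
                 evaluate: Callable[[object], dict],
                 decide: Callable[[dict, dict], bool],
                 save_version: Callable[[object], str],
                 baseline_metrics: Callable[[], dict]):
        self.min_new_trajectories = min_new_trajectories
        self.fetch_new = fetch_new
        self.train = train
        self.evaluate = evaluate
        self.decide = decide
        self.save_version = save_version
        self.baseline_metrics = baseline_metrics

    def run_cycle(self) -> CycleResult:
        # Gate 1: do nothing until enough new trajectories have accumulated.
        trajectories = self.fetch_new()
        if len(trajectories) < self.min_new_trajectories:
            return CycleResult("skipped", {"reason": "too few new trajectories"})
        # Train and benchmark a challenger, then apply the promotion policy.
        challenger = self.train(trajectories)
        metrics = self.evaluate(challenger)
        baseline = self.baseline_metrics()
        if not self.decide(baseline, metrics):
            return CycleResult("rejected", {"baseline": baseline, "challenger": metrics})
        # Only a policy-approved challenger is ever saved as a new version.
        version_id = self.save_version(challenger)
        return CycleResult("promoted", {"version_id": version_id, "challenger": metrics})
```

Dependency injection keeps the coordinator trivially testable: each of the three acceptance-criteria branches (skip, reject, promote) can be exercised with lambdas.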

TDD steps:

1. Write failing tests for:
   - skip when too few new trajectories
   - reject when policy fails
   - accept and save version when policy passes
2. Verify failure.
3. Implement minimal coordinator.
4. Verify targeted tests.
5. Run full suite.

### Phase B — Auditability and safe deployment

#### Task B1: Add training run reports

Objective: persist every online-learning attempt, not just successful promotions.

Files:

- Extend: `src/memabra/persistence.py` or create `src/memabra/training_reports.py`
- Create: `tests/memabra/test_training_reports.py`

Required behavior:

- Save a JSON report per training run
- Include:
  - timestamp
  - source trajectory ids
  - sample count
  - baseline metrics
  - challenger metrics
  - promotion decision
  - promoted version id if any
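A minimal sketch of what persisting one such report could look like, with a report dict showing the fields listed above. The file-naming scheme and the `save_training_report` helper are assumptions for illustration, not the real `TrainingReportStore` API:

```python
import json
import time
import uuid
from pathlib import Path


def save_training_report(report_dir: Path, report: dict) -> Path:
    # One JSON file per run, named by timestamp plus a short random id
    # so that back-to-back runs never collide.
    report_dir.mkdir(parents=True, exist_ok=True)
    name = f"{int(time.time())}-{uuid.uuid4().hex[:8]}.json"
    path = report_dir / name
    path.write_text(json.dumps(report, indent=2, sort_keys=True))
    return path


# Example report covering every required field (values are illustrative).
report = {
    "timestamp": "2026-04-15T02:00:00Z",
    "source_trajectory_ids": ["traj-001", "traj-002"],
    "sample_count": 42,
    "baseline_metrics": {"reward": 0.8},
    "challenger_metrics": {"reward": 0.85},
    "promotion_decision": {"accepted": True, "reasons": []},
    "promoted_version_id": "v3",
}
```

Rejected runs would carry the same shape with `"promoted_version_id": None`, which is what makes the audit trail complete rather than success-only.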
#### Task B2: Add active router metadata tracking

Objective: make it obvious which router is active and why.

Files:

- Extend: `src/memabra/router_versioning.py`
- Extend: `tests/memabra/test_router_versioning.py`

Required behavior:

- Track metadata for the current active router
- Record the promotion source, benchmark result summary, and prior version
- Make rollback preserve the audit trail

### Phase C — Product surface and automation

#### Task C1: Add app-level online learning entrypoint

Objective: expose one-call retrain/evaluate/promote behavior from `MemabraApp`.

Files:

- Extend: `src/memabra/app.py`
- Extend: `tests/memabra/test_app.py`

Required behavior:

- Add a method like `run_online_learning_cycle(...)`
- Return a structured result dict/report

#### Task C2: Add CLI entrypoint for the alpha loop

Objective: make the safe online-learning loop runnable from the command line.

Files:

- Extend: `src/memabra/cli.py`
- Extend: `tests/memabra/test_cli_workflow.py`
- Update: `docs/projects/memabra/DEMO.md`

Required behavior:

- Add a callable workflow that:
  - seeds or uses existing artifacts
  - runs one online-learning cycle
  - prints the report JSON

#### Task C3: Update docs and wrap-up materials

Objective: document the alpha loop clearly.

Files:

- Update: `docs/projects/memabra/PROGRESS.md`
- Update: `docs/projects/memabra/ROADMAP.md`
- Update: `docs/projects/memabra/DEMO.md`
- Optional: create `docs/projects/memabra/ONLINE_LEARNING.md`

Required behavior:

- Explain promotion gates
- Explain how to run one cycle manually
- Explain where reports and versions are stored

---

## Suggested run order for autonomous 20-minute cycles

Cycle group 1:

- A1 promotion policy
- A2 benchmark suite persistence

Cycle group 2:

- A3 online training coordinator

Cycle group 3:

- B1 training run reports
- B2 active router metadata tracking

Cycle group 4:

- C1 app-level entrypoint
- C2 CLI workflow
- C3 docs cleanup

---

## Estimated autonomous runs

Recommended initial budget: 18 runs, one every 20 minutes.

Reasoning:

- 3 to 4 runs for Phase A
- 3 to 4 runs for Phase B
- 2 to 3 runs for Phase C
- the remaining runs as slack for regression fixes, docs cleanup, and one or two extra quality passes

At 20 minutes per run, 18 runs gives about 6 hours of autonomous iteration, which is a reasonable overnight alpha push.

---

## Progress tracker

- [x] Task A1 — promotion policy
- [x] Task A2 — benchmark suite persistence
- [x] Task A3 — online training coordinator
- [x] Task B1 — training run reports
- [x] Task B2 — active router metadata tracking
- [x] Task C1 — app-level online learning entrypoint
- [x] Task C2 — CLI online learning workflow
- [x] Task C3 — docs cleanup and operator guidance
- [x] Task D1 — baseline version selection for online learning
- [x] Task E1 — task case index for episodic retrieval

## Run log

- 2026-04-14: Plan created. Ready for autonomous overnight execution.
- 2026-04-14 22:52 UTC: Completed Tasks A1–A3. Promotion policy, benchmark persistence, and online training coordinator implemented with tests. Full suite: 71 passed.
- 2026-04-14 23:22 UTC: Completed Tasks B1–C3. Training reports, active router metadata tracking, app/CLI entrypoints, and docs implemented with tests. Full suite: 78 passed.
- 2026-04-14 23:24 UTC: Quality pass — CLI main() now defaults to online-learning workflow, fixed schema test resource warning, added missing alpha module exports to package __init__.py. Full suite: 82 passed.
- 2026-04-14 23:50 UTC: Docs and repo hygiene pass — updated DEMO.md and ONLINE_LEARNING.md to reflect that `python -m src.memabra.cli` runs the online-learning workflow; added `docs/projects/memabra/demo-artifacts/` to `.gitignore`; verified CLI end-to-end (promoted=true, version saved, report emitted). Full suite: 82 passed.
- 2026-04-15 00:49 UTC: Safety and usability pass — added exception handling in `OnlineLearningCoordinator` so training/evaluation failures emit error reports instead of crashing; added CLI argument parsing (`--base-dir`, `--min-new-trajectories`); fixed `python -m src.memabra.cli` RuntimeWarning via lazy `cli` import; added `TrainingReportStore.get_report()` for by-id lookup; exported `BenchmarkTask` from package `__init__.py`; updated DEMO.md and ONLINE_LEARNING.md. Full suite: 88 passed.
- 2026-04-15 01:15 UTC: Repo hygiene and commit pass — verified end-to-end CLI workflow produced a promoted router, version, and report; updated `.gitignore` to exclude runtime artifact directories (`router-versions/`, `training-reports/`); committed entire memabra alpha codebase (67 files, 6,818 insertions). Full suite: 88 passed.
- 2026-04-15 02:00 UTC: Persistence pass — `OnlineLearningCoordinator` now supports `seen_trajectory_store` to persist seen trajectory IDs across restarts, preventing duplicate retraining in cron jobs. Added `test_coordinator_persists_seen_trajectory_ids_across_restarts`. Fixed evaluation leakage by refreshing the artifact index after benchmarking and marking post-evaluation trajectories as seen. Wired `seen_trajectory_store` through `app.py` and `cli.py`; CLI now defaults to `<base-dir>/seen-trajectories.json`. Added corresponding tests. Full suite: 91 passed.
- 2026-04-15 02:27 UTC: Dry-run pass — committed pending persistence-pass changes, then added `--dry-run` CLI flag and `dry_run` parameter through the full stack (`OnlineLearningCoordinator`, `app.py`, `cli.py`). In dry-run mode training and evaluation execute but promotion and version saving are skipped; an audit report is still emitted with `dry_run: true`. Added `test_coordinator_dry_run_does_not_promote_or_save_version` and `test_main_entrypoint_passes_dry_run_flag`. Updated `ONLINE_LEARNING.md`. Full suite: 93 passed.
- 2026-04-15 02:51 UTC: Baseline-version pass — added `baseline_version_id` parameter to `OnlineLearningCoordinator.run_cycle()`, `MemabraApp.run_online_learning_cycle()`, and CLI `--baseline-version` flag. This lets operators evaluate a challenger against a specific saved router version rather than the currently active one. Added tests for coordinator, app, and CLI. Updated `ONLINE_LEARNING.md`. Full suite: 96 passed.
- 2026-04-15 03:18 UTC: Verification pass — confirmed all tasks A1–D1 are complete and stable. Ran full memabra suite (96 passed) and end-to-end CLI workflow (promoted=true, version saved, report emitted). No code changes required; repo is clean and ready for operator review.
- 2026-04-15 04:02 UTC: Started Phase E — added `CaseIndex` (`src/memabra/case_index.py`) for task-level episodic retrieval. Maps normalized task inputs to the highest-reward trajectory ID, with JSON save/load. Added `tests/memabra/test_case_index.py` (4 tests). Full suite: 100 passed.
- 2026-04-15 04:27 UTC: Integrated `CaseIndex` into `MemabraApp` and `MemabraRunner` for episodic retrieval. Added app-level methods (`build_case_index`, `save_case_index`, `load_case_index`, `best_trajectory_for`). Runner now injects an episodic memory candidate when a case index hit occurs. Added CLI flags `--case-index` and `--rebuild-case-index`. Updated docs. Full suite: 107 passed.
- 2026-04-15 04:54 UTC: Added `case_index_path` support to `OnlineLearningCoordinator` so the case index is automatically rebuilt after each online-learning cycle (including benchmark-generated trajectories). Wired parameter through `app.py` and `cli.py`. Added tests for coordinator, app, and CLI. Full suite: 110 passed.
- 2026-04-15 05:18 UTC: Added `TrajectorySummarizer` (`src/memabra/trajectory_summary.py`) for generating human-readable trajectory summaries. Integrated summarizer into `MemabraRunner` so episodic memory candidates contain rich summaries when a `persistence_store` is available. Added `tests/memabra/test_trajectory_summary.py` (4 tests) and updated runner test. Full suite: 114 passed.
- 2026-04-15 05:42 UTC: Added CLI `--status` flag (`src/memabra/cli.py`) to print current system state (active router version, version count, trajectory count, report count, latest report summary) without running a learning cycle. Added `tests/memabra/test_cli_workflow.py::test_main_status_flag_prints_status_and_skips_workflow`. Full suite: 115 passed.
- 2026-04-15 06:05 UTC: Added CLI `--rollback` and `--list-versions` flags for operator-safe router version management. Added error handling for missing rollback targets (exits 1 with clean message). Added corresponding tests. Full suite: 118 passed. Updated `ONLINE_LEARNING.md` and `DEMO.md` documentation.