Files
memabra/docs/ALPHA_ITERATION_1_PLAN.md
2026-04-15 11:06:05 +08:00

253 lines
12 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# memabra Alpha Iteration 1 Plan
> For Hermes: continue this plan autonomously in small TDD-driven increments. Each run should complete one or more concrete tasks, update this file's progress section, run targeted tests first, then run the full memabra test suite.
Goal: turn memabra from a showable prototype into a safe self-improving alpha by adding an online learning loop with automatic training, evaluation, gated promotion, and rollback-safe router deployment.
Architecture:
- Keep the current layered design.
- Do not replace existing routers; add an orchestration layer around them.
- Promotion must be benchmark-gated: no automatic router switch without passing evaluation thresholds.
- Persist every training/promotion attempt as an auditable artifact.
Tech stack:
- Existing memabra Python package under `src/memabra/`
- Existing pytest suite under `tests/memabra/`
- Existing persistence via JSON artifacts; keep it simple for alpha
---
## Acceptance criteria
Alpha Iteration 1 is complete when memabra can:
1. detect newly accumulated trajectories
2. build a training dataset from eligible trajectories
3. train a challenger router automatically
4. run challenger vs baseline on a fixed benchmark set
5. promote challenger only if thresholds are met
6. save a versioned promoted router
7. keep an auditable training/promotion report
8. leave the currently active router unchanged when challenger loses
---
## Implementation phases
### Phase A — Benchmark-gated online learning loop
#### Task A1: Add a promotion policy object
Objective: define explicit acceptance rules for promoting a challenger router.
Files:
- Create: `src/memabra/promotion.py`
- Create: `tests/memabra/test_promotion.py`
Required behavior:
- Define a `PromotionPolicy` dataclass
- Inputs should include at least:
- `min_reward_delta`
- `max_error_rate_increase`
- `max_latency_increase_ms`
- `required_task_count`
- Provide `evaluate(baseline, challenger) -> PromotionDecision`
- `PromotionDecision` should include:
- `accepted: bool`
- `reasons: list[str]`
- `metrics: dict`
TDD steps:
1. Write failing tests for accepted and rejected cases.
2. Run targeted tests and verify failure.
3. Implement minimal policy logic.
4. Re-run targeted tests.
5. Re-run full memabra suite.
#### Task A2: Add benchmark suite persistence
Objective: store and load a fixed benchmark task set for repeatable evaluations.
Files:
- Create: `src/memabra/benchmarks.py`
- Create: `tests/memabra/test_benchmarks.py`
Required behavior:
- Define a serializable benchmark suite format
- Load/save benchmark tasks from JSON
- Provide a default benchmark seed for memory/tool/skill/composite coverage
TDD steps:
1. Write failing benchmark round-trip tests.
2. Verify RED.
3. Implement load/save helpers.
4. Verify GREEN.
5. Run full suite.
#### Task A3: Add online training coordinator
Objective: orchestrate dataset selection, training, evaluation, and promotion.
Files:
- Create: `src/memabra/online_learning.py`
- Create: `tests/memabra/test_online_learning.py`
Required behavior:
- Define `OnlineLearningCoordinator`
- It should:
- query trajectories from `ArtifactIndex`
- enforce minimum new trajectory count
- train a challenger with `DatasetBuilder`
- evaluate challenger with `Evaluator`
- apply `PromotionPolicy`
- save promoted routers via `RouterVersionStore`
- emit a structured report whether accepted or rejected
TDD steps:
1. Write failing tests for:
- skip when too few new trajectories
- reject when policy fails
- accept and save version when policy passes
2. Verify failure.
3. Implement minimal coordinator.
4. Verify targeted tests.
5. Run full suite.
### Phase B — Auditability and safe deployment
#### Task B1: Add training run reports
Objective: persist every online-learning attempt, not just successful promotions.
Files:
- Extend: `src/memabra/persistence.py` or create `src/memabra/training_reports.py`
- Create: `tests/memabra/test_training_reports.py`
Required behavior:
- Save a JSON report per training run
- Include:
- timestamp
- source trajectory ids
- sample count
- baseline metrics
- challenger metrics
- promotion decision
- promoted version id if any
#### Task B2: Add active router metadata tracking
Objective: make it obvious which router is active and why.
Files:
- Extend: `src/memabra/router_versioning.py`
- Extend: `tests/memabra/test_router_versioning.py`
Required behavior:
- Track metadata for current active router
- Record promotion source, benchmark result summary, and prior version
- Make rollback preserve audit trail
### Phase C — Product surface and automation
#### Task C1: Add app-level online learning entrypoint
Objective: expose one-call retrain/evaluate/promote behavior from `MemabraApp`.
Files:
- Extend: `src/memabra/app.py`
- Extend: `tests/memabra/test_app.py`
Required behavior:
- Add a method like `run_online_learning_cycle(...)`
- Return a structured result dict/report
#### Task C2: Add CLI entrypoint for the alpha loop
Objective: make the safe online-learning loop runnable from the command line.
Files:
- Extend: `src/memabra/cli.py`
- Extend: `tests/memabra/test_cli_workflow.py`
- Update: `docs/projects/memabra/DEMO.md`
Required behavior:
- Add a callable workflow that:
- seeds or uses existing artifacts
- runs one online-learning cycle
- prints the report JSON
#### Task C3: Update docs and wrap-up materials
Objective: document the alpha loop clearly.
Files:
- Update: `docs/projects/memabra/PROGRESS.md`
- Update: `docs/projects/memabra/ROADMAP.md`
- Update: `docs/projects/memabra/DEMO.md`
- Optional: create `docs/projects/memabra/ONLINE_LEARNING.md`
Required behavior:
- Explain promotion gates
- Explain how to run one cycle manually
- Explain where reports and versions are stored
---
## Suggested run order for autonomous 20-minute cycles
Cycle group 1:
- A1 promotion policy
- A2 benchmark suite persistence
Cycle group 2:
- A3 online training coordinator
Cycle group 3:
- B1 training run reports
- B2 active router metadata tracking
Cycle group 4:
- C1 app-level entrypoint
- C2 CLI workflow
- C3 docs cleanup
---
## Estimated autonomous runs
Recommended initial budget: 18 runs at every 20 minutes.
Reasoning:
- 3 to 4 runs for Phase A
- 3 to 4 runs for Phase B
- 2 to 3 runs for Phase C
- remaining runs as slack for regression fixes, docs cleanup, and one or two extra quality passes
At 20 minutes per run, 18 runs gives about 6 hours of autonomous iteration, which is a reasonable overnight alpha push.
---
## Progress tracker
- [x] Task A1 — promotion policy
- [x] Task A2 — benchmark suite persistence
- [x] Task A3 — online training coordinator
- [x] Task B1 — training run reports
- [x] Task B2 — active router metadata tracking
- [x] Task C1 — app-level online learning entrypoint
- [x] Task C2 — CLI online learning workflow
- [x] Task C3 — docs cleanup and operator guidance
- [x] Task D1 — baseline version selection for online learning
- [x] Task E1 — task case index for episodic retrieval
## Run log
- 2026-04-14: Plan created. Ready for autonomous overnight execution.
- 2026-04-14 22:52 UTC: Completed Tasks A1A3. Promotion policy, benchmark persistence, and online training coordinator implemented with tests. Full suite: 71 passed.
- 2026-04-14 23:22 UTC: Completed Tasks B1C3. Training reports, active router metadata tracking, app/CLI entrypoints, and docs implemented with tests. Full suite: 78 passed.
- 2026-04-14 23:24 UTC: Quality pass — CLI main() now defaults to online-learning workflow, fixed schema test resource warning, added missing alpha module exports to package __init__.py. Full suite: 82 passed.
- 2026-04-14 23:50 UTC: Docs and repo hygiene pass — updated DEMO.md and ONLINE_LEARNING.md to reflect that `python -m src.memabra.cli` runs the online-learning workflow; added `docs/projects/memabra/demo-artifacts/` to `.gitignore`; verified CLI end-to-end (promoted=true, version saved, report emitted). Full suite: 82 passed.
- 2026-04-15 00:49 UTC: Safety and usability pass — added exception handling in `OnlineLearningCoordinator` so training/evaluation failures emit error reports instead of crashing; added CLI argument parsing (`--base-dir`, `--min-new-trajectories`); fixed `python -m src.memabra.cli` RuntimeWarning via lazy `cli` import; added `TrainingReportStore.get_report()` for by-id lookup; exported `BenchmarkTask` from package `__init__.py`; updated DEMO.md and ONLINE_LEARNING.md. Full suite: 88 passed.
- 2026-04-15 01:15 UTC: Repo hygiene and commit pass — verified end-to-end CLI workflow produced a promoted router, version, and report; updated `.gitignore` to exclude runtime artifact directories (`router-versions/`, `training-reports/`); committed entire memabra alpha codebase (67 files, 6,818 insertions). Full suite: 88 passed.
- 2026-04-15 02:00 UTC: Persistence pass — `OnlineLearningCoordinator` now supports `seen_trajectory_store` to persist seen trajectory IDs across restarts, preventing duplicate retraining in cron jobs. Added `test_coordinator_persists_seen_trajectory_ids_across_restarts`. Fixed evaluation leakage by refreshing the artifact index after benchmarking and marking post-evaluation trajectories as seen. Wired `seen_trajectory_store` through `app.py` and `cli.py`; CLI now defaults to `<base-dir>/seen-trajectories.json`. Added corresponding tests. Full suite: 91 passed.
- 2026-04-15 02:27 UTC: Dry-run pass — committed pending persistence-pass changes, then added `--dry-run` CLI flag and `dry_run` parameter through the full stack (`OnlineLearningCoordinator`, `app.py`, `cli.py`). In dry-run mode training and evaluation execute but promotion and version saving are skipped; an audit report is still emitted with `dry_run: true`. Added `test_coordinator_dry_run_does_not_promote_or_save_version` and `test_main_entrypoint_passes_dry_run_flag`. Updated `ONLINE_LEARNING.md`. Full suite: 93 passed.
- 2026-04-15 02:51 UTC: Baseline-version pass — added `baseline_version_id` parameter to `OnlineLearningCoordinator.run_cycle()`, `MemabraApp.run_online_learning_cycle()`, and CLI `--baseline-version` flag. This lets operators evaluate a challenger against a specific saved router version rather than the currently active one. Added tests for coordinator, app, and CLI. Updated `ONLINE_LEARNING.md`. Full suite: 96 passed.
- 2026-04-15 03:18 UTC: Verification pass — confirmed all tasks A1D1 are complete and stable. Ran full memabra suite (96 passed) and end-to-end CLI workflow (promoted=true, version saved, report emitted). No code changes required; repo is clean and ready for operator review.
- 2026-04-15 04:02 UTC: Started Phase E — added `CaseIndex` (`src/memabra/case_index.py`) for task-level episodic retrieval. Maps normalized task inputs to the highest-reward trajectory ID, with JSON save/load. Added `tests/memabra/test_case_index.py` (4 tests). Full suite: 100 passed.
- 2026-04-15 04:27 UTC: Integrated `CaseIndex` into `MemabraApp` and `MemabraRunner` for episodic retrieval. Added app-level methods (`build_case_index`, `save_case_index`, `load_case_index`, `best_trajectory_for`). Runner now injects an episodic memory candidate when a case index hit occurs. Added CLI flags `--case-index` and `--rebuild-case-index`. Updated docs. Full suite: 107 passed.
- 2026-04-15 04:54 UTC: Added `case_index_path` support to `OnlineLearningCoordinator` so the case index is automatically rebuilt after each online-learning cycle (including benchmark-generated trajectories). Wired parameter through `app.py` and `cli.py`. Added tests for coordinator, app, and CLI. Full suite: 110 passed.
- 2026-04-15 05:18 UTC: Added `TrajectorySummarizer` (`src/memabra/trajectory_summary.py`) for generating human-readable trajectory summaries. Integrated summarizer into `MemabraRunner` so episodic memory candidates contain rich summaries when a `persistence_store` is available. Added `tests/memabra/test_trajectory_summary.py` (4 tests) and updated runner test. Full suite: 114 passed.
- 2026-04-15 05:42 UTC: Added CLI `--status` flag (`src/memabra/cli.py`) to print current system state (active router version, version count, trajectory count, report count, latest report summary) without running a learning cycle. Added `tests/memabra/test_cli_workflow.py::test_main_status_flag_prints_status_and_skips_workflow`. Full suite: 115 passed.
- 2026-04-15 06:05 UTC: Added CLI `--rollback` and `--list-versions` flags for operator-safe router version management. Added error handling for missing rollback targets (exits 1 with clean message). Added corresponding tests. Full suite: 118 passed. Updated `ONLINE_LEARNING.md` and `DEMO.md` documentation.