Releases: ajhcs/healthcare-agents
Healthcare Agents v1.3.0
Healthcare Agents v1.3.0 Release Notes
Released: 2026-05-21
This release turns Healthcare Agents from a portable prompt pack into a more
usable product surface. The 51 specialist healthcare administration agents are
now easier to discover, inspect, route, install selectively, and evaluate before
use.
Highlights
- Published the package to npm as healthcare-agents.
- Added registry-backed discovery and provenance metadata for all 51 agents.
- Added CLI commands for list, show, choose, prompt, and doctor.
- Added single-agent install support with slug validation.
- Improved installer dry-run and doctor output for safer file writes.
- Added public eval scorecard generation.
- Added trust and safety documentation covering scope, PHI limits, human escalation, source freshness, and eval interpretation limits.
- Added CI gates for lint, audit, package, CLI, and installer smoke checks.
CLI Product Surface
The CLI now supports direct discovery workflows:
npx --yes healthcare-agents list
npx --yes healthcare-agents show revenue-cycle-specialist
npx --yes healthcare-agents choose "clean claim rate dropped"
npx --yes healthcare-agents prompt quality-compliance-officer --mode audit/checklist
npx --yes healthcare-agents doctor
Users can now install one agent instead of the full pack:
npx --yes healthcare-agents install revenue-cycle-specialist --codex --dry-run
Trust And Evaluation
The new registry and scorecard make the library easier to inspect without
opening every prompt manually. Scores remain internal prompt-rubric results, not
external certification, accreditation, legal review, coding validation, billing
approval, clinical validation, or compliance approval.
The prompts still do not create a PHI-safe runtime. Use approved environments,
minimum-necessary data, local source verification, and human sign-off for final
clinical, legal, coding, billing, audit, compliance, contracting, employment, or
executive decisions.
Validation
Validation performed before release:
- bash -n install.sh
- bash scripts/lint-agents.sh
- python3 scripts/audit-agents.py --top 10
- npm pack --dry-run
- node bin/cli.js --help
- node bin/cli.js list
- bash install.sh --all --dry-run
- node -c bin/cli.js
- node -c scripts/generate-scorecard.js
- git diff --check
- GitHub Actions CI on pull request and main
Healthcare Agents v1.2.0
Healthcare Agents v1.2.0 Release Notes
Released: 2026-05-05
This release makes the 51-agent healthcare administration library easier to use
without adding runtime complexity. The prompts remain plain Markdown and
generated SKILL.md packages, but the user experience is now more explicit:
choose the right agent, provide the right inputs, request the right output mode,
and see the right handoffs when work crosses departments.
Highlights
- Added task-based agent selection docs for common healthcare administration jobs.
- Added copy-ready starter prompts across all 10 domains.
- Added a cross-agent handoff map for workflows that span departments.
- Added role-tailored Best Inputs, Output Modes, and Collaboration & Handoffs sections to all 51 agent prompts.
- Updated installer-managed Codex guidance so installed users get the new routing, output-mode, and handoff behavior.
- Added release-only usability smoke scenarios for future checks.
Agent Usability Improvements
Every agent now tells users what information produces the strongest answer,
which output modes it supports, which adjacent agents to involve, and which human
owners must make final high-risk decisions.
The four standardized output modes are:
- quick triage: likely root causes, missing data, immediate checks, escalation triggers.
- workplan: owners, dependencies, KPIs, sequence, validation checkpoints.
- audit/checklist: evidence requests, pass/fail criteria, remediation owners.
- artifact/template: a draft deliverable with assumptions, placeholders, and review notes.
Documentation
New usage docs:
- docs/usage/agent-selection-guide.md
- docs/usage/starter-prompts.md
- docs/usage/handoff-map.md
- docs/eval/usability-release-check.md
The README now includes a compact "Choose the Right Agent" section with common
starting points and output modes.
Validation
Validation performed before release:
- bash scripts/lint-agents.sh
- python3 scripts/audit-agents.py --top 20
- bash install.sh --all --dry-run
- node bin/cli.js --help
- git diff --check
- Usability smoke scenarios from docs/eval/usability-release-check.md
v1.1.2 - GitHub npx Install
Healthcare Agents v1.1.2 Release Notes
Released: 2026-04-23
This patch release corrects the install docs after verification from this environment showed that healthcare-agents is not yet published on the public npm registry.
Changed
- README and INSTALL examples now use the working GitHub-backed command:
  npx --yes github:ajhcs/healthcare-agents install
- Package and installer metadata now report 1.1.2.
Validation
- npx --yes github:ajhcs/healthcare-agents install --version
- bash scripts/lint-agents.sh
- bash install.sh --all --dry-run
- npm pack --dry-run
v1.1.1 - Installer Compatibility
Healthcare Agents v1.1.1 Release Notes
Released: 2026-04-23
This patch release focuses on installability and cross-tool compatibility after the v1.1.0 agent-stack optimization release.
Highlights
- Agent frontmatter now uses lowercase hyphen name values that match filenames, with human-readable labels preserved in display_name.
- The installer now supports Codex App aliases, Claude Desktop/Cowork aliases, OpenCode skills, Claude Skills, and portable .agents/skills output.
- Codex installs now add a managed ~/.codex/AGENTS.md block so Codex knows how to select and read the installed healthcare specialists.
- The README and installation guide now reflect the v1.1.x eval status and current cross-tool file layouts.
- The self-improvement kit installer now copies all 51 role baselines.
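The naming rule above (lowercase hyphen names that match filenames) can be illustrated with a small check. This is a minimal sketch, not the project's validator: the exact pattern, the function names, and the example path are assumptions made for illustration.

```python
import re
from pathlib import PurePosixPath

# Assumed rule: lowercase letters or digits in hyphen-separated segments.
# The release notes do not state the exact pattern the project enforces.
SLUG_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_slug(name: str) -> bool:
    """Return True when name is a lowercase, hyphen-separated slug."""
    return SLUG_PATTERN.fullmatch(name) is not None


def name_matches_filename(name: str, filename: str) -> bool:
    """Check that a frontmatter name is a valid slug equal to the file stem."""
    return is_valid_slug(name) and name == PurePosixPath(filename).stem


# Hypothetical path, used only to exercise the helper.
print(name_matches_filename("revenue-cycle-specialist",
                            "agents/revenue-cycle-specialist.md"))
```

Keeping the human-readable label in a separate display_name field, as the notes describe, lets the slug stay strict without losing the friendly title.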
Validation
- bash scripts/lint-agents.sh
- bash -n install.sh
- bash install.sh --all --dry-run
- Temp-home install test for Claude agents, Codex agents/instructions, Claude skills, OpenCode skills, and .agents/skills
v1.1.0 - Agent Stack Optimization
Healthcare Agents v1.1.0 Release Notes
Released: 2026-04-23
This release upgrades the healthcare-agents stack from a broad first release into a calibrated, eval-driven 51-agent library. The work focused on two things: improving every installable healthcare agent, and making the eval/improvement loop reliable enough to run under current SOTA coding and reasoning models.
Highlights
- Improved all 51 healthcare administration agents.
- Rebuilt the eval workflow around native subagents and model specialization.
- Added role baselines for every installable agent.
- Required exact, persisted Q001-Q025 question artifacts for before/after comparisons.
- Removed the unused Python/DSPy harness and consolidated the project around the lightweight self-improvement workflow.
Agent Quality Improvements
All 51 prompts were evaluated and improved in two passes:
- First 15 agents: average score improved from 85.0 to 93.9.
- Remaining 36 agents: average score improved from 85.11 to 95.50.
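For a rough library-wide view, the two pass averages can be pooled by agent count. The sketch below assumes each reported figure is a simple mean over the agents in that pass and that scores are comparable across passes; neither assumption is confirmed by these notes, and the function name is illustrative.

```python
# (agent_count, average_before, average_after) as reported for each pass.
passes = [
    (15, 85.0, 93.9),
    (36, 85.11, 95.50),
]


def pooled_average(groups, index):
    """Agent-count-weighted mean of the value at position `index` in each group."""
    total_agents = sum(group[0] for group in groups)
    weighted_sum = sum(group[0] * group[index] for group in groups)
    return weighted_sum / total_agents


before = pooled_average(passes, 1)
after = pooled_average(passes, 2)
print(f"before={before:.2f} after={after:.2f} gain={after - before:.2f}")
```

Under those assumptions the pooled average moves from about 85.08 to about 95.03 across all 51 agents.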
The prompt changes were intentionally narrow. They sharpen role mechanics, regulatory boundaries, source hierarchies, handoffs, deliverables, and edge-case behavior without flattening the agents into generic healthcare-administration assistants.
Major improvement areas included:
- Clinical operations: observation/SNF status, utilization notices, infection prevention attribution, research consent and closeout controls, EMTALA transfer handling, and emergency-preparedness activation details.
- Health IT: Epic master-file dependencies, interoperability replay/backfill controls, USCDI/TEFCA readiness, telehealth payer matrices, PHI extract governance, and AI/ambient documentation controls.
- Payer and value-based care: network adequacy evidence, credentialing adverse-file routing, Medicare outreach boundaries, attribution and quality-gate controls, and downside-risk readiness.
- Quality and population health: CAHPS setting selection, PSWP/PSES boundaries, QI/SPC mechanics, accreditation evidence, surveillance reporting matrices, CBO MOU controls, and community-benefit documentation.
- Revenue and pharmacy: 340B duplicate-discount controls, CDM edit checks, contract analytics source hierarchy, EDI denial workflows, finance reserve boundaries, coding appeal source hierarchy, and medication-safety governance.
- Strategy: actuarial certification/reliance caveats, MLR workflow detail, opportunity-sizing formulas, and predictive-operations validation checks.
Eval System Changes
The active eval system is now the lightweight self-improvement kit:
- .claude/commands/eval.md
- eval/rubric.md
- eval/role-baselines/
- eval/meta/
- eval/run-logs/README.md
- docs/eval/exam-architect-playbook.md
- docs/eval/model-tuning.md
The workflow now prefers four roles when the runtime supports it:
- Parent orchestrator: owns preflight, git writes, run logs, and commit/revert decisions.
- Scorer/judge: strongest available reasoning model; generates exams and critiques answers.
- Editor: faster strong model; edits only the requested agent prompt.
- Adjudicator: optional different model family for close deltas, high-risk roles, or release scoring.
The eval command now requires before/after or score-only baseline runs to persist full question artifacts before answers are generated. Focus labels and weak-area summaries are no longer enough. Retests must identify whether they used exact baseline questions or fresh comparable questions.
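The persistence requirement can be pictured as a preflight check that every question ID from Q001 through Q025 has stored text before any answers are generated. This is a minimal sketch: the dictionary shape and function names are hypothetical, and the real artifact format is not described in these notes.

```python
def expected_question_ids(count: int = 25) -> list[str]:
    """Return zero-padded IDs Q001 through Q{count}."""
    return [f"Q{number:03d}" for number in range(1, count + 1)]


def missing_question_ids(persisted: dict[str, str], count: int = 25) -> list[str]:
    """Return IDs that are absent or whose stored question text is blank."""
    return [
        question_id
        for question_id in expected_question_ids(count)
        if not persisted.get(question_id, "").strip()
    ]


# Hypothetical artifact with one question dropped.
artifact = {qid: f"Full text of {qid}" for qid in expected_question_ids()}
del artifact["Q007"]
print(missing_question_ids(artifact))  # ['Q007']
```

A run would proceed only when the returned list is empty, which is what makes an exact-question retest possible later.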
Cleanup
The old Python/DSPy harness was removed because it was not the active path for agent improvement. Deleted components included the deeper harness implementation, schema models, legacy JSON rubrics, tests, and the shell runner.
This reduces maintenance burden and makes the repo's active improvement path clearer for both Codex and Claude Code.
Validation
Validation performed before release:
- bash scripts/lint-agents.sh
- git diff --check
- Exact-question retests using retained Q001-Q025 question artifacts for the remaining 36-agent pass.
The final tracked state contains 51 lint-clean agent prompts and the simplified eval stack.
v1.0.0 — 51 Healthcare Admin Agents
Release Notes: 10-Agent Eval Loop Milestone
Date: 2026-04-09
Headline
10 of 51 healthcare administration agents now score 80+ on a rigorous 0--100 automated eval, up from zero. This is the first known automated improvement loop for healthcare admin AI agents -- iterative exam generation, rubric-locked scoring, targeted prompt edits, and git-ratcheted commits, all running without human intervention.
What Changed
We shipped a complete /eval improvement loop. Each iteration works like this:
- Generate a 25-question domain exam from the agent's system prompt.
- Score answers against a frozen rubric weighted Accuracy 0.40, Completeness 0.35, Specificity 0.25.
- Identify the weakest areas and propose targeted prompt edits (with explicit identity-preservation constraints so prompts get sharper, not flatter).
- Edit the agent prompt -- additive, high-leverage changes only, respecting a fixed line cap.
- Re-score using the same frozen question set.
- Commit or revert automatically. If the score improved, the edit stays and a row is appended to eval/results.tsv. If not, the file is restored. No regressions ship.
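The rubric weighting in the scoring step can be sketched as a weighted sum. The weights come from the list above; the assumption that each dimension is scored on a 0-100 scale and combined linearly is ours, and the sample dimension scores are made up for illustration.

```python
# Weights as stated in the release notes.
WEIGHTS = {"accuracy": 0.40, "completeness": 0.35, "specificity": 0.25}


def weighted_score(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension scores (assumed 0-100) into a single total."""
    return sum(
        weight * dimension_scores[dimension]
        for dimension, weight in WEIGHTS.items()
    )


# Hypothetical dimension scores, not taken from any real eval run.
example = {"accuracy": 85.0, "completeness": 80.0, "specificity": 75.0}
print(round(weighted_score(example), 2))  # 0.40*85 + 0.35*80 + 0.25*75 = 80.75
```

Because the weights sum to 1.0, the total stays on the same 0-100 scale as the inputs.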
The loop uses a split-role architecture: a strong scorer/judge model generates exams and critiques, a faster editor model patches prompts, and a parent orchestrator owns git writes and the append-only log. This avoids the identity drift that comes from letting a single model optimize itself.
Agents Improved
All 10 agents crossed the 80-point threshold. Best post-edit scores:
| Agent | Best Score | Key Improvements |
|---|---|---|
| Revenue Medical Coding Specialist | 82.15 | LCD/NCD medical necessity, charge capture workflows, global-period and anesthesia coding |
| Revenue Finance Manager | 81.55 | Multi-campus cost reports, capital post-implementation review, zero-based budgeting |
| Revenue 340B Program Manager | 81.20 | Orphan-drug exclusion, Medicare Part B modifier mechanics, ADR/CMP dispute workflow |
| Quality Compliance Officer | 81.15 | HIPAA breach exceptions, Stark failure modes, exclusion reinstatement controls |
| Healthcare Interoperability Engineer | 81.10 | SMART on FHIR auth/PKCE/JWT, HL7 ACK idempotency, TEFCA patient-matching governance |
| Quality Process Improvement Analyst | 80.85 | Managed-care QAPI (42 CFR 438.330), sentinel event RCA/CAPA, risk-adjusted outcomes |
| Revenue Cycle Specialist | 80.65 | 835/ERA posting controls, credit balance workflow, denial-type-specific appeal assembly |
| Revenue Contract Analyst | 80.45 | Contract build hierarchy, outpatient edit logic, prompt-pay and offset economics |
| Payer Managed Care Analyst | 80.30 | Medicaid directed payments, settlement reconciliation, MA bid-to-revenue bridge |
| Health Informatics Manager | 80.30 | FHIR/SMART production ops, public-health reporting controls, identity and downtime governance |
The Medical Coding Specialist saw the largest single-iteration gain at +11.00 points. The Compliance Officer improved +9.05 in one pass. Most agents required 1--3 iterations to cross 80.
What Was Added to Prompts
The eval loop does not add generic advice. It adds the specific knowledge that domain practitioners would expect and that the rubric penalizes when absent:
- CFR citations: 42 CFR 412.106(b) (DSH qualification), 42 CFR Part 419 (OPPS), 42 CFR 412.4(f) (transfer DRGs), 42 CFR 438.330 (managed-care QAPI), 45 CFR 164 (HIPAA Security Rule controls)
- Worked calculation examples: IME and DGME formulas with FTE rules, DSH patient-percentage calculations, OPPS APC payment formulas, transfer DRG per diem calculations
- Regulatory formulas: IRC 141/148 bond compliance, payer mix shift methodology, outlier payment reconciliation mechanics
- Audit process details: MAC audit lifecycle with PRRB appeal paths, reasonable collection effort criteria per CMS Pub 15-1 Section 310, HRSA ADR/CMP dispute-file requirements
- Debt covenant structures: Specific threshold structures, credit-balance and overpayment workflows, prompt-pay and offset modeling
- Payment mechanics: 835/ERA control points, Medicaid supplemental payment modeling, observation-status and 340B outpatient economics, Medicare Advantage bid/rebate/RAF revenue bridges
Infrastructure Shipped
Three pull requests delivered the full stack:
| PR | Title | Scope |
|---|---|---|
| #4 | Enrich all 51 agent prompts with examples and seeds | 53 files -- baseline prompt enrichment across every agent |
| #5 | Add eval calibration infrastructure and provider framework | 709 files -- rubric, scoring harness, provider framework, calibration tooling |
| #6 | Add orchestrator agent and lifecycle documentation | 11 files -- orchestrator design, lifecycle docs, design specs |
Merged separately: #7 landed the 10-agent improvements themselves (18 files, 759 additions).
Calibration Results
A pilot calibration run validated the scoring infrastructure before the improvement loop began:
- Mean calibration delta: +0.198 (scorer alignment improved significantly after rubric tuning)
- Lint pass rate: 0.48 to 0.88 (prompt structural quality nearly doubled)
The frozen rubric at eval/rubric.md locks the scoring weights so improvements are comparable across iterations and agents. Scores from different iterations of the same agent are not directly compared -- the same-question pre/post design within each iteration is the unit of measurement.
Design Decisions Worth Noting
Split-role architecture. A single model optimizing its own prompt tends to drift toward generic executive tone and lose domain edge. Separating the scorer (which identifies gaps and specifies what to preserve) from the editor (which patches the file) keeps prompts sharp.
Identity preservation. The scorer returns identity_to_preserve and anti_patterns_to_avoid alongside weak_areas. The editor is constrained to make the prompt more capable, not more average. This is why prompts gained specific CFR citations and payment formulas instead of broad best-practice boilerplate.
Line cap enforcement. Each agent file has a fixed line cap based on its baseline. Edits must fit within the cap. This forces compression and prioritization rather than unbounded growth.
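A line-cap gate can be as small as a line count compared against the cap. In this sketch the cap is passed in as a parameter, because how the cap is derived from each agent's baseline is not specified here; the function name is illustrative.

```python
def fits_line_cap(prompt_text: str, line_cap: int) -> bool:
    """Return True when the prompt has no more lines than the cap allows."""
    return len(prompt_text.splitlines()) <= line_cap


# Hypothetical three-line prompt checked against two different caps.
draft = "# Role\nYou are a revenue cycle specialist.\nCite sources."
print(fits_line_cap(draft, 3), fits_line_cap(draft, 2))  # True False
```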
Git ratchet. Every improvement is committed atomically. Every failed edit is reverted. The append-only eval/results.tsv log provides a complete audit trail. No regressions ship.
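The ratchet rule can be sketched as a strict-improvement decision plus an append-only log write. This is an illustration under stated assumptions: the column layout of eval/results.tsv is hypothetical, the scores and agent name are invented, and an in-memory buffer stands in for the real file so no git commands run.

```python
import csv
import io


def ratchet_decision(score_before: float, score_after: float) -> str:
    """Keep the edit only on a strict improvement; otherwise revert it."""
    return "commit" if score_after > score_before else "revert"


def append_result(log, agent: str, score_before: float, score_after: float) -> None:
    """Append one tab-separated row. The column layout is hypothetical."""
    writer = csv.writer(log, delimiter="\t", lineterminator="\n")
    writer.writerow([agent, f"{score_before:.2f}", f"{score_after:.2f}"])


log = io.StringIO()  # stand-in for a log file opened in append mode
attempts = [("example-agent", 72.10, 81.15), ("example-agent", 81.15, 80.40)]
for agent, before, after in attempts:
    if ratchet_decision(before, after) == "commit":
        append_result(log, agent, before, after)

print(log.getvalue(), end="")
```

Only the first attempt is logged; the second scores lower than its baseline, so it is reverted and leaves no row.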
What Is Next
- 41 agents remaining. The loop is proven and repeatable. Target cadence is roughly 5 agents per week.
- 30+ agents at 80+ by mid-May. That threshold gives publishable coverage across all 10 agent categories (Revenue, Clinical, Quality, Payer, Operations, Health IT, Population Health, Pharmacy, Strategy, Emergency Preparedness).
- Second-pass depth. Agents already at 80 can run additional iterations targeting 85+ as the rubric surface area becomes well-understood.
- Cross-agent consistency. As more agents pass through the loop, patterns in weak areas will inform batch improvements to prompt architecture across the full set.