Stabilize and unify some test skills#501

Merged
Evangelink merged 7 commits into main from dev/amauryleve/unify on Apr 8, 2026

@Evangelink
Member

No description provided.

- Move platform-detection.md and filter-syntax.md to plugins/dotnet-test/shared/,
  removing 3 identical copies of each from run-tests, mtp-hot-reload, and
  migrate-vstest-to-mtp reference directories.

- Move dotnet.md from exp-test-smell-detection/extensions/ to shared/ as
  dotnet-test-frameworks.md in both dotnet-test and dotnet-experimental plugins.
  Update exp-assertion-quality, exp-test-boilerplate-detection, exp-test-tagging,
  and test-anti-patterns to reference the shared file instead of inlining
  framework detection tables.

- Differentiate test-anti-patterns (quick pragmatic review) from
  exp-test-smell-detection (deep formal audit with academic taxonomy) by updating
  descriptions and cross-referencing each other in When Not to Use sections.

- Update skill-validator to allow ../../shared/ file references while still
  blocking other parent-directory traversals. Add tests for the new rule.
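
The validator rule described in the last bullet can be sketched as follows. This is only an illustration in Python, not the actual skill-validator code: the function name `is_allowed_reference` and the exact matching logic are assumptions, and the check covers parent-directory traversal only.

```python
from pathlib import PurePosixPath

# Hypothetical allow-listed prefix, taken from the "../../shared/" wording above.
SHARED_PREFIX = ("..", "..", "shared")


def is_allowed_reference(ref: str) -> bool:
    """Return True if a relative file reference passes the traversal rule."""
    parts = PurePosixPath(ref).parts
    # References that never leave the skill directory are always fine.
    if ".." not in parts:
        return True
    # The single exception: ../../shared/<file>, with no further ".." after it.
    return (
        parts[:3] == SHARED_PREFIX
        and len(parts) > 3
        and ".." not in parts[3:]
    )


if __name__ == "__main__":
    for ref in (
        "references/filter-syntax.md",
        "../../shared/platform-detection.md",
        "../other-skill/SKILL.md",
    ):
        print(ref, is_allowed_reference(ref))
```

Matching on path segments rather than on a string prefix avoids accepting a reference such as `../../shared/../something-else.md`, which starts with the allowed prefix but escapes it again.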

Replace the plugin-level shared/ directories with non-invocable reference
skills (user-invocable: false) that other skills reference by name.

- Create platform-detection, filter-syntax, and dotnet-test-frameworks as
  hidden skills under plugins/dotnet-test/skills/. These contain the
  detection tables and syntax references previously duplicated across
  run-tests, mtp-hot-reload, and migrate-vstest-to-mtp.

- Create exp-dotnet-test-frameworks as a hidden skill under
  plugins/dotnet-experimental/skills/ for the experimental test analysis
  skills (exp-test-smell-detection, exp-assertion-quality, etc.).

- Update all consuming skills to reference these by skill name in backtick
  notation instead of file links.

- Revert the skill-validator ../../shared/ exception — no longer needed
  since all references now use the standard skill name mechanism.
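
As a rough illustration of the hidden-skill mechanism described above, the sketch below reads the frontmatter of a SKILL.md file and checks the `user-invocable` field. It assumes a simple `---`-delimited block of `key: value` lines and treats a missing field as invocable; both are assumptions made for the example, and the helper names are hypothetical rather than part of any real tooling.

```python
def parse_frontmatter(text: str) -> dict:
    """Parse a minimal '---' delimited block of 'key: value' lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of the frontmatter block
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_user_invocable(skill_md: str) -> bool:
    """Assumption: a skill is invocable unless it says 'user-invocable: false'."""
    value = parse_frontmatter(skill_md).get("user-invocable", "true")
    return value.lower() != "false"


if __name__ == "__main__":
    hidden = "---\nname: platform-detection\nuser-invocable: false\n---\n# Reference\n"
    print(is_user_invocable(hidden))
```

A real implementation would more likely use a YAML parser; the hand-rolled parsing here only keeps the example free of third-party dependencies.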

exp-test-maintainability consisted of only 6 calibration rules and had no workflow.
exp-test-boilerplate-detection had the full 5-category detection workflow,
examples, calibration, and validation. Merge the boilerplate content into
exp-test-maintainability (the broader, more user-facing name) and add the
two unique maintainability rules (DisplayName guidance, DataRow vs
DynamicData preference) to Category 3.

- Replace exp-test-maintainability SKILL.md with the merged content
- Move test fixtures from exp-test-boilerplate-detection to exp-test-maintainability
- Merge eval.yaml scenarios (4 total: 2 from each original skill)
- Delete exp-test-boilerplate-detection skill and tests
- Update all cross-references in exp-test-smell-detection, dotnet-test-frameworks,
  exp-dotnet-test-frameworks, and CODEOWNERS

…ion analysis

Point users to exp-mock-usage-analysis from the Over-mocking entry and
to exp-test-maintainability from the Duplicate tests entry.
@Evangelink
Member Author

/evaluate

github-actions Bot added a commit that referenced this pull request Apr 7, 2026
github-actions Bot added a commit that referenced this pull request Apr 7, 2026
@github-actions
Contributor

github-actions Bot commented Apr 7, 2026

Skill Validation Results

Skill Scenario Quality Skills Loaded Overfit Verdict
test-anti-patterns Detect mixed severity anti-patterns in repository service tests 5.0/5 → 5.0/5 ✅ test-anti-patterns; tools: report_intent, skill / ⚠️ NOT ACTIVATED ✅ 0.06 [1]
test-anti-patterns Detect flakiness indicators and test coupling 2.7/5 → 4.7/5 🟢 ✅ test-anti-patterns; tools: report_intent, skill / ⚠️ NOT ACTIVATED ✅ 0.06
test-anti-patterns Detect duplicated tests and magic values 3.0/5 → 5.0/5 🟢 ✅ test-anti-patterns; tools: skill, report_intent / ✅ writing-mstest-tests; test-anti-patterns; tools: report_intent, skill ✅ 0.06
test-anti-patterns Recognize well-written tests without inventing false positives 2.0/5 → 5.0/5 🟢 ✅ test-anti-patterns; tools: report_intent, skill ✅ 0.06
exp-test-maintainability Recommend data-driven patterns with display names for unclear parameters 4.0/5 → 3.7/5 🔴 ⚠️ NOT ACTIVATED ✅ 0.11 [2]
exp-test-maintainability Recognize well-maintained tests that need minimal changes 4.3/5 → 4.7/5 🟢 ✅ exp-test-maintainability; tools: skill, report_intent / ✅ exp-test-maintainability; tools: report_intent, skill ✅ 0.11 [3]
exp-test-maintainability Detect repeated object construction and setup across test methods 3.0/5 → 4.3/5 🟢 ✅ exp-test-maintainability; tools: skill, glob / ✅ exp-test-maintainability; tools: skill ✅ 0.11
exp-test-maintainability Recognize tests with minimal boilerplate that need no refactoring 2.3/5 → 4.3/5 🟢 ✅ exp-test-maintainability; tools: skill / ✅ exp-test-maintainability; tools: skill, glob ✅ 0.11 [4]
exp-assertion-quality Identify low assertion diversity in equality-dominated test suite 4.0/5 → 5.0/5 🟢 ✅ exp-assertion-quality; tools: skill ✅ 0.12
exp-assertion-quality Flag assertion-free tests and trivial-only assertions 3.7/5 → 4.0/5 🟢 ✅ exp-assertion-quality; tools: skill ✅ 0.12 [5]
exp-assertion-quality Recognize well-diversified assertion usage 2.7/5 → 5.0/5 🟢 ✅ exp-assertion-quality; tools: skill ✅ 0.12
exp-assertion-quality Decline request to write new tests from scratch 2.3/5 ⏰ → 3.7/5 ⏰ 🟢 ℹ️ not activated (expected) ✅ 0.12 [6]
mtp-hot-reload Suggest hot reload for failing test in MTP project (SDK 9) 1.0/5 → 2.3/5 ⏰ 🟢 ✅ mtp-hot-reload; tools: skill ✅ 0.10
mtp-hot-reload Suggest hot reload for failing test in MTP project (SDK 10) 1.0/5 → 4.3/5 🟢 ✅ mtp-hot-reload; tools: skill, bash, create ✅ 0.10
mtp-hot-reload Enable hot reload when package already installed 2.0/5 → 5.0/5 🟢 ✅ mtp-hot-reload; tools: skill ✅ 0.10
mtp-hot-reload Suggest launchSettings.json configuration for hot reload 1.0/5 ⏰ → 4.0/5 🟢 ✅ mtp-hot-reload; tools: skill, bash, create, glob / ✅ mtp-hot-reload; tools: skill, bash, create ✅ 0.10
mtp-hot-reload Use dotnet run not dotnet test for hot reload 2.3/5 → 3.0/5 🟢 ✅ mtp-hot-reload; tools: skill / ✅ mtp-hot-reload; tools: skill, report_intent ✅ 0.10 [7]
mtp-hot-reload Negative: VSTest project cannot use MTP hot reload 1.7/5 → 2.3/5 🟢 ✅ mtp-hot-reload; tools: skill, create ✅ 0.10 [8]
mtp-hot-reload Run specific failing test with hot reload filter 1.0/5 → 5.0/5 🟢 ✅ mtp-hot-reload; tools: skill / ✅ mtp-hot-reload; platform-detection; tools: skill, bash ✅ 0.10
run-tests Run tests in a VSTest MSTest project 4.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill ✅ 0.18
run-tests Run tests with trx reporting on MTP project (SDK 9) 3.7/5 → 3.7/5 ✅ run-tests; tools: skill / ✅ run-tests; platform-detection; tools: skill ✅ 0.18 [9]
run-tests Run tests with blame-hang on MTP project (SDK 10) 2.7/5 → 2.0/5 ⏰ 🔴 ✅ run-tests; tools: skill, bash, edit / ⚠️ NOT ACTIVATED ✅ 0.18 [10]
run-tests Run tests in a multi-TFM project targeting a specific framework 1.7/5 → 4.0/5 🟢 ✅ run-tests; tools: skill, bash, read_bash, glob / ✅ run-tests; tools: skill, bash, glob ✅ 0.18
run-tests Filter MSTest tests by category on VSTest 5.0/5 → 5.0/5 ✅ run-tests; tools: skill, bash, glob / ✅ run-tests; tools: skill, bash ✅ 0.18 [11]
run-tests Filter NUnit tests by class name on VSTest 4.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, bash / ✅ run-tests; tools: bash, skill ✅ 0.18
run-tests Filter xUnit v3 tests by class on MTP 1.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, bash / ✅ run-tests; tools: skill ✅ 0.18
run-tests Filter xUnit v3 tests by trait on MTP 1.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, view / ✅ run-tests; platform-detection; filter-syntax; tools: skill, view ✅ 0.18
run-tests Filter TUnit tests by class using treenode-filter 2.3/5 → 4.3/5 🟢 ✅ run-tests; tools: skill, bash / ✅ run-tests; filter-syntax; platform-detection; tools: skill, bash ✅ 0.18
run-tests Combine multiple filter criteria on VSTest MSTest 4.0/5 → 4.3/5 🟢 ⚠️ NOT ACTIVATED / ✅ run-tests; tools: skill, bash ✅ 0.18 [12]
run-tests MTP project on SDK 9 must use -- separator for args 1.0/5 → 4.3/5 🟢 ✅ run-tests; tools: skill ✅ 0.18 [13]
run-tests MTP project on SDK 10 passes args directly 3.0/5 → 4.0/5 🟢 ✅ run-tests; tools: skill ✅ 0.18 [14]
run-tests Detect test platform from Directory.Build.props 1.3/5 → 5.0/5 🟢 ✅ run-tests; tools: skill ✅ 0.18 [15]
run-tests Negative test: do not use MTP syntax for a VSTest project 4.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, view / ⚠️ NOT ACTIVATED ✅ 0.18 [16]
migrate-vstest-to-mtp Migrate MSTest project from VSTest to Microsoft.Testing.Platform 4.7/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill / ✅ migrate-vstest-to-mtp; tools: skill, report_intent ✅ 0.07 [17]
migrate-vstest-to-mtp Migrate NUnit project from VSTest to Microsoft.Testing.Platform 2.0/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill ✅ 0.07
migrate-vstest-to-mtp Migrate xUnit.net v2 project from VSTest to Microsoft.Testing.Platform 1.7/5 → 4.3/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill, report_intent, bash, view / ✅ migrate-vstest-to-mtp; tools: skill ✅ 0.07
migrate-vstest-to-mtp Update Azure DevOps pipeline from VSTest task to MTP 2.0/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill ✅ 0.07
migrate-vstest-to-mtp Migrate MSTest.Sdk project that explicitly uses VSTest 3.0/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill ✅ 0.07
migrate-vstest-to-mtp Translate dotnet test VSTest arguments to MTP equivalents 4.0/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill, report_intent / ✅ migrate-vstest-to-mtp; tools: skill ✅ 0.07 [18]
migrate-vstest-to-mtp Handle exit code 8 when migrating from VSTest to MTP 3.3/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill ✅ 0.07
migrate-vstest-to-mtp Configure dotnet test MTP mode on .NET 10 SDK 2.0/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill ✅ 0.07
migrate-vstest-to-mtp Migrate xUnit.net VSTest filter syntax to MTP 1.3/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill ✅ 0.07
migrate-vstest-to-mtp Full VSTest to MTP migration plan for MSTest solution 4.0/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill / ✅ migrate-vstest-to-mtp; tools: skill, report_intent ✅ 0.07 [19]
exp-test-smell-detection Detect multiple test smells in order processing test suite 5.0/5 → 5.0/5 ✅ exp-test-smell-detection; tools: skill ✅ 0.06 [20]
exp-test-smell-detection Recognize well-written tests with no significant smells 2.7/5 → 4.7/5 🟢 ✅ exp-test-smell-detection; tools: skill ✅ 0.06
exp-test-smell-detection Recognize integration tests and avoid false positives for external resources 5.0/5 → 5.0/5 ✅ exp-test-smell-detection; tools: skill ✅ 0.06 [21]
exp-test-smell-detection Decline request to write new tests from scratch 4.3/5 → 4.7/5 🟢 ℹ️ not activated (expected) ✅ 0.06 [22]
exp-test-tagging Tag an untagged MSTest test suite 3.3/5 → 4.3/5 🟢 ✅ exp-test-tagging; tools: skill / ✅ exp-test-tagging; tools: skill, glob, task, read_agent ✅ 0.14 [23]
exp-test-tagging Tag an untagged xUnit test suite 4.0/5 → 4.7/5 🟢 ✅ exp-test-tagging; tools: skill, glob / ✅ exp-test-tagging; tools: skill, glob, grep ✅ 0.14
exp-test-tagging Tag an untagged NUnit test suite 3.3/5 → 4.7/5 🟢 ✅ exp-test-tagging; tools: skill / ✅ exp-test-tagging; tools: skill, glob ✅ 0.14
exp-test-tagging Audit test distribution without modifying files 4.0/5 → 5.0/5 🟢 ✅ exp-test-tagging; tools: skill / ✅ exp-test-tagging; tools: skill, glob ✅ 0.14
exp-test-tagging Decline request to write new tests 4.0/5 → 3.7/5 🔴 ℹ️ not activated (expected) ✅ 0.14 [24]

[1] ⚠️ High run-to-run variance (CV=4.66) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=0.84) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=10.92) — consider re-running with --runs 5
[4] ⚠️ High run-to-run variance (CV=0.55) — consider re-running with --runs 5
[5] ⚠️ High run-to-run variance (CV=0.96) — consider re-running with --runs 5
[6] ⚠️ High run-to-run variance (CV=3.01) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -23.1% due to: judgment, quality, tokens (111793 → 171384)
[7] ⚠️ High run-to-run variance (CV=0.65) — consider re-running with --runs 5
[8] ⚠️ High run-to-run variance (CV=1.08) — consider re-running with --runs 5
[9] ⚠️ High run-to-run variance (CV=1.27) — consider re-running with --runs 5
[10] ⚠️ High run-to-run variance (CV=0.62) — consider re-running with --runs 5
[11] ⚠️ High run-to-run variance (CV=10.81) — consider re-running with --runs 5
[12] ⚠️ High run-to-run variance (CV=1.31) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -16.6% due to: judgment, quality
[13] ⚠️ High run-to-run variance (CV=1.82) — consider re-running with --runs 5
[14] ⚠️ High run-to-run variance (CV=10.92) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -41.4% due to: judgment, tokens (265819 → 584220), quality, time (158.7s → 237.3s), tool calls (18 → 26)
[15] ⚠️ High run-to-run variance (CV=0.73) — consider re-running with --runs 5
[16] ⚠️ High run-to-run variance (CV=0.79) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -0.8% due to: tokens (29655 → 34602)
[17] ⚠️ High run-to-run variance (CV=15.62) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -1.1% due to: tokens (12832 → 36156), tool calls (0 → 1)
[18] ⚠️ High run-to-run variance (CV=2.33) — consider re-running with --runs 5
[19] ⚠️ High run-to-run variance (CV=1.12) — consider re-running with --runs 5
[20] ⚠️ High run-to-run variance (CV=0.74) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -2.1% due to: tokens (60191 → 104698), time (33.6s → 76.6s), tool calls (5 → 6)
[21] (Plugin) Quality unchanged but weighted score is -10.0% due to: tokens (51503 → 109935), tool calls (4 → 8), time (36.7s → 100.8s)
[22] ⚠️ High run-to-run variance (CV=0.64) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -15.3% due to: judgment, quality
[23] ⚠️ High run-to-run variance (CV=0.66) — consider re-running with --runs 5
[24] ⚠️ High run-to-run variance (CV=1.05) — consider re-running with --runs 5
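
The report does not say how the CV values in these footnotes are computed. Assuming CV means the usual coefficient of variation (sample standard deviation divided by the absolute mean), the sketch below shows why values well above 1 appear when the mean of the per-run measurements is close to zero. The function name and the sample numbers are hypothetical.

```python
import math
import statistics


def coefficient_of_variation(samples: list[float]) -> float:
    """Sample standard deviation divided by |mean| (assumed definition of CV)."""
    if len(samples) < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.fmean(samples)
    if mean == 0:
        return math.inf  # the ratio is undefined for a zero mean
    return statistics.stdev(samples) / abs(mean)


if __name__ == "__main__":
    # Hypothetical per-run score deltas: moderate spread, mean near zero.
    noisy = [0.30, -0.25, 0.01]
    # Hypothetical per-run score deltas: same order of spread, larger mean.
    stable = [1.9, 2.0, 2.1]
    print(coefficient_of_variation(noisy))
    print(coefficient_of_variation(stable))
```

Under this reading, a large CV says more about a mean close to zero than about a large absolute spread, which is consistent with the bot's advice to re-run with more runs before drawing conclusions.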

⏰ timeout: one or more runs hit the scenario timeout limit (120s, 180s, 300s, or 360s); scoring may be affected because model execution was aborted before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

▶ Sessions Visualisation -- interactive replay of all evaluation sessions

Inline the critical SDK 10 detection signal (global.json test.runner)
directly in Step 1 instead of deferring entirely to the platform-detection
skill. This makes the distinction between SDK 10 (no -- separator) and
SDK 8/9 (requires -- separator) more prominent.

Add a quick detection summary table, strengthen the Common Pitfalls entry
for SDK 10 with a blame-hang-timeout example, and keep the
platform-detection skill reference for the full detection logic.
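
The detection signal described above can be sketched as follows. This is a minimal Python illustration, not the skill's actual logic: it assumes global.json contains no comments, that the project is already known to use Microsoft.Testing.Platform, and that the only signal consulted is the `test.runner` value. The function names are hypothetical.

```python
import json


def uses_mtp_runner(global_json_text: str) -> bool:
    """True if global.json opts dotnet test into Microsoft.Testing.Platform mode."""
    data = json.loads(global_json_text)
    test_section = data.get("test")
    if not isinstance(test_section, dict):
        return False
    return test_section.get("runner") == "Microsoft.Testing.Platform"


def build_dotnet_test_args(global_json_text: str, platform_args: list[str]) -> list[str]:
    """Build a dotnet test command line for an MTP project (assumed)."""
    if uses_mtp_runner(global_json_text):
        # SDK 10 with test.runner set: platform arguments are passed directly.
        return ["dotnet", "test", *platform_args]
    # SDK 8/9: platform arguments go after the "--" separator.
    return ["dotnet", "test", "--", *platform_args]


if __name__ == "__main__":
    sdk10 = '{"test": {"runner": "Microsoft.Testing.Platform"}}'
    sdk9 = '{"sdk": {"version": "9.0.100"}}'
    print(build_dotnet_test_args(sdk10, ["--report-trx"]))
    print(build_dotnet_test_args(sdk9, ["--report-trx"]))
```

The full detection logic in the platform-detection skill also has to decide whether a project uses VSTest or MTP in the first place; that step is deliberately left out here.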
@Evangelink
Member Author

/evaluate

github-actions Bot added a commit that referenced this pull request Apr 7, 2026
github-actions Bot added a commit that referenced this pull request Apr 7, 2026
@github-actions
Contributor

github-actions Bot commented Apr 7, 2026

Skill Validation Results

Skill Scenario Quality Skills Loaded Overfit Verdict
test-anti-patterns Detect mixed severity anti-patterns in repository service tests 5.0/5 → 5.0/5 ✅ test-anti-patterns; tools: report_intent, skill / ⚠️ NOT ACTIVATED ✅ 0.06 [1]
test-anti-patterns Detect flakiness indicators and test coupling 2.7/5 → 4.7/5 🟢 ✅ test-anti-patterns; tools: report_intent, skill ✅ 0.06 [2]
test-anti-patterns Detect duplicated tests and magic values 3.0/5 → 5.0/5 🟢 ✅ test-anti-patterns; tools: report_intent, skill / ⚠️ NOT ACTIVATED ✅ 0.06
test-anti-patterns Recognize well-written tests without inventing false positives 2.0/5 → 5.0/5 🟢 ✅ test-anti-patterns; tools: report_intent, skill / ✅ test-anti-patterns; tools: skill, report_intent ✅ 0.06
exp-test-maintainability Recommend data-driven patterns with display names for unclear parameters 4.0/5 → 4.0/5 ⚠️ NOT ACTIVATED ✅ 0.08
exp-test-maintainability Recognize well-maintained tests that need minimal changes 5.0/5 → 5.0/5 ✅ exp-test-maintainability; tools: report_intent, skill ✅ 0.08 [3]
exp-test-maintainability Detect repeated object construction and setup across test methods 3.0/5 → 4.7/5 🟢 ✅ exp-test-maintainability; tools: skill, glob / ✅ exp-test-maintainability; tools: skill ✅ 0.08 [4]
exp-test-maintainability Recognize tests with minimal boilerplate that need no refactoring 4.0/5 → 4.7/5 🟢 ✅ exp-test-maintainability; tools: skill ✅ 0.08 [5]
exp-assertion-quality Identify low assertion diversity in equality-dominated test suite 4.7/5 → 5.0/5 🟢 ✅ exp-assertion-quality; tools: skill / ✅ exp-assertion-quality; tools: skill, glob ✅ 0.11 [6]
exp-assertion-quality Flag assertion-free tests and trivial-only assertions 3.0/5 → 4.0/5 🟢 ✅ exp-assertion-quality; tools: skill ✅ 0.11
exp-assertion-quality Recognize well-diversified assertion usage 3.0/5 → 4.3/5 🟢 ✅ exp-assertion-quality; tools: skill / ✅ exp-assertion-quality; tools: skill, glob ✅ 0.11
exp-assertion-quality Decline request to write new tests from scratch 1.7/5 ⏰ → 1.3/5 ⏰ 🔴 ℹ️ not activated (expected) ✅ 0.11 [7]
mtp-hot-reload Suggest hot reload for failing test in MTP project (SDK 9) 1.0/5 → 2.7/5 ⏰ 🟢 ✅ mtp-hot-reload; tools: skill, read_bash ✅ 0.14
mtp-hot-reload Suggest hot reload for failing test in MTP project (SDK 10) 1.0/5 → 4.7/5 🟢 ✅ mtp-hot-reload; tools: skill, bash, create, glob / ✅ mtp-hot-reload; tools: skill, bash, create ✅ 0.14
mtp-hot-reload Enable hot reload when package already installed 2.0/5 → 5.0/5 🟢 ✅ mtp-hot-reload; tools: skill ✅ 0.14
mtp-hot-reload Suggest launchSettings.json configuration for hot reload 1.0/5 → 4.0/5 🟢 ✅ mtp-hot-reload; tools: skill, glob, create, bash / ✅ mtp-hot-reload; tools: skill, create, bash ✅ 0.14
mtp-hot-reload Use dotnet run not dotnet test for hot reload 2.3/5 → 3.3/5 🟢 ✅ mtp-hot-reload; tools: skill ✅ 0.14 [8]
mtp-hot-reload Negative: VSTest project cannot use MTP hot reload 1.7/5 → 2.7/5 🟢 ✅ mtp-hot-reload; tools: skill, create ✅ 0.14 [9]
mtp-hot-reload Run specific failing test with hot reload filter 1.0/5 → 5.0/5 🟢 ✅ mtp-hot-reload; tools: skill / ✅ mtp-hot-reload; platform-detection; tools: skill, bash ✅ 0.14
run-tests Run tests in a VSTest MSTest project 4.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, glob ✅ 0.18
run-tests Run tests with trx reporting on MTP project (SDK 9) 3.7/5 → 3.3/5 ⏰ 🔴 ✅ run-tests; tools: skill / ✅ run-tests; tools: skill, glob ✅ 0.18 [10]
run-tests Run tests with blame-hang on MTP project (SDK 10) 2.0/5 → 2.7/5 🟢 ✅ run-tests; tools: skill, bash, edit / ⚠️ NOT ACTIVATED ✅ 0.18 [11]
run-tests Run tests in a multi-TFM project targeting a specific framework 2.0/5 → 4.3/5 🟢 ✅ run-tests; tools: skill, glob, bash, read_bash / ✅ run-tests; tools: skill, glob, bash ✅ 0.18
run-tests Filter MSTest tests by category on VSTest 5.0/5 → 5.0/5 ✅ run-tests; tools: skill, glob / ✅ run-tests; tools: skill ✅ 0.18 [12]
run-tests Filter NUnit tests by class name on VSTest 4.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, glob / ✅ run-tests; tools: skill, bash ✅ 0.18 [13]
run-tests Filter xUnit v3 tests by class on MTP 1.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, bash / ✅ run-tests; tools: skill ✅ 0.18 [14]
run-tests Filter xUnit v3 tests by trait on MTP 1.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, view ✅ 0.18
run-tests Filter TUnit tests by class using treenode-filter 3.0/5 → 4.0/5 🟢 ✅ run-tests; tools: skill, bash / ⚠️ NOT ACTIVATED ✅ 0.18 [15]
run-tests Combine multiple filter criteria on VSTest MSTest 4.7/5 → 4.7/5 ✅ run-tests; tools: skill / ✅ run-tests; tools: skill, glob ✅ 0.18 [16]
run-tests MTP project on SDK 9 must use -- separator for args 1.3/5 → 4.3/5 🟢 ✅ run-tests; tools: skill / ✅ run-tests; tools: edit, skill ✅ 0.18
run-tests MTP project on SDK 10 passes args directly 1.7/5 ⏰ → 2.3/5 ⏰ 🟢 ✅ run-tests; tools: skill, create / ⚠️ NOT ACTIVATED ✅ 0.18 [17]
run-tests Detect test platform from Directory.Build.props 2.3/5 → 5.0/5 🟢 ✅ run-tests; tools: skill ✅ 0.18 [18]
run-tests Negative test: do not use MTP syntax for a VSTest project 4.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, view, glob ✅ 0.18 [19]
migrate-vstest-to-mtp Migrate MSTest project from VSTest to Microsoft.Testing.Platform 3.7/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: report_intent, skill ✅ 0.11 [20]
migrate-vstest-to-mtp Migrate NUnit project from VSTest to Microsoft.Testing.Platform 1.7/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill / ✅ migrate-vstest-to-mtp; tools: skill, report_intent ✅ 0.11
migrate-vstest-to-mtp Migrate xUnit.net v2 project from VSTest to Microsoft.Testing.Platform 2.0/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill, report_intent, glob, bash, view / ✅ migrate-vstest-to-mtp; tools: skill, report_intent, view, bash ✅ 0.11
migrate-vstest-to-mtp Update Azure DevOps pipeline from VSTest task to MTP 2.3/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill / ✅ migrate-vstest-to-mtp; tools: skill, report_intent ✅ 0.11
migrate-vstest-to-mtp Migrate MSTest.Sdk project that explicitly uses VSTest 3.0/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill ✅ 0.11
migrate-vstest-to-mtp Translate dotnet test VSTest arguments to MTP equivalents 4.0/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill ✅ 0.11 [21]
migrate-vstest-to-mtp Handle exit code 8 when migrating from VSTest to MTP 3.0/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill, glob / ✅ migrate-vstest-to-mtp; tools: skill ✅ 0.11 [22]
migrate-vstest-to-mtp Configure dotnet test MTP mode on .NET 10 SDK 2.0/5 → 4.7/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill ✅ 0.11
migrate-vstest-to-mtp Migrate xUnit.net VSTest filter syntax to MTP 1.3/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill ✅ 0.11
migrate-vstest-to-mtp Full VSTest to MTP migration plan for MSTest solution 4.0/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill ✅ 0.11 [23]
exp-test-smell-detection Detect multiple test smells in order processing test suite 4.3/5 → 5.0/5 🟢 ✅ exp-test-smell-detection; tools: skill ✅ 0.05 [24]
exp-test-smell-detection Recognize well-written tests with no significant smells 3.3/5 → 4.7/5 🟢 ✅ exp-test-smell-detection; tools: skill ✅ 0.05
exp-test-smell-detection Recognize integration tests and avoid false positives for external resources 5.0/5 → 5.0/5 ✅ exp-test-smell-detection; tools: skill ✅ 0.05 [25]
exp-test-smell-detection Decline request to write new tests from scratch 4.7/5 → 4.7/5 ℹ️ not activated (expected) ✅ 0.05 [26]
exp-test-tagging Tag an untagged MSTest test suite 3.7/5 → 4.7/5 🟢 ✅ exp-test-tagging; tools: skill, glob 🟡 0.26
exp-test-tagging Tag an untagged xUnit test suite 3.7/5 → 5.0/5 🟢 ✅ exp-test-tagging; tools: skill, glob / ✅ exp-test-tagging; tools: skill, task, glob, read_agent 🟡 0.26
exp-test-tagging Tag an untagged NUnit test suite 4.0/5 → 4.7/5 🟢 ✅ exp-test-tagging; tools: skill, glob 🟡 0.26
exp-test-tagging Audit test distribution without modifying files 4.0/5 → 5.0/5 🟢 ✅ exp-test-tagging; tools: skill / ✅ exp-test-tagging; tools: skill, glob 🟡 0.26 [27]
exp-test-tagging Decline request to write new tests 4.0/5 → 4.0/5 ℹ️ not activated (expected) 🟡 0.26 [28]

[1] (Plugin) Quality unchanged but weighted score is -3.2% due to: tokens (13312 → 17063), time (18.4s → 26.3s), quality
[2] ⚠️ High run-to-run variance (CV=0.91) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=1.17) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -14.9% due to: judgment, tokens (20408 → 41086), tool calls (0 → 1), time (29.1s → 50.1s)
[4] ⚠️ High run-to-run variance (CV=9.10) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -2.0% due to: errors (0 → 1), time (80.3s → 105.6s)
[5] ⚠️ High run-to-run variance (CV=0.78) — consider re-running with --runs 5
[6] ⚠️ High run-to-run variance (CV=32.58) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -0.8% due to: tokens (47765 → 88954), time (43.7s → 93.5s), tool calls (4 → 7)
[7] ⚠️ High run-to-run variance (CV=1.13) — consider re-running with --runs 5
[8] ⚠️ High run-to-run variance (CV=0.66) — consider re-running with --runs 5
[9] ⚠️ High run-to-run variance (CV=1.61) — consider re-running with --runs 5
[10] ⚠️ High run-to-run variance (CV=1.65) — consider re-running with --runs 5. (Plugin) Quality dropped but weighted score is +3.4% due to: efficiency metrics
[11] ⚠️ High run-to-run variance (CV=0.75) — consider re-running with --runs 5
[12] (Plugin) Quality unchanged but weighted score is -5.0% due to: tokens (36232 → 59215), quality, time (28.2s → 35.0s)
[13] ⚠️ High run-to-run variance (CV=1.54) — consider re-running with --runs 5
[14] ⚠️ High run-to-run variance (CV=0.57) — consider re-running with --runs 5
[15] ⚠️ High run-to-run variance (CV=0.99) — consider re-running with --runs 5
[16] ⚠️ High run-to-run variance (CV=1.87) — consider re-running with --runs 5
[17] ⚠️ High run-to-run variance (CV=7.83) — consider re-running with --runs 5
[18] ⚠️ High run-to-run variance (CV=12.14) — consider re-running with --runs 5
[19] ⚠️ High run-to-run variance (CV=0.99) — consider re-running with --runs 5
[20] ⚠️ High run-to-run variance (CV=0.59) — consider re-running with --runs 5
[21] ⚠️ High run-to-run variance (CV=2.10) — consider re-running with --runs 5
[22] ⚠️ High run-to-run variance (CV=0.52) — consider re-running with --runs 5
[23] ⚠️ High run-to-run variance (CV=1.36) — consider re-running with --runs 5
[24] ⚠️ High run-to-run variance (CV=1.45) — consider re-running with --runs 5
[25] ⚠️ High run-to-run variance (CV=2.00) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -8.0% due to: tokens (45583 → 78530), time (41.1s → 104.7s), tool calls (4 → 7)
[26] ⚠️ High run-to-run variance (CV=5.81) — consider re-running with --runs 5
[27] ⚠️ High run-to-run variance (CV=3.78) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -64.6% due to: quality, judgment, tokens (86736 → 119886), time (70.2s → 116.4s), tool calls (10 → 14)
[28] ⚠️ High run-to-run variance (CV=1.33) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -54.3% due to: judgment, quality

⏰ timeout: one or more runs hit the scenario timeout limit (120s, 300s, or 360s); scoring may be affected because model execution was aborted before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

▶ Sessions Visualisation -- interactive replay of all evaluation sessions

- exp-test-maintainability: Add 'suggest a better test structure',
  'consolidate similar test methods', 'convert copy-paste tests to
  data-driven parameterized tests' to match prompts like 'each new case
  needs a whole new method, suggest a better structure'.

- test-anti-patterns: Add 'review tests', 'find test problems', 'check
  test quality', 'audit tests for common mistakes' to match review-style
  prompts that don't use the word 'anti-pattern'.

- run-tests: Add 'hang timeout', 'blame-hang', 'blame-crash', 'TUnit'
  to match SDK 10 blame scenarios and TUnit filter scenarios that were
  intermittently not activating.
@Evangelink
Member Author

/evaluate

@Evangelink Evangelink marked this pull request as ready for review April 7, 2026 15:53
Copilot AI review requested due to automatic review settings April 7, 2026 15:53
Contributor

Copilot AI left a comment

Pull request overview

This PR consolidates duplicated “test boilerplate detection” content into the existing exp-test-maintainability evaluation, and restructures .NET testing reference material by promoting platform/filter/framework references into dedicated (non-invocable) skills that other skills can link to.

Changes:

  • Moved the heavy/minimal boilerplate evaluation scenarios into tests/dotnet-experimental/exp-test-maintainability/eval.yaml and removed exp-test-boilerplate-detection eval/skill content.
  • Added new test fixture projects (Calculator.Tests, OrderService.Tests) to support maintainability scenarios.
  • Refactored skill documentation to reference shared platform-detection, filter-syntax, and test-framework reference skills, and updated CODEOWNERS accordingly.

| File | Description |
| --- | --- |
| tests/dotnet-experimental/exp-test-maintainability/fixtures/minimal-boilerplate/Calculator.Tests/CalculatorTests.cs | Adds “minimal boilerplate” MSTest sample tests used as an evaluation fixture. |
| tests/dotnet-experimental/exp-test-maintainability/fixtures/minimal-boilerplate/Calculator.Tests/Calculator.Tests.csproj | Adds MSTest fixture project metadata for the minimal-boilerplate scenario. |
| tests/dotnet-experimental/exp-test-maintainability/fixtures/heavy-boilerplate/OrderService.Tests/OrderService.Tests.csproj | Adds MSTest fixture project metadata for the heavy-boilerplate scenario. |
| tests/dotnet-experimental/exp-test-maintainability/fixtures/heavy-boilerplate/OrderService.Tests/OrderProcessorTests.cs | Adds “heavy boilerplate” test fixture to drive maintainability recommendations. |
| tests/dotnet-experimental/exp-test-maintainability/eval.yaml | Adds new scenarios for heavy/minimal boilerplate and related rubric/assertions. |
| tests/dotnet-experimental/exp-test-boilerplate-detection/eval.yaml | Removes now-redundant eval scenarios after consolidation. |
| plugins/dotnet-test/skills/test-anti-patterns/SKILL.md | Updates positioning and cross-links to related skills/framework reference skill. |
| plugins/dotnet-test/skills/run-tests/SKILL.md | Updates detection guidance and references platform-detection / filter-syntax skills. |
| plugins/dotnet-test/skills/run-tests/references/platform-detection.md | Deleted in favor of the platform-detection skill. |
| plugins/dotnet-test/skills/platform-detection/SKILL.md | Adds skill frontmatter to make platform detection a shared reference skill. |
| plugins/dotnet-test/skills/mtp-hot-reload/SKILL.md | Updates references to the centralized platform-detection / filter-syntax skills. |
| plugins/dotnet-test/skills/mtp-hot-reload/references/filter-syntax.md | Deleted in favor of the filter-syntax skill. |
| plugins/dotnet-test/skills/migrate-vstest-to-mtp/SKILL.md | Updates references to centralized platform/filter reference skills. |
| plugins/dotnet-test/skills/migrate-vstest-to-mtp/references/platform-detection.md | Deleted in favor of the platform-detection skill. |
| plugins/dotnet-test/skills/migrate-vstest-to-mtp/references/filter-syntax.md | Deleted in favor of the filter-syntax skill. |
| plugins/dotnet-test/skills/filter-syntax/SKILL.md | Adds skill frontmatter to make filter syntax a shared reference skill. |
| plugins/dotnet-test/skills/dotnet-test-frameworks/SKILL.md | Adds a centralized, non-invocable framework reference skill for non-experimental skills. |
| plugins/dotnet-experimental/skills/exp-test-tagging/SKILL.md | Switches framework detection guidance to reference exp-dotnet-test-frameworks. |
| plugins/dotnet-experimental/skills/exp-test-smell-detection/SKILL.md | Updates references from old boilerplate skill to exp-test-maintainability and framework reference skill. |
| plugins/dotnet-experimental/skills/exp-test-maintainability/SKILL.md | Expands/clarifies maintainability scope to include duplication/boilerplate detection. |
| plugins/dotnet-experimental/skills/exp-test-boilerplate-detection/SKILL.md | Removes redundant skill after consolidation into exp-test-maintainability. |
| plugins/dotnet-experimental/skills/exp-dotnet-test-frameworks/SKILL.md | Adds skill frontmatter to make it a shared reference skill for experimental skills. |
| plugins/dotnet-experimental/skills/exp-assertion-quality/SKILL.md | Updates scanning guidance to reference exp-dotnet-test-frameworks. |
| .github/CODEOWNERS | Updates ownership entries to reflect added/removed experimental skill paths. |

Copilot's findings

  • Files reviewed: 20/24 changed files
  • Comments generated: 0

github-actions Bot added a commit that referenced this pull request Apr 7, 2026
github-actions Bot added a commit that referenced this pull request Apr 7, 2026

github-actions Bot commented Apr 7, 2026

Skill Validation Results

| Skill | Scenario | Quality | Skills Loaded | Overfit | Verdict |
| --- | --- | --- | --- | --- | --- |
| test-anti-patterns | Detect mixed severity anti-patterns in repository service tests | 5.0/5 → 5.0/5 | ✅ test-anti-patterns; tools: report_intent, skill / ⚠️ NOT ACTIVATED | ✅ 0.06 | [1] |
| test-anti-patterns | Detect flakiness indicators and test coupling | 3.0/5 → 5.0/5 🟢 | ✅ test-anti-patterns; tools: report_intent, skill / ⚠️ NOT ACTIVATED | ✅ 0.06 | |
| test-anti-patterns | Detect duplicated tests and magic values | 3.0/5 → 5.0/5 🟢 | ✅ test-anti-patterns; tools: skill, report_intent / ⚠️ NOT ACTIVATED | ✅ 0.06 | [2] |
| test-anti-patterns | Recognize well-written tests without inventing false positives | 2.0/5 → 5.0/5 🟢 | ✅ test-anti-patterns; tools: report_intent, skill / ✅ test-anti-patterns; tools: skill, report_intent | ✅ 0.06 | |
| exp-test-maintainability | Recommend data-driven patterns with display names for unclear parameters | 4.0/5 → 4.3/5 🟢 | ✅ exp-test-maintainability; tools: report_intent, skill / ⚠️ NOT ACTIVATED | ✅ 0.11 | [3] |
| exp-test-maintainability | Recognize well-maintained tests that need minimal changes | 4.3/5 → 5.0/5 🟢 | ⚠️ NOT ACTIVATED / ✅ exp-test-maintainability; tools: report_intent, skill | ✅ 0.11 | [4] |
| exp-test-maintainability | Detect repeated object construction and setup across test methods | 3.0/5 → 4.7/5 🟢 | ✅ exp-test-maintainability; tools: skill, glob / ✅ exp-test-maintainability; tools: skill | ✅ 0.11 | |
| exp-test-maintainability | Recognize tests with minimal boilerplate that need no refactoring | 3.3/5 → 4.7/5 🟢 | ✅ exp-test-maintainability; tools: skill | ✅ 0.11 | [5] |
| exp-assertion-quality | Identify low assertion diversity in equality-dominated test suite | 3.3/5 ⏰ → 4.7/5 ⏰ 🟢 | ✅ exp-assertion-quality; tools: skill / ✅ exp-assertion-quality; tools: skill, glob | ✅ 0.09 | [6] |
| exp-assertion-quality | Flag assertion-free tests and trivial-only assertions | 3.3/5 → 4.3/5 🟢 | ✅ exp-assertion-quality; tools: skill | ✅ 0.09 | [7] |
| exp-assertion-quality | Recognize well-diversified assertion usage | 3.0/5 → 4.7/5 🟢 | ✅ exp-assertion-quality; tools: skill | ✅ 0.09 | |
| exp-assertion-quality | Decline request to write new tests from scratch | 1.7/5 ⏰ → 1.0/5 ⏰ 🔴 | ℹ️ not activated (expected) | ✅ 0.09 | [8] |
| mtp-hot-reload | Suggest hot reload for failing test in MTP project (SDK 9) | 1.0/5 → 2.7/5 ⏰ 🟢 | ✅ mtp-hot-reload; tools: skill, report_intent, view, bash, read_bash, edit, glob | ✅ 0.10 | |
| mtp-hot-reload | Suggest hot reload for failing test in MTP project (SDK 10) | 1.0/5 → 4.0/5 🟢 | ✅ mtp-hot-reload; tools: skill, bash, create, glob / ✅ mtp-hot-reload; tools: skill, bash, create | ✅ 0.10 | |
| mtp-hot-reload | Enable hot reload when package already installed | 2.0/5 → 5.0/5 🟢 | ✅ mtp-hot-reload; tools: skill | ✅ 0.10 | |
| mtp-hot-reload | Suggest launchSettings.json configuration for hot reload | 1.0/5 → 4.0/5 🟢 | ✅ mtp-hot-reload; tools: skill, bash, create | ✅ 0.10 | |
| mtp-hot-reload | Use dotnet run not dotnet test for hot reload | 2.3/5 → 3.3/5 🟢 | ✅ mtp-hot-reload; tools: skill / ✅ mtp-hot-reload; tools: skill, report_intent | ✅ 0.10 | [9] |
| mtp-hot-reload | Negative: VSTest project cannot use MTP hot reload | 1.7/5 → 2.0/5 🟢 | ✅ mtp-hot-reload; tools: skill, create | ✅ 0.10 | |
| mtp-hot-reload | Run specific failing test with hot reload filter | 1.0/5 → 4.0/5 🟢 | ✅ mtp-hot-reload; tools: skill | ✅ 0.10 | |
| run-tests | Run tests in a VSTest MSTest project | 4.0/5 → 4.0/5 | ✅ run-tests; tools: skill, glob | ✅ 0.14 | [10] |
| run-tests | Run tests with trx reporting on MTP project (SDK 9) | 3.7/5 → 3.3/5 ⏰ 🔴 | ✅ run-tests; tools: skill | ✅ 0.14 | [11] |
| run-tests | Run tests with blame-hang on MTP project (SDK 10) | 2.3/5 → 4.0/5 ⏰ 🟢 | ✅ run-tests; tools: skill, bash, edit | ✅ 0.14 | [12] |
| run-tests | Run tests in a multi-TFM project targeting a specific framework | 2.0/5 → 4.3/5 🟢 | ✅ run-tests; tools: bash, skill, glob / ⚠️ NOT ACTIVATED | ✅ 0.14 | |
| run-tests | Filter MSTest tests by category on VSTest | 5.0/5 → 5.0/5 | ⚠️ NOT ACTIVATED | ✅ 0.14 | [13] |
| run-tests | Filter NUnit tests by class name on VSTest | 3.7/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill, glob, bash / ⚠️ NOT ACTIVATED | ✅ 0.14 | |
| run-tests | Filter xUnit v3 tests by class on MTP | 1.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill / ⚠️ NOT ACTIVATED | ✅ 0.14 | [14] |
| run-tests | Filter xUnit v3 tests by trait on MTP | 1.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill, view | ✅ 0.14 | |
| run-tests | Filter TUnit tests by class using treenode-filter | 1.7/5 → 4.3/5 🟢 | ✅ run-tests; tools: skill, glob, bash / ⚠️ NOT ACTIVATED | ✅ 0.14 | |
| run-tests | Combine multiple filter criteria on VSTest MSTest | 4.7/5 → 4.3/5 🔴 | ⚠️ NOT ACTIVATED / ✅ run-tests; tools: skill | ✅ 0.14 | [15] |
| run-tests | MTP project on SDK 9 must use -- separator for args | 1.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill | ✅ 0.14 | |
| run-tests | MTP project on SDK 10 passes args directly | 2.7/5 → 3.3/5 🟢 | ✅ run-tests; tools: skill | ✅ 0.14 | [16] |
| run-tests | Detect test platform from Directory.Build.props | 1.3/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill | ✅ 0.14 | [17] |
| run-tests | Negative test: do not use MTP syntax for a VSTest project | 4.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill, view, glob / ✅ run-tests; tools: skill, view | ✅ 0.14 | [18] |
| migrate-vstest-to-mtp | Migrate MSTest project from VSTest to Microsoft.Testing.Platform | 4.7/5 → 5.0/5 🟢 | ✅ migrate-vstest-to-mtp; tools: skill / ✅ migrate-vstest-to-mtp; tools: report_intent, skill | ✅ 0.07 | [19] |
| migrate-vstest-to-mtp | Migrate NUnit project from VSTest to Microsoft.Testing.Platform | 2.0/5 → 5.0/5 🟢 | ✅ migrate-vstest-to-mtp; tools: skill | ✅ 0.07 | |
| migrate-vstest-to-mtp | Migrate xUnit.net v2 project from VSTest to Microsoft.Testing.Platform | 2.0/5 → 5.0/5 🟢 | ✅ migrate-vstest-to-mtp; tools: skill, report_intent, glob, bash, view / ✅ migrate-vstest-to-mtp; tools: skill | ✅ 0.07 | |
| migrate-vstest-to-mtp | Update Azure DevOps pipeline from VSTest task to MTP | 2.7/5 → 5.0/5 🟢 | ✅ migrate-vstest-to-mtp; tools: skill | ✅ 0.07 | |
| migrate-vstest-to-mtp | Migrate MSTest.Sdk project that explicitly uses VSTest | 3.0/5 → 5.0/5 🟢 | ✅ migrate-vstest-to-mtp; tools: skill | ✅ 0.07 | |
| migrate-vstest-to-mtp | Translate dotnet test VSTest arguments to MTP equivalents | 4.3/5 → 5.0/5 🟢 | ✅ migrate-vstest-to-mtp; tools: skill, report_intent / ✅ migrate-vstest-to-mtp; tools: skill | ✅ 0.07 | [20] |
| migrate-vstest-to-mtp | Handle exit code 8 when migrating from VSTest to MTP | 3.0/5 → 4.7/5 🟢 | ✅ migrate-vstest-to-mtp; tools: skill / ✅ migrate-vstest-to-mtp; tools: stop_bash, skill | ✅ 0.07 | [21] |
| migrate-vstest-to-mtp | Configure dotnet test MTP mode on .NET 10 SDK | 2.0/5 → 4.7/5 🟢 | ✅ migrate-vstest-to-mtp; tools: skill | ✅ 0.07 | |
| migrate-vstest-to-mtp | Migrate xUnit.net VSTest filter syntax to MTP | 2.0/5 → 5.0/5 🟢 | ✅ migrate-vstest-to-mtp; tools: skill | ✅ 0.07 | [22] |
| migrate-vstest-to-mtp | Full VSTest to MTP migration plan for MSTest solution | 2.7/5 → 5.0/5 🟢 | ✅ migrate-vstest-to-mtp; tools: skill | ✅ 0.07 | [23] |
| exp-test-smell-detection | Detect multiple test smells in order processing test suite | 4.0/5 → 5.0/5 🟢 | ✅ exp-test-smell-detection; tools: skill | ✅ 0.06 | |
| exp-test-smell-detection | Recognize well-written tests with no significant smells | 2.7/5 → 4.7/5 🟢 | ✅ exp-test-smell-detection; tools: skill | ✅ 0.06 | |
| exp-test-smell-detection | Recognize integration tests and avoid false positives for external resources | 5.0/5 → 5.0/5 | ✅ exp-test-smell-detection; tools: skill | ✅ 0.06 | [24] |
| exp-test-smell-detection | Decline request to write new tests from scratch | 4.7/5 → 4.7/5 | ℹ️ not activated (expected) | ✅ 0.06 | [25] |
| exp-test-tagging | Tag an untagged MSTest test suite | 3.7/5 → 4.7/5 🟢 | ✅ exp-test-tagging; tools: skill, glob / ✅ exp-test-tagging; tools: skill, glob, task, read_agent | ✅ 0.17 | |
| exp-test-tagging | Tag an untagged xUnit test suite | 3.7/5 → 4.3/5 🟢 | ✅ exp-test-tagging; tools: skill, glob / ✅ exp-test-tagging; tools: skill, task, read_agent, grep | ✅ 0.17 | [26] |
| exp-test-tagging | Tag an untagged NUnit test suite | 4.0/5 → 5.0/5 🟢 | ✅ exp-test-tagging; tools: skill, glob | ✅ 0.17 | |
| exp-test-tagging | Audit test distribution without modifying files | 4.0/5 → 5.0/5 🟢 | ✅ exp-test-tagging; tools: skill, glob | ✅ 0.17 | |
| exp-test-tagging | Decline request to write new tests | 4.3/5 → 4.0/5 🔴 | ℹ️ not activated (expected) | ✅ 0.17 | [27] |

[1] ⚠️ High run-to-run variance (CV=9.99) — consider re-running with --runs 5. (Plugin) Quality dropped but weighted score is +5.6% due to: completion (✗ → ✓)
[2] ⚠️ High run-to-run variance (CV=0.58) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=1.26) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -1.6% due to: tokens (18774 → 23110)
[4] ⚠️ High run-to-run variance (CV=1.47) — consider re-running with --runs 5
[5] ⚠️ High run-to-run variance (CV=0.90) — consider re-running with --runs 5
[6] ⚠️ High run-to-run variance (CV=1.21) — consider re-running with --runs 5
[7] ⚠️ High run-to-run variance (CV=0.52) — consider re-running with --runs 5
[8] ⚠️ High run-to-run variance (CV=1.58) — consider re-running with --runs 5
[9] ⚠️ High run-to-run variance (CV=0.91) — consider re-running with --runs 5
[10] ⚠️ High run-to-run variance (CV=0.59) — consider re-running with --runs 5
[11] ⚠️ High run-to-run variance (CV=21.88) — consider re-running with --runs 5
[12] ⚠️ High run-to-run variance (CV=4.73) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -60.0% due to: judgment, quality, tokens (34756 → 250897), tool calls (4 → 12), time (23.8s → 121.2s)
[13] ⚠️ High run-to-run variance (CV=1.00) — consider re-running with --runs 5
[14] ⚠️ High run-to-run variance (CV=0.55) — consider re-running with --runs 5
[15] ⚠️ High run-to-run variance (CV=5.30) — consider re-running with --runs 5
[16] ⚠️ High run-to-run variance (CV=1.46) — consider re-running with --runs 5
[17] ⚠️ High run-to-run variance (CV=1.70) — consider re-running with --runs 5
[18] ⚠️ High run-to-run variance (CV=1.02) — consider re-running with --runs 5
[19] ⚠️ High run-to-run variance (CV=10.48) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -0.9% due to: tokens (12739 → 36224), tool calls (0 → 2)
[20] ⚠️ High run-to-run variance (CV=2.48) — consider re-running with --runs 5
[21] ⚠️ High run-to-run variance (CV=0.51) — consider re-running with --runs 5
[22] ⚠️ High run-to-run variance (CV=0.54) — consider re-running with --runs 5
[23] ⚠️ High run-to-run variance (CV=0.59) — consider re-running with --runs 5
[24] ⚠️ High run-to-run variance (CV=4.00) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -7.8% due to: tokens (47540 → 83581), tool calls (4 → 8), time (34.6s → 89.3s)
[25] ⚠️ High run-to-run variance (CV=0.78) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -30.6% due to: quality, judgment, tokens (150151 → 213552), time (79.8s → 97.0s)
[26] ⚠️ High run-to-run variance (CV=0.56) — consider re-running with --runs 5
[27] ⚠️ High run-to-run variance (CV=2.36) — consider re-running with --runs 5
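
For readers unfamiliar with the CV figures in the footnotes above, here is a minimal sketch of a coefficient of variation: standard deviation divided by the absolute mean. The comment does not say how the validator computes CV (sample vs. population deviation, or which per-run quantity it measures), so the function and the sample values below are assumptions for illustration only.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples):
    """Return the sample standard deviation divided by the absolute mean."""
    m = mean(samples)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(samples) / abs(m)


# Hypothetical per-run score improvements for one scenario (three runs each).
steady = [2.0, 2.0, 2.1]   # runs agree closely
noisy = [2.0, -1.0, 0.5]   # runs disagree, and the mean is close to zero

print(round(coefficient_of_variation(steady), 2))
print(round(coefficient_of_variation(noisy), 2))
```

A small mean inflates the ratio quickly, which is one plausible reason some scenarios report CV values well above 1 and why more runs are suggested.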

⏰ timeout: run(s) hit the scenario timeout limit (120s, 300s, or 360s); scoring may be impacted because model execution was aborted before it could produce its full output (increase via `timeout` in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

▶ Sessions Visualisation -- interactive replay of all evaluation sessions

@Evangelink Evangelink merged commit 93dc33d into main Apr 8, 2026
36 checks passed
@Evangelink Evangelink deleted the dev/amauryleve/unify branch April 8, 2026 11:47
sayedihashimi pushed a commit to sayedihashimi/skills that referenced this pull request Apr 20, 2026
* Deduplicate test skill references and clarify skill boundaries

- Move platform-detection.md and filter-syntax.md to plugins/dotnet-test/shared/,
  removing 3 identical copies of each from run-tests, mtp-hot-reload, and
  migrate-vstest-to-mtp reference directories.

- Move dotnet.md from exp-test-smell-detection/extensions/ to shared/ as
  dotnet-test-frameworks.md in both dotnet-test and dotnet-experimental plugins.
  Update exp-assertion-quality, exp-test-boilerplate-detection, exp-test-tagging,
  and test-anti-patterns to reference the shared file instead of inlining
  framework detection tables.

- Differentiate test-anti-patterns (quick pragmatic review) from
  exp-test-smell-detection (deep formal audit with academic taxonomy) by updating
  descriptions and cross-referencing each other in When Not to Use sections.

- Update skill-validator to allow ../../shared/ file references while still
  blocking other parent-directory traversals. Add tests for the new rule.

* Switch from shared/ directories to hidden reference skills

Replace the plugin-level shared/ directories with non-invocable reference
skills (user-invocable: false) that other skills reference by name.

- Create platform-detection, filter-syntax, and dotnet-test-frameworks as
  hidden skills under plugins/dotnet-test/skills/. These contain the
  detection tables and syntax references previously duplicated across
  run-tests, mtp-hot-reload, and migrate-vstest-to-mtp.

- Create exp-dotnet-test-frameworks as a hidden skill under
  plugins/dotnet-experimental/skills/ for the experimental test analysis
  skills (exp-test-smell-detection, exp-assertion-quality, etc.).

- Update all consuming skills to reference these by skill name in backtick
  notation instead of file links.

- Revert the skill-validator ../../shared/ exception — no longer needed
  since all references now use the standard skill name mechanism.
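
As a rough illustration of the hidden reference skill idea described above, the sketch below reads a SKILL.md frontmatter block and checks for `user-invocable: false`. Only that key comes from the commit message; the sample file content, the helper names, and the default of "invocable unless opted out" are assumptions, and the parser handles flat `key: value` pairs only.

```python
def parse_frontmatter(text):
    """Parse flat `key: value` pairs between the leading '---' fences."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_user_invocable(skill_md):
    # Assumption: a skill is invocable unless it opts out explicitly.
    value = parse_frontmatter(skill_md).get("user-invocable", "true")
    return value.lower() != "false"


# Hypothetical SKILL.md content for a reference-only skill.
REFERENCE_SKILL = """---
name: platform-detection
description: Reference tables for detecting VSTest vs MTP.
user-invocable: false
---
# Platform detection
"""

print(is_user_invocable(REFERENCE_SKILL))
```

Other skills would then point at `platform-detection` by name rather than by relative file path, which is what makes the earlier `../../shared/` validator exception unnecessary.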

* Merge exp-test-boilerplate-detection into exp-test-maintainability

exp-test-maintainability was only 6 calibration rules with no workflow.
exp-test-boilerplate-detection had the full 5-category detection workflow,
examples, calibration, and validation. Merge the boilerplate content into
exp-test-maintainability (the broader, more user-facing name) and add the
two unique maintainability rules (DisplayName guidance, DataRow vs
DynamicData preference) to Category 3.

- Replace exp-test-maintainability SKILL.md with the merged content
- Move test fixtures from exp-test-boilerplate-detection to exp-test-maintainability
- Merge eval.yaml scenarios (4 total: 2 from each original skill)
- Delete exp-test-boilerplate-detection skill and tests
- Update all cross-references in exp-test-smell-detection, dotnet-test-frameworks,
  exp-dotnet-test-frameworks, and CODEOWNERS

* Add cross-references to test-anti-patterns for deep mock and duplication analysis

Point users to exp-mock-usage-analysis from the Over-mocking entry and
to exp-test-maintainability from the Duplicate tests entry.

* Add exp-dotnet-test-frameworks to CODEOWNERS

* Improve run-tests SDK 10 MTP detection for blame-hang scenario

Inline the critical SDK 10 detection signal (global.json test.runner)
directly in Step 1 instead of deferring entirely to the platform-detection
skill. This makes the distinction between SDK 10 (no -- separator) and
SDK 8/9 (requires -- separator) more prominent.

Add a quick detection summary table, strengthen the Common Pitfalls entry
for SDK 10 with a blame-hang-timeout example, and keep the
platform-detection skill reference for the full detection logic.
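
The separator rule above can be sketched as follows, assuming the project is already known to use MTP. The check reads the `test.runner` value from global.json; the expected value `Microsoft.Testing.Platform` and the `--report-trx` argument reflect my understanding of the .NET 10 SDK and the TRX extension, not text from this PR, and the function names are hypothetical.

```python
import json


def uses_sdk10_mtp_mode(global_json_text):
    """True when global.json opts `dotnet test` into the MTP runner."""
    config = json.loads(global_json_text)
    runner = config.get("test", {}).get("runner", "")
    return runner == "Microsoft.Testing.Platform"


def build_test_command(global_json_text, platform_args):
    """Build a `dotnet test` argument list for an MTP project."""
    command = ["dotnet", "test"]
    # Without the SDK 10 opt-in (the SDK 8/9 case), platform arguments
    # must follow a `--` separator.
    if platform_args and not uses_sdk10_mtp_mode(global_json_text):
        command.append("--")
    return command + list(platform_args)


sdk10 = '{"test": {"runner": "Microsoft.Testing.Platform"}}'
sdk9 = '{"sdk": {"version": "9.0.100"}}'

print(build_test_command(sdk10, ["--report-trx"]))
print(build_test_command(sdk9, ["--report-trx"]))
```

The sketch deliberately ignores VSTest projects, which take neither the separator nor MTP arguments; the full detection logic stays in the platform-detection skill.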

* Improve skill activation keywords in descriptions

- exp-test-maintainability: Add 'suggest a better test structure',
  'consolidate similar test methods', 'convert copy-paste tests to
  data-driven parameterized tests' to match prompts like 'each new case
  needs a whole new method, suggest a better structure'.

- test-anti-patterns: Add 'review tests', 'find test problems', 'check
  test quality', 'audit tests for common mistakes' to match review-style
  prompts that don't use the word 'anti-pattern'.

- run-tests: Add 'hang timeout', 'blame-hang', 'blame-crash', 'TUnit'
  to match SDK 10 blame scenarios and TUnit filter scenarios that were
  intermittently not activating.

3 participants