Stabilize and unify some test skills by Evangelink · Pull Request #501 · dotnet/skills

Evangelink · 2026-04-07T12:51:35Z

No description provided.

- Move platform-detection.md and filter-syntax.md to plugins/dotnet-test/shared/, removing 3 identical copies of each from run-tests, mtp-hot-reload, and migrate-vstest-to-mtp reference directories.
- Move dotnet.md from exp-test-smell-detection/extensions/ to shared/ as dotnet-test-frameworks.md in both dotnet-test and dotnet-experimental plugins. Update exp-assertion-quality, exp-test-boilerplate-detection, exp-test-tagging, and test-anti-patterns to reference the shared file instead of inlining framework detection tables.
- Differentiate test-anti-patterns (quick pragmatic review) from exp-test-smell-detection (deep formal audit with academic taxonomy) by updating descriptions and cross-referencing each other in When Not to Use sections.
- Update skill-validator to allow ../../shared/ file references while still blocking other parent-directory traversals. Add tests for the new rule.
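The validator rule in the last bullet (allow `../../shared/` but block every other parent-directory traversal) could look roughly like the sketch below. This is not the actual skill-validator code, which is not shown in this conversation; the function name `is_reference_allowed` and the decision to reject absolute paths are assumptions made for illustration. Note that the following commit reverts this exception.

```python
from pathlib import PurePosixPath

# The single parent-directory prefix the commit message says is allowed.
ALLOWED_PARENT_PREFIX = ("..", "..", "shared")


def is_reference_allowed(ref: str) -> bool:
    """Hypothetical check for a relative file reference found in a skill."""
    path = PurePosixPath(ref)
    if path.is_absolute():
        return False  # assumption: absolute paths are never valid references
    parts = path.parts
    if ".." not in parts:
        return True  # reference stays inside the skill directory
    # A traversal is only acceptable as exactly ../../shared/<something>.
    if parts[:3] != ALLOWED_PARENT_PREFIX or len(parts) < 4:
        return False
    # Do not allow climbing back out of shared/ afterwards.
    return ".." not in parts[3:]
```

With this shape, `../../shared/filter-syntax.md` passes, while `../run-tests/SKILL.md` and `../../shared/../other.md` are rejected.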

Replace the plugin-level shared/ directories with non-invocable reference skills (user-invocable: false) that other skills reference by name.

- Create platform-detection, filter-syntax, and dotnet-test-frameworks as hidden skills under plugins/dotnet-test/skills/. These contain the detection tables and syntax references previously duplicated across run-tests, mtp-hot-reload, and migrate-vstest-to-mtp.
- Create exp-dotnet-test-frameworks as a hidden skill under plugins/dotnet-experimental/skills/ for the experimental test analysis skills (exp-test-smell-detection, exp-assertion-quality, etc.).
- Update all consuming skills to reference these by skill name in backtick notation instead of file links.
- Revert the skill-validator ../../shared/ exception, which is no longer needed since all references now use the standard skill name mechanism.

exp-test-maintainability was only 6 calibration rules with no workflow. exp-test-boilerplate-detection had the full 5-category detection workflow, examples, calibration, and validation. Merge the boilerplate content into exp-test-maintainability (the broader, more user-facing name) and add the two unique maintainability rules (DisplayName guidance, DataRow vs DynamicData preference) to Category 3.

- Replace exp-test-maintainability SKILL.md with the merged content
- Move test fixtures from exp-test-boilerplate-detection to exp-test-maintainability
- Merge eval.yaml scenarios (4 total: 2 from each original skill)
- Delete exp-test-boilerplate-detection skill and tests
- Update all cross-references in exp-test-smell-detection, dotnet-test-frameworks, exp-dotnet-test-frameworks, and CODEOWNERS

…ion analysis Point users to exp-mock-usage-analysis from the Over-mocking entry and to exp-test-maintainability from the Duplicate tests entry.

Evangelink · 2026-04-07T13:03:25Z

/evaluate

github-actions · 2026-04-07T13:19:55Z

Skill Validation Results

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
test-anti-patterns	Detect mixed severity anti-patterns in repository service tests	5.0/5 → 5.0/5	✅ test-anti-patterns; tools: report_intent, skill / ⚠️ NOT ACTIVATED	✅ 0.06	✅ [1]
test-anti-patterns	Detect flakiness indicators and test coupling	2.7/5 → 4.7/5 🟢	✅ test-anti-patterns; tools: report_intent, skill / ⚠️ NOT ACTIVATED	✅ 0.06	✅
test-anti-patterns	Detect duplicated tests and magic values	3.0/5 → 5.0/5 🟢	✅ test-anti-patterns; tools: skill, report_intent / ✅ writing-mstest-tests; test-anti-patterns; tools: report_intent, skill	✅ 0.06	✅
test-anti-patterns	Recognize well-written tests without inventing false positives	2.0/5 → 5.0/5 🟢	✅ test-anti-patterns; tools: report_intent, skill	✅ 0.06	✅
exp-test-maintainability	Recommend data-driven patterns with display names for unclear parameters	4.0/5 → 3.7/5 🔴	⚠️ NOT ACTIVATED	✅ 0.11	❌ [2]
exp-test-maintainability	Recognize well-maintained tests that need minimal changes	4.3/5 → 4.7/5 🟢	✅ exp-test-maintainability; tools: skill, report_intent / ✅ exp-test-maintainability; tools: report_intent, skill	✅ 0.11	✅ [3]
exp-test-maintainability	Detect repeated object construction and setup across test methods	3.0/5 → 4.3/5 🟢	✅ exp-test-maintainability; tools: skill, glob / ✅ exp-test-maintainability; tools: skill	✅ 0.11	✅
exp-test-maintainability	Recognize tests with minimal boilerplate that need no refactoring	2.3/5 → 4.3/5 🟢	✅ exp-test-maintainability; tools: skill / ✅ exp-test-maintainability; tools: skill, glob	✅ 0.11	✅ [4]
exp-assertion-quality	Identify low assertion diversity in equality-dominated test suite	4.0/5 → 5.0/5 🟢	✅ exp-assertion-quality; tools: skill	✅ 0.12	✅
exp-assertion-quality	Flag assertion-free tests and trivial-only assertions	3.7/5 → 4.0/5 🟢	✅ exp-assertion-quality; tools: skill	✅ 0.12	✅ [5]
exp-assertion-quality	Recognize well-diversified assertion usage	2.7/5 → 5.0/5 🟢	✅ exp-assertion-quality; tools: skill	✅ 0.12	✅
exp-assertion-quality	Decline request to write new tests from scratch	2.3/5 ⏰ → 3.7/5 ⏰ 🟢	ℹ️ not activated (expected)	✅ 0.12	❌ [6]
mtp-hot-reload	Suggest hot reload for failing test in MTP project (SDK 9)	1.0/5 → 2.3/5 ⏰ 🟢	✅ mtp-hot-reload; tools: skill	✅ 0.10	✅
mtp-hot-reload	Suggest hot reload for failing test in MTP project (SDK 10)	1.0/5 → 4.3/5 🟢	✅ mtp-hot-reload; tools: skill, bash, create	✅ 0.10	✅
mtp-hot-reload	Enable hot reload when package already installed	2.0/5 → 5.0/5 🟢	✅ mtp-hot-reload; tools: skill	✅ 0.10	✅
mtp-hot-reload	Suggest launchSettings.json configuration for hot reload	1.0/5 ⏰ → 4.0/5 🟢	✅ mtp-hot-reload; tools: skill, bash, create, glob / ✅ mtp-hot-reload; tools: skill, bash, create	✅ 0.10	✅
mtp-hot-reload	Use dotnet run not dotnet test for hot reload	2.3/5 → 3.0/5 🟢	✅ mtp-hot-reload; tools: skill / ✅ mtp-hot-reload; tools: skill, report_intent	✅ 0.10	✅ [7]
mtp-hot-reload	Negative: VSTest project cannot use MTP hot reload	1.7/5 → 2.3/5 🟢	✅ mtp-hot-reload; tools: skill, create	✅ 0.10	✅ [8]
mtp-hot-reload	Run specific failing test with hot reload filter	1.0/5 → 5.0/5 🟢	✅ mtp-hot-reload; tools: skill / ✅ mtp-hot-reload; platform-detection; tools: skill, bash	✅ 0.10	✅
run-tests	Run tests in a VSTest MSTest project	4.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill	✅ 0.18	✅
run-tests	Run tests with trx reporting on MTP project (SDK 9)	3.7/5 → 3.7/5	✅ run-tests; tools: skill / ✅ run-tests; platform-detection; tools: skill	✅ 0.18	✅ [9]
run-tests	Run tests with blame-hang on MTP project (SDK 10)	2.7/5 → 2.0/5 ⏰ 🔴	✅ run-tests; tools: skill, bash, edit / ⚠️ NOT ACTIVATED	✅ 0.18	❌ [10]
run-tests	Run tests in a multi-TFM project targeting a specific framework	1.7/5 → 4.0/5 🟢	✅ run-tests; tools: skill, bash, read_bash, glob / ✅ run-tests; tools: skill, bash, glob	✅ 0.18	✅
run-tests	Filter MSTest tests by category on VSTest	5.0/5 → 5.0/5	✅ run-tests; tools: skill, bash, glob / ✅ run-tests; tools: skill, bash	✅ 0.18	✅ [11]
run-tests	Filter NUnit tests by class name on VSTest	4.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, bash / ✅ run-tests; tools: bash, skill	✅ 0.18	✅
run-tests	Filter xUnit v3 tests by class on MTP	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, bash / ✅ run-tests; tools: skill	✅ 0.18	✅
run-tests	Filter xUnit v3 tests by trait on MTP	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view / ✅ run-tests; platform-detection; filter-syntax; tools: skill, view	✅ 0.18	✅
run-tests	Filter TUnit tests by class using treenode-filter	2.3/5 → 4.3/5 🟢	✅ run-tests; tools: skill, bash / ✅ run-tests; filter-syntax; platform-detection; tools: skill, bash	✅ 0.18	✅
run-tests	Combine multiple filter criteria on VSTest MSTest	4.0/5 → 4.3/5 🟢	⚠️ NOT ACTIVATED / ✅ run-tests; tools: skill, bash	✅ 0.18	❌ [12]
run-tests	MTP project on SDK 9 must use -- separator for args	1.0/5 → 4.3/5 🟢	✅ run-tests; tools: skill	✅ 0.18	✅ [13]
run-tests	MTP project on SDK 10 passes args directly	3.0/5 → 4.0/5 🟢	✅ run-tests; tools: skill	✅ 0.18	❌ [14]
run-tests	Detect test platform from Directory.Build.props	1.3/5 → 5.0/5 🟢	✅ run-tests; tools: skill	✅ 0.18	✅ [15]
run-tests	Negative test: do not use MTP syntax for a VSTest project	4.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view / ⚠️ NOT ACTIVATED	✅ 0.18	❌ [16]
migrate-vstest-to-mtp	Migrate MSTest project from VSTest to Microsoft.Testing.Platform	4.7/5 → 5.0/5 🟢	✅ migrate-vstest-to-mtp; tools: skill / ✅ migrate-vstest-to-mtp; tools: skill, report_intent	✅ 0.07	❌ [17]
migrate-vstest-to-mtp	Migrate NUnit project from VSTest to Microsoft.Testing.Platform	2.0/5 → 5.0/5 🟢	✅ migrate-vstest-to-mtp; tools: skill	✅ 0.07	✅
migrate-vstest-to-mtp	Migrate xUnit.net v2 project from VSTest to Microsoft.Testing.Platform	1.7/5 → 4.3/5 🟢	✅ migrate-vstest-to-mtp; tools: skill, report_intent, bash, view / ✅ migrate-vstest-to-mtp; tools: skill	✅ 0.07	✅
migrate-vstest-to-mtp	Update Azure DevOps pipeline from VSTest task to MTP	2.0/5 → 5.0/5 🟢	✅ migrate-vstest-to-mtp; tools: skill	✅ 0.07	✅
migrate-vstest-to-mtp	Migrate MSTest.Sdk project that explicitly uses VSTest	3.0/5 → 5.0/5 🟢	✅ migrate-vstest-to-mtp; tools: skill	✅ 0.07	✅
migrate-vstest-to-mtp	Translate dotnet test VSTest arguments to MTP equivalents	4.0/5 → 5.0/5 🟢	✅ migrate-vstest-to-mtp; tools: skill, report_intent / ✅ migrate-vstest-to-mtp; tools: skill	✅ 0.07	✅ [18]
migrate-vstest-to-mtp	Handle exit code 8 when migrating from VSTest to MTP	3.3/5 → 5.0/5 🟢	✅ migrate-vstest-to-mtp; tools: skill	✅ 0.07	✅
migrate-vstest-to-mtp	Configure dotnet test MTP mode on .NET 10 SDK	2.0/5 → 5.0/5 🟢	✅ migrate-vstest-to-mtp; tools: skill	✅ 0.07	✅
migrate-vstest-to-mtp	Migrate xUnit.net VSTest filter syntax to MTP	1.3/5 → 5.0/5 🟢	✅ migrate-vstest-to-mtp; tools: skill	✅ 0.07	✅
migrate-vstest-to-mtp	Full VSTest to MTP migration plan for MSTest solution	4.0/5 → 5.0/5 🟢	✅ migrate-vstest-to-mtp; tools: skill / ✅ migrate-vstest-to-mtp; tools: skill, report_intent	✅ 0.07	✅ [19]
exp-test-smell-detection	Detect multiple test smells in order processing test suite	5.0/5 → 5.0/5	✅ exp-test-smell-detection; tools: skill	✅ 0.06	❌ [20]
exp-test-smell-detection	Recognize well-written tests with no significant smells	2.7/5 → 4.7/5 🟢	✅ exp-test-smell-detection; tools: skill	✅ 0.06	✅
exp-test-smell-detection	Recognize integration tests and avoid false positives for external resources	5.0/5 → 5.0/5	✅ exp-test-smell-detection; tools: skill	✅ 0.06	❌ [21]
exp-test-smell-detection	Decline request to write new tests from scratch	4.3/5 → 4.7/5 🟢	ℹ️ not activated (expected)	✅ 0.06	❌ [22]
exp-test-tagging	Tag an untagged MSTest test suite	3.3/5 → 4.3/5 🟢	✅ exp-test-tagging; tools: skill / ✅ exp-test-tagging; tools: skill, glob, task, read_agent	✅ 0.14	✅ [23]
exp-test-tagging	Tag an untagged xUnit test suite	4.0/5 → 4.7/5 🟢	✅ exp-test-tagging; tools: skill, glob / ✅ exp-test-tagging; tools: skill, glob, grep	✅ 0.14	✅
exp-test-tagging	Tag an untagged NUnit test suite	3.3/5 → 4.7/5 🟢	✅ exp-test-tagging; tools: skill / ✅ exp-test-tagging; tools: skill, glob	✅ 0.14	✅
exp-test-tagging	Audit test distribution without modifying files	4.0/5 → 5.0/5 🟢	✅ exp-test-tagging; tools: skill / ✅ exp-test-tagging; tools: skill, glob	✅ 0.14	✅
exp-test-tagging	Decline request to write new tests	4.0/5 → 3.7/5 🔴	ℹ️ not activated (expected)	✅ 0.14	❌ [24]

[1] ⚠️ High run-to-run variance (CV=4.66) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=0.84) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=10.92) — consider re-running with --runs 5
[4] ⚠️ High run-to-run variance (CV=0.55) — consider re-running with --runs 5
[5] ⚠️ High run-to-run variance (CV=0.96) — consider re-running with --runs 5
[6] ⚠️ High run-to-run variance (CV=3.01) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -23.1% due to: judgment, quality, tokens (111793 → 171384)
[7] ⚠️ High run-to-run variance (CV=0.65) — consider re-running with --runs 5
[8] ⚠️ High run-to-run variance (CV=1.08) — consider re-running with --runs 5
[9] ⚠️ High run-to-run variance (CV=1.27) — consider re-running with --runs 5
[10] ⚠️ High run-to-run variance (CV=0.62) — consider re-running with --runs 5
[11] ⚠️ High run-to-run variance (CV=10.81) — consider re-running with --runs 5
[12] ⚠️ High run-to-run variance (CV=1.31) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -16.6% due to: judgment, quality
[13] ⚠️ High run-to-run variance (CV=1.82) — consider re-running with --runs 5
[14] ⚠️ High run-to-run variance (CV=10.92) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -41.4% due to: judgment, tokens (265819 → 584220), quality, time (158.7s → 237.3s), tool calls (18 → 26)
[15] ⚠️ High run-to-run variance (CV=0.73) — consider re-running with --runs 5
[16] ⚠️ High run-to-run variance (CV=0.79) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -0.8% due to: tokens (29655 → 34602)
[17] ⚠️ High run-to-run variance (CV=15.62) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -1.1% due to: tokens (12832 → 36156), tool calls (0 → 1)
[18] ⚠️ High run-to-run variance (CV=2.33) — consider re-running with --runs 5
[19] ⚠️ High run-to-run variance (CV=1.12) — consider re-running with --runs 5
[20] ⚠️ High run-to-run variance (CV=0.74) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -2.1% due to: tokens (60191 → 104698), time (33.6s → 76.6s), tool calls (5 → 6)
[21] (Plugin) Quality unchanged but weighted score is -10.0% due to: tokens (51503 → 109935), tool calls (4 → 8), time (36.7s → 100.8s)
[22] ⚠️ High run-to-run variance (CV=0.64) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -15.3% due to: judgment, quality
[23] ⚠️ High run-to-run variance (CV=0.66) — consider re-running with --runs 5
[24] ⚠️ High run-to-run variance (CV=1.05) — consider re-running with --runs 5
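For readers unfamiliar with the CV figures in these footnotes: assuming CV here means the coefficient of variation (sample standard deviation divided by the absolute mean), it can be computed as in the sketch below. Which per-run quantity the validator measures, and the threshold it uses to print the warning, are not stated in this conversation, so the function and its inputs are purely illustrative.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples: list[float]) -> float:
    """Sample standard deviation divided by the absolute mean.

    Hypothetical helper: the real validator's input metric is not shown.
    """
    if len(samples) < 2:
        raise ValueError("need at least two runs to estimate variance")
    avg = mean(samples)
    if avg == 0:
        return float("inf")  # spread relative to a zero mean is unbounded
    return stdev(samples) / abs(avg)


# Identical runs have no spread at all.
stable = coefficient_of_variation([1.0, 1.0, 1.0])

# Runs that straddle zero have a small mean, so the ratio becomes large.
noisy = coefficient_of_variation([0.5, -0.4, 0.2])
```

One plausible reading of the very large values above (for example CV=10.92) is that the underlying per-run quantity averages close to zero, which inflates the ratio even when the absolute spread is modest. Adding runs, as the footnotes suggest with --runs 5, gives a more stable estimate of both the mean and the spread.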

⏰ timeout — run(s) hit the (120s, 180s, 300s, 360s) scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

▶ Sessions Visualisation -- interactive replay of all evaluation sessions

Inline the critical SDK 10 detection signal (global.json test.runner) directly in Step 1 instead of deferring entirely to the platform-detection skill. This makes the distinction between SDK 10 (no -- separator) and SDK 8/9 (requires -- separator) more prominent. Add a quick detection summary table, strengthen the Common Pitfalls entry for SDK 10 with a blame-hang-timeout example, and keep the platform-detection skill reference for the full detection logic.
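The SDK 10 versus SDK 8/9 distinction described above can be sketched as follows. This is not the skill's own text: the helper names are hypothetical, the expected `test.runner` value of `Microsoft.Testing.Platform` is an assumption based on the .NET 10 SDK's global.json setting, and `--report-trx` is only a placeholder for whatever MTP argument is being passed. The sketch also assumes the project has already been identified as an MTP project; the full detection logic stays in the platform-detection skill.

```python
import json
from pathlib import Path


def uses_sdk10_mtp_mode(repo_root: Path) -> bool:
    """True when global.json sets test.runner to Microsoft.Testing.Platform."""
    global_json = repo_root / "global.json"
    if not global_json.is_file():
        return False
    try:
        data = json.loads(global_json.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        # Simplification: a file with comments or syntax errors counts as "not set".
        return False
    test_section = data.get("test") if isinstance(data, dict) else None
    runner = test_section.get("runner") if isinstance(test_section, dict) else None
    return runner == "Microsoft.Testing.Platform"


def build_dotnet_test_command(repo_root: Path, mtp_args: list[str]) -> list[str]:
    """Build a dotnet test invocation for a project already known to use MTP."""
    command = ["dotnet", "test"]
    if mtp_args and not uses_sdk10_mtp_mode(repo_root):
        # SDK 8/9: platform arguments must follow the -- separator.
        command.append("--")
    return command + mtp_args
```

Without the global.json signal, the sketch yields `dotnet test -- --report-trx`; with it, the separator is dropped and the arguments are passed directly.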

Evangelink · 2026-04-07T14:19:28Z

/evaluate

github-actions · 2026-04-07T14:35:28Z

Skill Validation Results

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
test-anti-patterns	Detect mixed severity anti-patterns in repository service tests	5.0/5 → 5.0/5	✅ test-anti-patterns; tools: report_intent, skill / ⚠️ NOT ACTIVATED	✅ 0.06	❌ [1]
test-anti-patterns	Detect flakiness indicators and test coupling	2.7/5 → 4.7/5 🟢	✅ test-anti-patterns; tools: report_intent, skill	✅ 0.06	✅ [2]
test-anti-patterns	Detect duplicated tests and magic values	3.0/5 → 5.0/5 🟢	✅ test-anti-patterns; tools: report_intent, skill / ⚠️ NOT ACTIVATED	✅ 0.06	✅
test-anti-patterns	Recognize well-written tests without inventing false positives	2.0/5 → 5.0/5 🟢	✅ test-anti-patterns; tools: report_intent, skill / ✅ test-anti-patterns; tools: skill, report_intent	✅ 0.06	✅
exp-test-maintainability	Recommend data-driven patterns with display names for unclear parameters	4.0/5 → 4.0/5	⚠️ NOT ACTIVATED	✅ 0.08	❌
exp-test-maintainability	Recognize well-maintained tests that need minimal changes	5.0/5 → 5.0/5	✅ exp-test-maintainability; tools: report_intent, skill	✅ 0.08	❌ [3]
exp-test-maintainability	Detect repeated object construction and setup across test methods	3.0/5 → 4.7/5 🟢	✅ exp-test-maintainability; tools: skill, glob / ✅ exp-test-maintainability; tools: skill	✅ 0.08	❌ [4]
exp-test-maintainability	Recognize tests with minimal boilerplate that need no refactoring	4.0/5 → 4.7/5 🟢	✅ exp-test-maintainability; tools: skill	✅ 0.08	✅ [5]
exp-assertion-quality	Identify low assertion diversity in equality-dominated test suite	4.7/5 → 5.0/5 🟢	✅ exp-assertion-quality; tools: skill / ✅ exp-assertion-quality; tools: skill, glob	✅ 0.11	❌ [6]
exp-assertion-quality	Flag assertion-free tests and trivial-only assertions	3.0/5 → 4.0/5 🟢	✅ exp-assertion-quality; tools: skill	✅ 0.11	✅
exp-assertion-quality	Recognize well-diversified assertion usage	3.0/5 → 4.3/5 🟢	✅ exp-assertion-quality; tools: skill / ✅ exp-assertion-quality; tools: skill, glob	✅ 0.11	✅
exp-assertion-quality	Decline request to write new tests from scratch	1.7/5 ⏰ → 1.3/5 ⏰ 🔴	ℹ️ not activated (expected)	✅ 0.11	❌ [7]
mtp-hot-reload	Suggest hot reload for failing test in MTP project (SDK 9)	1.0/5 → 2.7/5 ⏰ 🟢	✅ mtp-hot-reload; tools: skill, read_bash	✅ 0.14	✅
mtp-hot-reload	Suggest hot reload for failing test in MTP project (SDK 10)	1.0/5 → 4.7/5 🟢	✅ mtp-hot-reload; tools: skill, bash, create, glob / ✅ mtp-hot-reload; tools: skill, bash, create	✅ 0.14	✅
mtp-hot-reload	Enable hot reload when package already installed	2.0/5 → 5.0/5 🟢	✅ mtp-hot-reload; tools: skill	✅ 0.14	✅
mtp-hot-reload	Suggest launchSettings.json configuration for hot reload	1.0/5 → 4.0/5 🟢	✅ mtp-hot-reload; tools: skill, glob, create, bash / ✅ mtp-hot-reload; tools: skill, create, bash	✅ 0.14	✅
mtp-hot-reload	Use dotnet run not dotnet test for hot reload	2.3/5 → 3.3/5 🟢	✅ mtp-hot-reload; tools: skill	✅ 0.14	✅ [8]
mtp-hot-reload	Negative: VSTest project cannot use MTP hot reload	1.7/5 → 2.7/5 🟢	✅ mtp-hot-reload; tools: skill, create	✅ 0.14	✅ [9]
mtp-hot-reload	Run specific failing test with hot reload filter	1.0/5 → 5.0/5 🟢	✅ mtp-hot-reload; tools: skill / ✅ mtp-hot-reload; platform-detection; tools: skill, bash	✅ 0.14	✅
run-tests	Run tests in a VSTest MSTest project	4.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, glob	✅ 0.18	✅
run-tests	Run tests with trx reporting on MTP project (SDK 9)	3.7/5 → 3.3/5 ⏰ 🔴	✅ run-tests; tools: skill / ✅ run-tests; tools: skill, glob	✅ 0.18	✅ [10]
run-tests	Run tests with blame-hang on MTP project (SDK 10)	2.0/5 → 2.7/5 🟢	✅ run-tests; tools: skill, bash, edit / ⚠️ NOT ACTIVATED	✅ 0.18	✅ [11]
run-tests	Run tests in a multi-TFM project targeting a specific framework	2.0/5 → 4.3/5 🟢	✅ run-tests; tools: skill, glob, bash, read_bash / ✅ run-tests; tools: skill, glob, bash	✅ 0.18	✅
run-tests	Filter MSTest tests by category on VSTest	5.0/5 → 5.0/5	✅ run-tests; tools: skill, glob / ✅ run-tests; tools: skill	✅ 0.18	❌ [12]
run-tests	Filter NUnit tests by class name on VSTest	4.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, glob / ✅ run-tests; tools: skill, bash	✅ 0.18	✅ [13]
run-tests	Filter xUnit v3 tests by class on MTP	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, bash / ✅ run-tests; tools: skill	✅ 0.18	✅ [14]
run-tests	Filter xUnit v3 tests by trait on MTP	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view	✅ 0.18	✅
run-tests	Filter TUnit tests by class using treenode-filter	3.0/5 → 4.0/5 🟢	✅ run-tests; tools: skill, bash / ⚠️ NOT ACTIVATED	✅ 0.18	✅ [15]
run-tests	Combine multiple filter criteria on VSTest MSTest	4.7/5 → 4.7/5	✅ run-tests; tools: skill / ✅ run-tests; tools: skill, glob	✅ 0.18	❌ [16]
run-tests	MTP project on SDK 9 must use -- separator for args	1.3/5 → 4.3/5 🟢	✅ run-tests; tools: skill / ✅ run-tests; tools: edit, skill	✅ 0.18	✅
run-tests	MTP project on SDK 10 passes args directly	1.7/5 ⏰ → 2.3/5 ⏰ 🟢	✅ run-tests; tools: skill, create / ⚠️ NOT ACTIVATED	✅ 0.18	✅ [17]
run-tests	Detect test platform from Directory.Build.props	2.3/5 → 5.0/5 🟢	✅ run-tests; tools: skill	✅ 0.18	❌ [18]
run-tests	Negative test: do not use MTP syntax for a VSTest project	4.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view, glob	✅ 0.18	✅ [19]
migrate-vstest-to-mtp	Migrate MSTest project from VSTest to Microsoft.Testing.Platform	3.7/5 → 5.0/5 🟢	✅ migrate-vstest-to-mtp; tools: report_intent, skill	✅ 0.11	✅ [20]
migrate-vstest-to-mtp	Migrate NUnit project from VSTest to Microsoft.Testing.Platform	1.7/5 → 5.0/5 🟢	✅ migrate-vstest-to-mtp; tools: skill / ✅ migrate-vstest-to-mtp; tools: skill, report_intent	✅ 0.11	✅
migrate-vstest-to-mtp	Migrate xUnit.net v2 project from VSTest to Microsoft.Testing.Platform	2.0/5 → 5.0/5 🟢	✅ migrate-vstest-to-mtp; tools: skill, report_intent, glob, bash, view / ✅ migrate-vstest-to-mtp; tools: skill, report_intent, view, bash	✅ 0.11	✅
migrate-vstest-to-mtp	Update Azure DevOps pipeline from VSTest task to MTP	2.3/5 → 5.0/5 🟢	✅ migrate-vstest-to-mtp; tools: skill / ✅ migrate-vstest-to-mtp; tools: skill, report_intent	✅ 0.11	✅
migrate-vstest-to-mtp	Migrate MSTest.Sdk project that explicitly uses VSTest	3.0/5 → 5.0/5 🟢	✅ migrate-vstest-to-mtp; tools: skill	✅ 0.11	✅
migrate-vstest-to-mtp	Translate dotnet test VSTest arguments to MTP equivalents	4.0/5 → 5.0/5 🟢	✅ migrate-vstest-to-mtp; tools: skill	✅ 0.11	✅ [21]
migrate-vstest-to-mtp	Handle exit code 8 when migrating from VSTest to MTP	3.0/5 → 5.0/5 🟢	✅ migrate-vstest-to-mtp; tools: skill, glob / ✅ migrate-vstest-to-mtp; tools: skill	✅ 0.11	✅ [22]
migrate-vstest-to-mtp	Configure dotnet test MTP mode on .NET 10 SDK	2.0/5 → 4.7/5 🟢	✅ migrate-vstest-to-mtp; tools: skill	✅ 0.11	✅
migrate-vstest-to-mtp	Migrate xUnit.net VSTest filter syntax to MTP	1.3/5 → 5.0/5 🟢	✅ migrate-vstest-to-mtp; tools: skill	✅ 0.11	✅
migrate-vstest-to-mtp	Full VSTest to MTP migration plan for MSTest solution	4.0/5 → 5.0/5 🟢	✅ migrate-vstest-to-mtp; tools: skill	✅ 0.11	✅ [23]
exp-test-smell-detection	Detect multiple test smells in order processing test suite	4.3/5 → 5.0/5 🟢	✅ exp-test-smell-detection; tools: skill	✅ 0.05	✅ [24]
exp-test-smell-detection	Recognize well-written tests with no significant smells	3.3/5 → 4.7/5 🟢	✅ exp-test-smell-detection; tools: skill	✅ 0.05	✅
exp-test-smell-detection	Recognize integration tests and avoid false positives for external resources	5.0/5 → 5.0/5	✅ exp-test-smell-detection; tools: skill	✅ 0.05	❌ [25]
exp-test-smell-detection	Decline request to write new tests from scratch	4.7/5 → 4.7/5	ℹ️ not activated (expected)	✅ 0.05	✅ [26]
exp-test-tagging	Tag an untagged MSTest test suite	3.7/5 → 4.7/5 🟢	✅ exp-test-tagging; tools: skill, glob	🟡 0.26	✅
exp-test-tagging	Tag an untagged xUnit test suite	3.7/5 → 5.0/5 🟢	✅ exp-test-tagging; tools: skill, glob / ✅ exp-test-tagging; tools: skill, task, glob, read_agent	🟡 0.26	✅
exp-test-tagging	Tag an untagged NUnit test suite	4.0/5 → 4.7/5 🟢	✅ exp-test-tagging; tools: skill, glob	🟡 0.26	✅
exp-test-tagging	Audit test distribution without modifying files	4.0/5 → 5.0/5 🟢	✅ exp-test-tagging; tools: skill / ✅ exp-test-tagging; tools: skill, glob	🟡 0.26	❌ [27]
exp-test-tagging	Decline request to write new tests	4.0/5 → 4.0/5	ℹ️ not activated (expected)	🟡 0.26	❌ [28]

[1] (Plugin) Quality unchanged but weighted score is -3.2% due to: tokens (13312 → 17063), time (18.4s → 26.3s), quality
[2] ⚠️ High run-to-run variance (CV=0.91) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=1.17) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -14.9% due to: judgment, tokens (20408 → 41086), tool calls (0 → 1), time (29.1s → 50.1s)
[4] ⚠️ High run-to-run variance (CV=9.10) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -2.0% due to: errors (0 → 1), time (80.3s → 105.6s)
[5] ⚠️ High run-to-run variance (CV=0.78) — consider re-running with --runs 5
[6] ⚠️ High run-to-run variance (CV=32.58) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -0.8% due to: tokens (47765 → 88954), time (43.7s → 93.5s), tool calls (4 → 7)
[7] ⚠️ High run-to-run variance (CV=1.13) — consider re-running with --runs 5
[8] ⚠️ High run-to-run variance (CV=0.66) — consider re-running with --runs 5
[9] ⚠️ High run-to-run variance (CV=1.61) — consider re-running with --runs 5
[10] ⚠️ High run-to-run variance (CV=1.65) — consider re-running with --runs 5. (Plugin) Quality dropped but weighted score is +3.4% due to: efficiency metrics
[11] ⚠️ High run-to-run variance (CV=0.75) — consider re-running with --runs 5
[12] (Plugin) Quality unchanged but weighted score is -5.0% due to: tokens (36232 → 59215), quality, time (28.2s → 35.0s)
[13] ⚠️ High run-to-run variance (CV=1.54) — consider re-running with --runs 5
[14] ⚠️ High run-to-run variance (CV=0.57) — consider re-running with --runs 5
[15] ⚠️ High run-to-run variance (CV=0.99) — consider re-running with --runs 5
[16] ⚠️ High run-to-run variance (CV=1.87) — consider re-running with --runs 5
[17] ⚠️ High run-to-run variance (CV=7.83) — consider re-running with --runs 5
[18] ⚠️ High run-to-run variance (CV=12.14) — consider re-running with --runs 5
[19] ⚠️ High run-to-run variance (CV=0.99) — consider re-running with --runs 5
[20] ⚠️ High run-to-run variance (CV=0.59) — consider re-running with --runs 5
[21] ⚠️ High run-to-run variance (CV=2.10) — consider re-running with --runs 5
[22] ⚠️ High run-to-run variance (CV=0.52) — consider re-running with --runs 5
[23] ⚠️ High run-to-run variance (CV=1.36) — consider re-running with --runs 5
[24] ⚠️ High run-to-run variance (CV=1.45) — consider re-running with --runs 5
[25] ⚠️ High run-to-run variance (CV=2.00) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -8.0% due to: tokens (45583 → 78530), time (41.1s → 104.7s), tool calls (4 → 7)
[26] ⚠️ High run-to-run variance (CV=5.81) — consider re-running with --runs 5
[27] ⚠️ High run-to-run variance (CV=3.78) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -64.6% due to: quality, judgment, tokens (86736 → 119886), time (70.2s → 116.4s), tool calls (10 → 14)
[28] ⚠️ High run-to-run variance (CV=1.33) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -54.3% due to: judgment, quality

⏰ timeout — run(s) hit the (120s, 300s, 360s) scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

- exp-test-maintainability: Add 'suggest a better test structure', 'consolidate similar test methods', 'convert copy-paste tests to data-driven parameterized tests' to match prompts like 'each new case needs a whole new method, suggest a better structure'.
- test-anti-patterns: Add 'review tests', 'find test problems', 'check test quality', 'audit tests for common mistakes' to match review-style prompts that don't use the word 'anti-pattern'.
- run-tests: Add 'hang timeout', 'blame-hang', 'blame-crash', 'TUnit' to match SDK 10 blame scenarios and TUnit filter scenarios that were intermittently not activating.

Evangelink · 2026-04-07T15:52:56Z

/evaluate

Copilot

Pull request overview

This PR consolidates duplicated “test boilerplate detection” content into the existing exp-test-maintainability evaluation, and restructures .NET testing reference material by promoting platform/filter/framework references into dedicated (non-invocable) skills that other skills can link to.

Changes:

- Moved the heavy/minimal boilerplate evaluation scenarios into tests/dotnet-experimental/exp-test-maintainability/eval.yaml and removed exp-test-boilerplate-detection eval/skill content.
- Added new test fixture projects (Calculator.Tests, OrderService.Tests) to support maintainability scenarios.
- Refactored skill documentation to reference shared platform-detection, filter-syntax, and test-framework reference skills, and updated CODEOWNERS accordingly.

Show a summary per file

File	Description
tests/dotnet-experimental/exp-test-maintainability/fixtures/minimal-boilerplate/Calculator.Tests/CalculatorTests.cs	Adds “minimal boilerplate” MSTest sample tests used as an evaluation fixture.
tests/dotnet-experimental/exp-test-maintainability/fixtures/minimal-boilerplate/Calculator.Tests/Calculator.Tests.csproj	Adds MSTest fixture project metadata for the minimal-boilerplate scenario.
tests/dotnet-experimental/exp-test-maintainability/fixtures/heavy-boilerplate/OrderService.Tests/OrderService.Tests.csproj	Adds MSTest fixture project metadata for the heavy-boilerplate scenario.
tests/dotnet-experimental/exp-test-maintainability/fixtures/heavy-boilerplate/OrderService.Tests/OrderProcessorTests.cs	Adds “heavy boilerplate” test fixture to drive maintainability recommendations.
tests/dotnet-experimental/exp-test-maintainability/eval.yaml	Adds new scenarios for heavy/minimal boilerplate and related rubric/assertions.
tests/dotnet-experimental/exp-test-boilerplate-detection/eval.yaml	Removes now-redundant eval scenarios after consolidation.
plugins/dotnet-test/skills/test-anti-patterns/SKILL.md	Updates positioning and cross-links to related skills/framework reference skill.
plugins/dotnet-test/skills/run-tests/SKILL.md	Updates detection guidance and references `platform-detection` / `filter-syntax` skills.
plugins/dotnet-test/skills/run-tests/references/platform-detection.md	Deleted in favor of the `platform-detection` skill.
plugins/dotnet-test/skills/platform-detection/SKILL.md	Adds skill frontmatter to make platform detection a shared reference skill.
plugins/dotnet-test/skills/mtp-hot-reload/SKILL.md	Updates references to the centralized `platform-detection` / `filter-syntax` skills.
plugins/dotnet-test/skills/mtp-hot-reload/references/filter-syntax.md	Deleted in favor of the `filter-syntax` skill.
plugins/dotnet-test/skills/migrate-vstest-to-mtp/SKILL.md	Updates references to centralized platform/filter reference skills.
plugins/dotnet-test/skills/migrate-vstest-to-mtp/references/platform-detection.md	Deleted in favor of the `platform-detection` skill.
plugins/dotnet-test/skills/migrate-vstest-to-mtp/references/filter-syntax.md	Deleted in favor of the `filter-syntax` skill.
plugins/dotnet-test/skills/filter-syntax/SKILL.md	Adds skill frontmatter to make filter syntax a shared reference skill.
plugins/dotnet-test/skills/dotnet-test-frameworks/SKILL.md	Adds a centralized, non-invocable framework reference skill for non-experimental skills.
plugins/dotnet-experimental/skills/exp-test-tagging/SKILL.md	Switches framework detection guidance to reference `exp-dotnet-test-frameworks`.
plugins/dotnet-experimental/skills/exp-test-smell-detection/SKILL.md	Updates references from old boilerplate skill to `exp-test-maintainability` and framework reference skill.
plugins/dotnet-experimental/skills/exp-test-maintainability/SKILL.md	Expands/clarifies maintainability scope to include duplication/boilerplate detection.
plugins/dotnet-experimental/skills/exp-test-boilerplate-detection/SKILL.md	Removes redundant skill after consolidation into `exp-test-maintainability`.
plugins/dotnet-experimental/skills/exp-dotnet-test-frameworks/SKILL.md	Adds skill frontmatter to make it a shared reference skill for experimental skills.
plugins/dotnet-experimental/skills/exp-assertion-quality/SKILL.md	Updates scanning guidance to reference `exp-dotnet-test-frameworks`.
.github/CODEOWNERS	Updates ownership entries to reflect added/removed experimental skill paths.

Copilot's findings

Files reviewed: 20/24 changed files
Comments generated: 0

github-actions · 2026-04-07T16:10:46Z

Skill Validation Results

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
test-anti-patterns	Detect mixed severity anti-patterns in repository service tests	5.0/5 → 5.0/5	✅ test-anti-patterns; tools: report_intent, skill / ⚠️ NOT ACTIVATED	✅ 0.06	✅ [1]
test-anti-patterns	Detect flakiness indicators and test coupling	3.0/5 → 5.0/5 🟢	✅ test-anti-patterns; tools: report_intent, skill / ⚠️ NOT ACTIVATED	✅ 0.06	✅
test-anti-patterns	Detect duplicated tests and magic values	3.0/5 → 5.0/5 🟢	✅ test-anti-patterns; tools: skill, report_intent / ⚠️ NOT ACTIVATED	✅ 0.06	✅ [2]
test-anti-patterns	Recognize well-written tests without inventing false positives	2.0/5 → 5.0/5 🟢	✅ test-anti-patterns; tools: report_intent, skill / ✅ test-anti-patterns; tools: skill, report_intent	✅ 0.06	✅
exp-test-maintainability	Recommend data-driven patterns with display names for unclear parameters	4.0/5 → 4.3/5 🟢	✅ exp-test-maintainability; tools: report_intent, skill / ⚠️ NOT ACTIVATED	✅ 0.11	❌ [3]
exp-test-maintainability	Recognize well-maintained tests that need minimal changes	4.3/5 → 5.0/5 🟢	⚠️ NOT ACTIVATED / ✅ exp-test-maintainability; tools: report_intent, skill	✅ 0.11	✅ [4]
exp-test-maintainability	Detect repeated object construction and setup across test methods	3.0/5 → 4.7/5 🟢	✅ exp-test-maintainability; tools: skill, glob / ✅ exp-test-maintainability; tools: skill	✅ 0.11	✅
exp-test-maintainability	Recognize tests with minimal boilerplate that need no refactoring	3.3/5 → 4.7/5 🟢	✅ exp-test-maintainability; tools: skill	✅ 0.11	✅ [5]
exp-assertion-quality	Identify low assertion diversity in equality-dominated test suite	3.3/5 ⏰ → 4.7/5 ⏰ 🟢	✅ exp-assertion-quality; tools: skill / ✅ exp-assertion-quality; tools: skill, glob	✅ 0.09	✅ [6]
exp-assertion-quality	Flag assertion-free tests and trivial-only assertions	3.3/5 → 4.3/5 🟢	✅ exp-assertion-quality; tools: skill	✅ 0.09	✅ [7]
exp-assertion-quality	Recognize well-diversified assertion usage	3.0/5 → 4.7/5 🟢	✅ exp-assertion-quality; tools: skill	✅ 0.09	✅
exp-assertion-quality	Decline request to write new tests from scratch	1.7/5 ⏰ → 1.0/5 ⏰ 🔴	ℹ️ not activated (expected)	✅ 0.09	❌ [8]
mtp-hot-reload	Suggest hot reload for failing test in MTP project (SDK 9)	1.0/5 → 2.7/5 ⏰ 🟢	✅ mtp-hot-reload; tools: skill, report_intent, view, bash, read_bash, edit, glob	✅ 0.10	✅
mtp-hot-reload	Suggest hot reload for failing test in MTP project (SDK 10)	1.0/5 → 4.0/5 🟢	✅ mtp-hot-reload; tools: skill, bash, create, glob / ✅ mtp-hot-reload; tools: skill, bash, create	✅ 0.10	✅
mtp-hot-reload	Enable hot reload when package already installed	2.0/5 → 5.0/5 🟢	✅ mtp-hot-reload; tools: skill	✅ 0.10	✅
mtp-hot-reload	Suggest launchSettings.json configuration for hot reload	1.0/5 → 4.0/5 🟢	✅ mtp-hot-reload; tools: skill, bash, create	✅ 0.10	✅
mtp-hot-reload	Use dotnet run not dotnet test for hot reload	2.3/5 → 3.3/5 🟢	✅ mtp-hot-reload; tools: skill / ✅ mtp-hot-reload; tools: skill, report_intent	✅ 0.10	✅ [9]
mtp-hot-reload	Negative: VSTest project cannot use MTP hot reload	1.7/5 → 2.0/5 🟢	✅ mtp-hot-reload; tools: skill, create	✅ 0.10	✅
mtp-hot-reload	Run specific failing test with hot reload filter	1.0/5 → 4.0/5 🟢	✅ mtp-hot-reload; tools: skill	✅ 0.10	✅
run-tests	Run tests in a VSTest MSTest project	4.0/5 → 4.0/5	✅ run-tests; tools: skill, glob	✅ 0.14	✅ [10]
run-tests	Run tests with trx reporting on MTP project (SDK 9)	3.7/5 → 3.3/5 ⏰ 🔴	✅ run-tests; tools: skill	✅ 0.14	❌ [11]
run-tests	Run tests with blame-hang on MTP project (SDK 10)	2.3/5 → 4.0/5 ⏰ 🟢	✅ run-tests; tools: skill, bash, edit	✅ 0.14	❌ [12]
run-tests	Run tests in a multi-TFM project targeting a specific framework	2.0/5 → 4.3/5 🟢	✅ run-tests; tools: bash, skill, glob / ⚠️ NOT ACTIVATED	✅ 0.14	✅
run-tests	Filter MSTest tests by category on VSTest	5.0/5 → 5.0/5	⚠️ NOT ACTIVATED	✅ 0.14	✅ [13]
run-tests	Filter NUnit tests by class name on VSTest	3.7/5 → 5.0/5 🟢	✅ run-tests; tools: skill, glob, bash / ⚠️ NOT ACTIVATED	✅ 0.14	✅
run-tests	Filter xUnit v3 tests by class on MTP	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill / ⚠️ NOT ACTIVATED	✅ 0.14	✅ [14]
run-tests	Filter xUnit v3 tests by trait on MTP	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view	✅ 0.14	✅
run-tests	Filter TUnit tests by class using treenode-filter	1.7/5 → 4.3/5 🟢	✅ run-tests; tools: skill, glob, bash / ⚠️ NOT ACTIVATED	✅ 0.14	✅
run-tests	Combine multiple filter criteria on VSTest MSTest	4.7/5 → 4.3/5 🔴	⚠️ NOT ACTIVATED / ✅ run-tests; tools: skill	✅ 0.14	✅ [15]
run-tests	MTP project on SDK 9 must use -- separator for args	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill	✅ 0.14	✅
run-tests	MTP project on SDK 10 passes args directly	2.7/5 → 3.3/5 🟢	✅ run-tests; tools: skill	✅ 0.14	✅ [16]
run-tests	Detect test platform from Directory.Build.props	1.3/5 → 5.0/5 🟢	✅ run-tests; tools: skill	✅ 0.14	✅ [17]
run-tests	Negative test: do not use MTP syntax for a VSTest project	4.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view, glob / ✅ run-tests; tools: skill, view	✅ 0.14	✅ [18]
migrate-vstest-to-mtp	Migrate MSTest project from VSTest to Microsoft.Testing.Platform	4.7/5 → 5.0/5 🟢	✅ migrate-vstest-to-mtp; tools: skill / ✅ migrate-vstest-to-mtp; tools: report_intent, skill	✅ 0.07	❌ [19]
migrate-vstest-to-mtp	Migrate NUnit project from VSTest to Microsoft.Testing.Platform	2.0/5 → 5.0/5 🟢	✅ migrate-vstest-to-mtp; tools: skill	✅ 0.07	✅
migrate-vstest-to-mtp	Migrate xUnit.net v2 project from VSTest to Microsoft.Testing.Platform	2.0/5 → 5.0/5 🟢	✅ migrate-vstest-to-mtp; tools: skill, report_intent, glob, bash, view / ✅ migrate-vstest-to-mtp; tools: skill	✅ 0.07	✅
migrate-vstest-to-mtp	Update Azure DevOps pipeline from VSTest task to MTP	2.7/5 → 5.0/5 🟢	✅ migrate-vstest-to-mtp; tools: skill	✅ 0.07	✅
migrate-vstest-to-mtp	Migrate MSTest.Sdk project that explicitly uses VSTest	3.0/5 → 5.0/5 🟢	✅ migrate-vstest-to-mtp; tools: skill	✅ 0.07	✅
migrate-vstest-to-mtp	Translate dotnet test VSTest arguments to MTP equivalents	4.3/5 → 5.0/5 🟢	✅ migrate-vstest-to-mtp; tools: skill, report_intent / ✅ migrate-vstest-to-mtp; tools: skill	✅ 0.07	✅ [20]
migrate-vstest-to-mtp	Handle exit code 8 when migrating from VSTest to MTP	3.0/5 → 4.7/5 🟢	✅ migrate-vstest-to-mtp; tools: skill / ✅ migrate-vstest-to-mtp; tools: stop_bash, skill	✅ 0.07	✅ [21]
migrate-vstest-to-mtp	Configure dotnet test MTP mode on .NET 10 SDK	2.0/5 → 4.7/5 🟢	✅ migrate-vstest-to-mtp; tools: skill	✅ 0.07	✅
migrate-vstest-to-mtp	Migrate xUnit.net VSTest filter syntax to MTP	2.0/5 → 5.0/5 🟢	✅ migrate-vstest-to-mtp; tools: skill	✅ 0.07	✅ [22]
migrate-vstest-to-mtp	Full VSTest to MTP migration plan for MSTest solution	2.7/5 → 5.0/5 🟢	✅ migrate-vstest-to-mtp; tools: skill	✅ 0.07	✅ [23]
exp-test-smell-detection	Detect multiple test smells in order processing test suite	4.0/5 → 5.0/5 🟢	✅ exp-test-smell-detection; tools: skill	✅ 0.06	✅
exp-test-smell-detection	Recognize well-written tests with no significant smells	2.7/5 → 4.7/5 🟢	✅ exp-test-smell-detection; tools: skill	✅ 0.06	✅
exp-test-smell-detection	Recognize integration tests and avoid false positives for external resources	5.0/5 → 5.0/5	✅ exp-test-smell-detection; tools: skill	✅ 0.06	❌ [24]
exp-test-smell-detection	Decline request to write new tests from scratch	4.7/5 → 4.7/5	ℹ️ not activated (expected)	✅ 0.06	❌ [25]
exp-test-tagging	Tag an untagged MSTest test suite	3.7/5 → 4.7/5 🟢	✅ exp-test-tagging; tools: skill, glob / ✅ exp-test-tagging; tools: skill, glob, task, read_agent	✅ 0.17	✅
exp-test-tagging	Tag an untagged xUnit test suite	3.7/5 → 4.3/5 🟢	✅ exp-test-tagging; tools: skill, glob / ✅ exp-test-tagging; tools: skill, task, read_agent, grep	✅ 0.17	✅ [26]
exp-test-tagging	Tag an untagged NUnit test suite	4.0/5 → 5.0/5 🟢	✅ exp-test-tagging; tools: skill, glob	✅ 0.17	✅
exp-test-tagging	Audit test distribution without modifying files	4.0/5 → 5.0/5 🟢	✅ exp-test-tagging; tools: skill, glob	✅ 0.17	✅
exp-test-tagging	Decline request to write new tests	4.3/5 → 4.0/5 🔴	ℹ️ not activated (expected)	✅ 0.17	❌ [27]

[1] ⚠️ High run-to-run variance (CV=9.99) — consider re-running with --runs 5. (Plugin) Quality dropped but weighted score is +5.6% due to: completion (✗ → ✓)
[2] ⚠️ High run-to-run variance (CV=0.58) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=1.26) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -1.6% due to: tokens (18774 → 23110)
[4] ⚠️ High run-to-run variance (CV=1.47) — consider re-running with --runs 5
[5] ⚠️ High run-to-run variance (CV=0.90) — consider re-running with --runs 5
[6] ⚠️ High run-to-run variance (CV=1.21) — consider re-running with --runs 5
[7] ⚠️ High run-to-run variance (CV=0.52) — consider re-running with --runs 5
[8] ⚠️ High run-to-run variance (CV=1.58) — consider re-running with --runs 5
[9] ⚠️ High run-to-run variance (CV=0.91) — consider re-running with --runs 5
[10] ⚠️ High run-to-run variance (CV=0.59) — consider re-running with --runs 5
[11] ⚠️ High run-to-run variance (CV=21.88) — consider re-running with --runs 5
[12] ⚠️ High run-to-run variance (CV=4.73) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -60.0% due to: judgment, quality, tokens (34756 → 250897), tool calls (4 → 12), time (23.8s → 121.2s)
[13] ⚠️ High run-to-run variance (CV=1.00) — consider re-running with --runs 5
[14] ⚠️ High run-to-run variance (CV=0.55) — consider re-running with --runs 5
[15] ⚠️ High run-to-run variance (CV=5.30) — consider re-running with --runs 5
[16] ⚠️ High run-to-run variance (CV=1.46) — consider re-running with --runs 5
[17] ⚠️ High run-to-run variance (CV=1.70) — consider re-running with --runs 5
[18] ⚠️ High run-to-run variance (CV=1.02) — consider re-running with --runs 5
[19] ⚠️ High run-to-run variance (CV=10.48) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -0.9% due to: tokens (12739 → 36224), tool calls (0 → 2)
[20] ⚠️ High run-to-run variance (CV=2.48) — consider re-running with --runs 5
[21] ⚠️ High run-to-run variance (CV=0.51) — consider re-running with --runs 5
[22] ⚠️ High run-to-run variance (CV=0.54) — consider re-running with --runs 5
[23] ⚠️ High run-to-run variance (CV=0.59) — consider re-running with --runs 5
[24] ⚠️ High run-to-run variance (CV=4.00) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -7.8% due to: tokens (47540 → 83581), tool calls (4 → 8), time (34.6s → 89.3s)
[25] ⚠️ High run-to-run variance (CV=0.78) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -30.6% due to: quality, judgment, tokens (150151 → 213552), time (79.8s → 97.0s)
[26] ⚠️ High run-to-run variance (CV=0.56) — consider re-running with --runs 5
[27] ⚠️ High run-to-run variance (CV=2.36) — consider re-running with --runs 5
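
The report does not say which quantity the CV is computed over or how the validator defines it. As a minimal sketch, assuming the textbook coefficient of variation (sample standard deviation divided by the absolute mean) over per-run values, the calculation could look like this. The 0.5 threshold is a guess based only on the smallest flagged value above (0.51), and the function names are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence

# Hypothetical threshold: the smallest CV flagged in the report is 0.51.
HIGH_VARIANCE_THRESHOLD = 0.5


def coefficient_of_variation(values: Sequence[float]) -> float:
    """Sample standard deviation divided by the absolute mean."""
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate variance")
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(values) / abs(m)


def is_high_variance(values: Sequence[float]) -> bool:
    """Flag a scenario whose runs disagree more than the assumed threshold."""
    return coefficient_of_variation(values) > HIGH_VARIANCE_THRESHOLD


if __name__ == "__main__":
    # Made-up per-run values, not taken from the report.
    print(coefficient_of_variation([2.0, 4.0, 6.0]))
    print(is_high_variance([5.0, 5.0, 5.0]))
```

Because the mean sits in the denominator, a small average value with modest spread can produce a very large CV, which would be consistent with values such as 21.88 appearing next to scenarios whose quality scores barely moved.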

⏰ timeout — run(s) hit the (120s, 300s, 360s) scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

▶ Sessions Visualisation -- interactive replay of all evaluation sessions

* Deduplicate test skill references and clarify skill boundaries
  - Move platform-detection.md and filter-syntax.md to plugins/dotnet-test/shared/, removing 3 identical copies of each from run-tests, mtp-hot-reload, and migrate-vstest-to-mtp reference directories.
  - Move dotnet.md from exp-test-smell-detection/extensions/ to shared/ as dotnet-test-frameworks.md in both dotnet-test and dotnet-experimental plugins. Update exp-assertion-quality, exp-test-boilerplate-detection, exp-test-tagging, and test-anti-patterns to reference the shared file instead of inlining framework detection tables.
  - Differentiate test-anti-patterns (quick pragmatic review) from exp-test-smell-detection (deep formal audit with academic taxonomy) by updating descriptions and cross-referencing each other in When Not to Use sections.
  - Update skill-validator to allow ../../shared/ file references while still blocking other parent-directory traversals. Add tests for the new rule.
* Switch from shared/ directories to hidden reference skills
  Replace the plugin-level shared/ directories with non-invocable reference skills (user-invocable: false) that other skills reference by name.
  - Create platform-detection, filter-syntax, and dotnet-test-frameworks as hidden skills under plugins/dotnet-test/skills/. These contain the detection tables and syntax references previously duplicated across run-tests, mtp-hot-reload, and migrate-vstest-to-mtp.
  - Create exp-dotnet-test-frameworks as a hidden skill under plugins/dotnet-experimental/skills/ for the experimental test analysis skills (exp-test-smell-detection, exp-assertion-quality, etc.).
  - Update all consuming skills to reference these by skill name in backtick notation instead of file links.
  - Revert the skill-validator ../../shared/ exception; it is no longer needed since all references now use the standard skill name mechanism.
* Merge exp-test-boilerplate-detection into exp-test-maintainability
  exp-test-maintainability was only 6 calibration rules with no workflow. exp-test-boilerplate-detection had the full 5-category detection workflow, examples, calibration, and validation. Merge the boilerplate content into exp-test-maintainability (the broader, more user-facing name) and add the two unique maintainability rules (DisplayName guidance, DataRow vs DynamicData preference) to Category 3.
  - Replace exp-test-maintainability SKILL.md with the merged content
  - Move test fixtures from exp-test-boilerplate-detection to exp-test-maintainability
  - Merge eval.yaml scenarios (4 total: 2 from each original skill)
  - Delete exp-test-boilerplate-detection skill and tests
  - Update all cross-references in exp-test-smell-detection, dotnet-test-frameworks, exp-dotnet-test-frameworks, and CODEOWNERS
* Add cross-references to test-anti-patterns for deep mock and duplication analysis
  Point users to exp-mock-usage-analysis from the Over-mocking entry and to exp-test-maintainability from the Duplicate tests entry.
* Add exp-dotnet-test-frameworks to CODEOWNERS
* Improve run-tests SDK 10 MTP detection for blame-hang scenario
  Inline the critical SDK 10 detection signal (global.json test.runner) directly in Step 1 instead of deferring entirely to the platform-detection skill. This makes the distinction between SDK 10 (no -- separator) and SDK 8/9 (requires -- separator) more prominent. Add a quick detection summary table, strengthen the Common Pitfalls entry for SDK 10 with a blame-hang-timeout example, and keep the platform-detection skill reference for the full detection logic.
* Improve skill activation keywords in descriptions
  - exp-test-maintainability: Add 'suggest a better test structure', 'consolidate similar test methods', 'convert copy-paste tests to data-driven parameterized tests' to match prompts like 'each new case needs a whole new method, suggest a better structure'.
  - test-anti-patterns: Add 'review tests', 'find test problems', 'check test quality', 'audit tests for common mistakes' to match review-style prompts that don't use the word 'anti-pattern'.
  - run-tests: Add 'hang timeout', 'blame-hang', 'blame-crash', 'TUnit' to match SDK 10 blame scenarios and TUnit filter scenarios that were intermittently not activating.
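
The SDK 10 versus SDK 8/9 distinction described in the run-tests commit can be sketched as a small command builder. This is an illustration under stated assumptions, not the skill's actual logic: the function names are hypothetical, the runner value `Microsoft.Testing.Platform` is assumed as the global.json setting that signals MTP mode on SDK 10, and `--report-trx` is used only as an example platform argument.

```python
import json
from typing import List, Sequence


def uses_mtp_runner(global_json_text: str) -> bool:
    """True when global.json sets test.runner to Microsoft.Testing.Platform.

    Assumption: this is the SDK 10 detection signal the commit refers to.
    """
    data = json.loads(global_json_text)
    test_section = data.get("test")
    if not isinstance(test_section, dict):
        return False
    return test_section.get("runner") == "Microsoft.Testing.Platform"


def build_dotnet_test_command(platform_args: Sequence[str], sdk10_mtp: bool) -> List[str]:
    """Build argv for `dotnet test` on an MTP project.

    SDK 10 in MTP mode takes platform arguments directly; on SDK 8/9 they
    must follow a `--` separator.
    """
    command = ["dotnet", "test"]
    if not platform_args:
        return command
    if sdk10_mtp:
        return command + list(platform_args)
    return command + ["--"] + list(platform_args)


if __name__ == "__main__":
    sdk10 = uses_mtp_runner('{"test": {"runner": "Microsoft.Testing.Platform"}}')
    print(" ".join(build_dotnet_test_command(["--report-trx"], sdk10)))
    print(" ".join(build_dotnet_test_command(["--report-trx"], False)))
```

Returning an argv list rather than a shell string keeps the `--` separator an explicit element, which makes the difference between the two SDK generations easy to assert in a test.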

Evangelink added 5 commits April 7, 2026 14:07

Add cross-references to test-anti-patterns for deep mock and duplication analysis

83c67dd

Point users to exp-mock-usage-analysis from the Over-mocking entry and to exp-test-maintainability from the Duplicate tests entry.

Add exp-dotnet-test-frameworks to CODEOWNERS

b357bf1

github-actions Bot added a commit that referenced this pull request Apr 7, 2026

Update PR token usage data (PR #501)

ae7ddf2

github-actions Bot added a commit that referenced this pull request Apr 7, 2026

Update session data (PR #501)

45ee04e

github-actions Bot added a commit that referenced this pull request Apr 7, 2026

Update PR token usage data (PR #501)

b19504a

github-actions Bot added a commit that referenced this pull request Apr 7, 2026

Update session data (PR #501)

200c129

Evangelink marked this pull request as ready for review April 7, 2026 15:53

Evangelink requested review from dbreshears and timheuer as code owners April 7, 2026 15:53

Copilot AI review requested due to automatic review settings April 7, 2026 15:53

Copilot started reviewing on behalf of Evangelink April 7, 2026 15:53

Copilot AI reviewed Apr 7, 2026

github-actions Bot added a commit that referenced this pull request Apr 7, 2026

Update PR token usage data (PR #501)

457fd9c

github-actions Bot added a commit that referenced this pull request Apr 7, 2026

Update session data (PR #501)

c6160e6

JanKrivanek approved these changes Apr 8, 2026

Evangelink merged commit 93dc33d into main Apr 8, 2026
36 checks passed

Evangelink deleted the dev/amauryleve/unify branch April 8, 2026 11:47

Evangelink merged 7 commits into main from dev/amauryleve/unify

3 participants
