Add AGENTVIZ session replay integration #494

Merged: JanKrivanek merged 4 commits into main from dev/jankrivanek/agentviz-integration-poc on Apr 2, 2026

@JanKrivanek
Member

@JanKrivanek JanKrivanek commented Apr 1, 2026

Integrates AGENTVIZ session replay visualization into the skills evaluation pipeline and dashboard.

Changes

  • evaluation.yml: Add workflow_dispatch trigger, --keep-sessions flag, publish-session-data job, replay links in PR comments, AGENTVIZ SPA deployment in deploy-dashboard
  • build-replay-sessions.ps1: Manifest generation from sessions.db -- flattens JSONL files and creates AGENTVIZ-compatible manifest
  • purge-replay-sessions.ps1: 7-day retention management for session data
  • dashboard.js: Per-plugin Sessions Visualisation links

How it works

  1. evaluate job now runs with --keep-sessions, preserving native SDK events.jsonl files
  2. New publish-session-data job flattens JSONL and pushes manifest + sessions to dashboard-session-data branch
  3. PR comments include a replay link pointing to the AGENTVIZ SPA on gh-pages/replay/
  4. AGENTVIZ SPA is built and deployed during deploy-dashboard with skip-if-unchanged guard
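The flattening and manifest step described above can be sketched roughly as follows. This is a minimal illustration in JavaScript, not the contents of build-replay-sessions.ps1: the helper names, the `__` separator, and the manifest fields (`sessions`, `id`, `url`, `tags`, `mtime`) are assumptions, and the real AGENTVIZ manifest schema may differ.

```javascript
// Hypothetical helper: collapse a nested events.jsonl path into one flat file name.
function flattenSessionPath(relativePath) {
  return relativePath
    .split(/[\\/]+/) // accept both / and \ separators
    .filter(Boolean)
    .join('__');
}

// Hypothetical manifest shape (assumed, not the confirmed AGENTVIZ schema).
function buildManifest(sessions, prefix) {
  return {
    sessions: sessions.map((s) => ({
      id: s.id,
      url: `${prefix}/${flattenSessionPath(s.path)}`,
      tags: [s.skill, s.role],
      mtime: s.mtime,
    })),
  };
}

const manifest = buildManifest(
  [
    {
      id: 'a1',
      path: 'dotnet/run-tests/baseline/events.jsonl',
      skill: 'run-tests',
      role: 'baseline',
      mtime: '2026-04-01T15:00:00Z',
    },
  ],
  'sessions/pr/494'
);
console.log(JSON.stringify(manifest, null, 2));
```

Keeping the functions pure (no file system access) makes the path and manifest logic easy to check separately from the copy step.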

Prerequisites already deployed

  • dashboard-session-data branch created with stub manifest
  • AGENTVIZ SPA deployed to gh-pages/replay/

See docs/agentviz-integration-plan.md for full design.

Commit: Add AGENTVIZ session replay integration

- Add workflow_dispatch trigger to evaluation.yml
- Add --keep-sessions to skill-validator evaluate step
- Add publish-session-data job (mirrors publish-token-data)
- Add replay link to PR comments (comment-on-pr)
- Add AGENTVIZ SPA build/deploy to deploy-dashboard job
- Add setup-node step to deploy-dashboard
- Add per-plugin Sessions Visualisation links to dashboard.js
- Create build-replay-sessions.ps1 (manifest generation from sessions.db)
- Create purge-replay-sessions.ps1 (7-day retention management)
@JanKrivanek
Member Author

/evaluate

@github-actions
Contributor

github-actions Bot commented Apr 1, 2026

Skill Validation Results

Skill Scenario Quality Skills Loaded Overfit Verdict
dotnet-maui-doctor Plan macOS MAUI setup with Xcode 3.0/5 → 5.0/5 🟢 ✅ dotnet-maui-doctor; tools: skill, report_intent, view / ✅ dotnet-maui-doctor; tools: report_intent, skill, view ✅ 0.17
dotnet-maui-doctor Plan Linux MAUI environment for Android 3.0/5 → 5.0/5 🟢 ✅ dotnet-maui-doctor; tools: skill, view ✅ 0.17
dotnet-maui-doctor Guardrail against workload update and repair 1.0/5 → 3.0/5 🟢 ✅ dotnet-maui-doctor; tools: report_intent, skill ✅ 0.17
dotnet-maui-doctor Diagnose non-Microsoft JDK causing build failure 4.0/5 → 5.0/5 🟢 ✅ dotnet-maui-doctor; tools: report_intent, skill, view ✅ 0.17 [1]
dotnet-maui-doctor Plan complete MAUI setup on Windows 4.0/5 → 5.0/5 🟢 ✅ dotnet-maui-doctor; tools: report_intent, skill, view ✅ 0.17 [2]
dotnet-maui-doctor Prevent incorrect JAVA_HOME configuration 2.0/5 → 5.0/5 🟢 ✅ dotnet-maui-doctor; tools: report_intent, skill ✅ 0.17
dotnet-maui-doctor Determine required Android SDK packages for specific .NET version 3.0/5 → 4.0/5 🟢 ✅ dotnet-maui-doctor; tools: skill, view ✅ 0.17
dotnet-maui-doctor Fix stale MAUI workloads after SDK update 2.0/5 → 4.0/5 🟢 ✅ dotnet-maui-doctor; tools: report_intent, skill, view ✅ 0.17
optimizing-ef-core-queries Optimize bulk operations with EF Core 7+ ExecuteUpdate and ExecuteDelete 5.0/5 → 4.0/5 🔴 ✅ optimizing-ef-core-queries; tools: skill / ✅ optimizing-ef-core-queries; tools: report_intent, skill 🟡 0.28

[1] (Plugin) Quality improved but weighted score is -1.9% due to: completion (✓ → ✗), tokens (13184 → 47663), tool calls (0 → 5)
[2] (Isolated) Quality improved but weighted score is -5.6% due to: completion (✓ → ✗), tokens (13439 → 56419), tool calls (0 → 9), time (36.7s → 52.6s)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

github-actions Bot added a commit that referenced this pull request Apr 1, 2026
@JanKrivanek
Member Author

🎬 Session Replay

Evaluation sessions captured for this PR are available for interactive replay in AGENTVIZ:

▶️ Open Session Replay (PR #494)

3 sessions captured (baseline, isolated, plugin) for dotnet / Test a C# language feature with a script.

Each session shows the full agent conversation timeline -- tool calls, reasoning, context reads -- as an interactive replay you can scrub through, inspect events, and compare roles.


This link is auto-generated by the publish-session-data + comment-on-pr pipeline jobs after evaluation completes.

@JanKrivanek JanKrivanek marked this pull request as ready for review April 1, 2026 15:55
@JanKrivanek JanKrivanek requested a review from ViktorHofer as a code owner April 1, 2026 15:55
Copilot AI review requested due to automatic review settings April 1, 2026 15:55
Contributor

Copilot AI left a comment

Pull request overview

Integrates AGENTVIZ session replay into the evaluation workflow and the GitHub Pages dashboard by preserving eval session artifacts, publishing a session manifest + JSONL data to a dedicated branch, and adding dashboard/PR links to open the replay UI.

Changes:

  • Persist eval session artifacts (--keep-sessions) and publish flattened session JSONL + manifest.json to dashboard-session-data.
  • Deploy/update the AGENTVIZ SPA under gh-pages/replay/ and add replay links to PR comments.
  • Add a per-plugin “Sessions Visualisation” link in the dashboard UI.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

File Description
.github/workflows/evaluation.yml Adds session publishing + AGENTVIZ SPA build/deploy and PR replay links.
eng/dashboard/build-replay-sessions.ps1 Builds AGENTVIZ-compatible session directory structure and manifest from eval artifacts.
eng/dashboard/purge-replay-sessions.ps1 Merges new session data, purges by retention, and regenerates manifest.
eng/dashboard/dashboard.js Adds a per-plugin link to open AGENTVIZ replay with manifest + tag filters.
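A per-plugin replay link of the kind described for dashboard.js could be assembled along these lines. This is only a sketch: the query parameter names `manifest` and `tag`, the function name, and the example URLs are assumptions, not values taken from the PR.

```javascript
// Hypothetical link builder; parameter names are assumed for illustration.
function buildReplayLink(baseUrl, manifestUrl, tags) {
  const params = new URLSearchParams();
  params.set('manifest', manifestUrl); // URLSearchParams percent-encodes the value
  for (const tag of tags) {
    params.append('tag', tag); // one repeated parameter per tag filter
  }
  return `${baseUrl}?${params.toString()}`;
}

// Placeholder URLs (example.invalid is a reserved, non-resolving domain).
const manifestUrl = 'https://example.invalid/data/manifest.json';
const link = buildReplayLink('https://example.invalid/replay/', manifestUrl, [
  'run-tests',
  'pr-494',
]);
console.log(link);
```

Building the query string with `URLSearchParams` rather than string concatenation keeps the manifest URL correctly encoded when it is itself a full URL.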

Comment thread eng/dashboard/build-replay-sessions.ps1 Outdated
Comment thread eng/dashboard/purge-replay-sessions.ps1 Outdated
Comment thread eng/dashboard/purge-replay-sessions.ps1 Outdated
Comment thread .github/workflows/evaluation.yml Outdated
Comment thread .github/workflows/evaluation.yml Outdated
Comment thread .github/workflows/evaluation.yml Outdated

Commit: Address code review feedback on PR #494

- Use [IO.Path]::PathSeparator instead of hardcoded ';' for cross-platform PATH
- Compute cutoff date in UTC for correct retention comparisons
- Precompute ID HashSet before merge loop to avoid O(n^2)
- Pin actions/setup-node to commit SHA (49933ea5...#v4)
- Pin AGENTVIZ clone to commit SHA with verification
- Skip npm ci + build when deployed commit SHA matches pinned SHA

Commit: Remove hardcoded AGENTVIZ SHA; use cache + zero-clone checks

- Resolve AGENTVIZ target SHA via git ls-remote (no clone)
- Read deployed SHA via curl from raw.githubusercontent.com (no clone)
- Cache build output keyed by commit SHA (skip npm ci+build on hit)
- Only clone AGENTVIZ repo on cache miss when SPA needs rebuild
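Two of the fixes listed above (the precomputed ID set and the UTC cutoff) follow a common pattern that can be illustrated as below. The actual changes were made in PowerShell; this JavaScript sketch uses hypothetical function names and an assumed `id` field, and only shows the idea.

```javascript
// Merge existing sessions into the new ones, letting new data win on ID clashes.
// The Set is built once before the loop, so each lookup is O(1) instead of
// rescanning the merged list for every existing entry.
function mergeSessions(existing, incoming) {
  const seen = new Set(incoming.map((s) => s.id));
  const merged = [...incoming];
  for (const s of existing) {
    if (!seen.has(s.id)) {
      seen.add(s.id);
      merged.push(s);
    }
  }
  return merged;
}

// Compute the retention cutoff from an absolute instant, independent of the
// machine's local time zone.
function cutoffUtc(now, retentionDays) {
  return new Date(now.getTime() - retentionDays * 24 * 60 * 60 * 1000);
}

const merged = mergeSessions(
  [{ id: 'a', source: 'old' }, { id: 'b', source: 'old' }],
  [{ id: 'a', source: 'new' }]
);
console.log(merged.map((s) => `${s.id}:${s.source}`).join(', '));
console.log(cutoffUtc(new Date('2026-04-08T00:00:00Z'), 7).toISOString());
```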
Copilot AI review requested due to automatic review settings April 1, 2026 16:18
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

Comments suppressed due to low confidence (1)

eng/dashboard/purge-replay-sessions.ps1:92

  • Retention logic for PR sessions relies on $file.LastWriteTimeUtc, but these files are sourced from a git checkout (dashboard-session-data branch) where mtimes are typically set to checkout time, not original creation time. This means old sessions/pr/<number>/... files will likely never be marked expired and will continue to be copied into the merged output (even if they’re filtered out of the manifest later). Consider determining expiry for PR sessions from the existing manifest.json mtime values (and copying only URLs present in the retained manifest), or include a date component in the PR directory structure so expiry can be computed from the path similarly to scheduled runs.
if (Test-Path $existingSessionsDir) {
    # For scheduled runs, the path structure is sessions/scheduled/<date>/...
    # For PR runs, the path structure is sessions/pr/<number>/...
    # We check dated directories for retention, PR dirs are always kept within window (use file mtime)

    $existingFiles = Get-ChildItem -Path $existingSessionsDir -Recurse -File -ErrorAction SilentlyContinue
    foreach ($file in $existingFiles) {
        $relativePath = $file.FullName.Substring($existingSessionsDir.Length).TrimStart([IO.Path]::DirectorySeparatorChar, [IO.Path]::AltDirectorySeparatorChar)
        $destPath = Join-Path $sessionsWorkDir $relativePath

        # Skip if already copied from new data
        if (Test-Path $destPath) { continue }

        # Check retention: try to extract date from path (scheduled/YYYY-MM-DD/...)
        $isExpired = $false
        if ($relativePath -match 'scheduled[/\\](\d{4}-\d{2}-\d{2})[/\\]') {
            $dirDate = [DateTime]::ParseExact($Matches[1], 'yyyy-MM-dd', $null)
            if ($dirDate -lt $cutoffDate) {
                $isExpired = $true
            }
        } elseif ($file.LastWriteTimeUtc -lt $cutoffDate) {
            # For PR sessions without date in path, use file modification time
            $isExpired = $true
        }
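The reviewer's alternative, deriving expiry from the manifest's recorded mtime values instead of checkout-time file mtimes, might look roughly like this. It is a sketch in JavaScript rather than the script's PowerShell, and the manifest fields (`sessions`, `url`, `mtime`) and the function name are assumptions.

```javascript
// Keep only manifest entries whose recorded mtime is inside the retention
// window, and return the set of URLs that are still allowed to be copied.
function retainFromManifest(manifest, cutoff) {
  const sessions = manifest.sessions.filter(
    (s) => new Date(s.mtime).getTime() >= cutoff.getTime()
  );
  const keepUrls = new Set(sessions.map((s) => s.url));
  return { manifest: { ...manifest, sessions }, keepUrls };
}

const cutoff = new Date('2026-03-25T00:00:00Z');
const result = retainFromManifest(
  {
    sessions: [
      { id: 'old', url: 'sessions/pr/480/a.jsonl', mtime: '2026-03-20T09:00:00Z' },
      { id: 'new', url: 'sessions/pr/494/b.jsonl', mtime: '2026-04-01T15:00:00Z' },
    ],
  },
  cutoff
);
console.log([...result.keepUrls]);
```

A copy loop would then skip any existing file whose relative URL is not in `keepUrls`, so expired PR sessions are dropped from both the manifest and the merged output.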

Comment thread eng/dashboard/purge-replay-sessions.ps1 Outdated
Comment thread eng/dashboard/purge-replay-sessions.ps1 Outdated
Comment thread eng/dashboard/build-replay-sessions.ps1 Outdated
Comment thread .github/workflows/evaluation.yml Outdated
Comment thread .github/workflows/evaluation.yml Outdated
Comment thread .github/workflows/evaluation.yml Outdated
Comment thread eng/dashboard/purge-replay-sessions.ps1

Commit: Address round 2 code review feedback on PR #494

- Compare scheduled dir dates at day granularity (dirDate.Date vs cutoffDate.Date)
- Fix useTempDir null-safe path comparison via GetFullPath + null guard
- Use UTC for dateTag in build-replay-sessions.ps1 (consistent with purge)
- Fix cache/deploy path mismatch: deploy from /tmp/agentviz-dist (cache path)
- Deterministic clone: full clone + git checkout TARGET_SHA (fail on mismatch)
- URL-encode manifest param in PR comment link (jq @uri)
- Derive manifest generated timestamp from newest session mtime to avoid churn
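The day-granularity comparison and the stable manifest timestamp from the list above can be illustrated as follows. This is a JavaScript sketch with hypothetical function names and an assumed `mtime` field; the PR implements the equivalent logic in PowerShell.

```javascript
// Compare a scheduled/YYYY-MM-DD directory against the cutoff at whole-day
// granularity (UTC), so a directory dated on the cutoff day is not expired
// just because the cutoff carries a time-of-day component.
function isScheduledDirExpired(dirName, cutoff) {
  const dirDay = Date.parse(`${dirName}T00:00:00Z`);
  const cutoffDay = Date.UTC(
    cutoff.getUTCFullYear(),
    cutoff.getUTCMonth(),
    cutoff.getUTCDate()
  );
  return dirDay < cutoffDay;
}

// Derive the manifest's generated timestamp from the newest session mtime,
// so regenerating an unchanged manifest yields identical output.
function generatedTimestamp(sessions) {
  if (sessions.length === 0) return null;
  const newest = Math.max(...sessions.map((s) => Date.parse(s.mtime)));
  return new Date(newest).toISOString();
}

const cutoffDate = new Date('2026-03-25T16:18:00Z');
console.log(isScheduledDirExpired('2026-03-25', cutoffDate));
console.log(isScheduledDirExpired('2026-03-24', cutoffDate));
console.log(
  generatedTimestamp([
    { mtime: '2026-04-01T10:00:00Z' },
    { mtime: '2026-04-01T12:30:00Z' },
  ])
);
```

Using the current time for the generated field would change the manifest on every run and produce a new commit on the data branch even when no sessions changed, which is the churn the last bullet refers to.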
@JanKrivanek
Member Author

/evaluate

@JanKrivanek JanKrivanek enabled auto-merge (squash) April 1, 2026 17:14
@github-actions
Contributor

github-actions Bot commented Apr 1, 2026

Skill Validation Results

Skill Scenario Quality Skills Loaded Overfit Verdict
mtp-hot-reload Suggest hot reload for failing test in MTP project (SDK 9) 1.0/5 → 2.0/5 ⏰ 🟢 ✅ mtp-hot-reload; tools: skill / ✅ mtp-hot-reload; tools: skill, stop_bash
mtp-hot-reload Suggest hot reload for failing test in MTP project (SDK 10) 1.0/5 → 4.0/5 🟢 ✅ mtp-hot-reload; tools: skill, bash, create
mtp-hot-reload Enable hot reload when package already installed 2.0/5 → 5.0/5 🟢 ✅ mtp-hot-reload; tools: skill
mtp-hot-reload Suggest launchSettings.json configuration for hot reload 1.0/5 → 4.0/5 🟢 ✅ mtp-hot-reload; tools: skill, bash, create
mtp-hot-reload Use dotnet run not dotnet test for hot reload 1.0/5 → 4.0/5 🟢 ✅ mtp-hot-reload; tools: skill
mtp-hot-reload Negative: VSTest project cannot use MTP hot reload 1.0/5 → 2.0/5 ⏰ 🟢 ✅ mtp-hot-reload; tools: skill, create
mtp-hot-reload Run specific failing test with hot reload filter 1.0/5 → 3.0/5 🟢 ✅ mtp-hot-reload; tools: skill
migrate-vstest-to-mtp Migrate MSTest project from VSTest to Microsoft.Testing.Platform 4.0/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill / ✅ migrate-vstest-to-mtp; tools: report_intent, skill
migrate-vstest-to-mtp Migrate NUnit project from VSTest to Microsoft.Testing.Platform 1.0/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill / ✅ migrate-vstest-to-mtp; tools: report_intent, skill
migrate-vstest-to-mtp Migrate xUnit.net v2 project from VSTest to Microsoft.Testing.Platform 2.0/5 → 4.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill, report_intent, bash / ✅ migrate-vstest-to-mtp; tools: skill
migrate-vstest-to-mtp Update Azure DevOps pipeline from VSTest task to MTP 2.0/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill
migrate-vstest-to-mtp Migrate MSTest.Sdk project that explicitly uses VSTest 3.0/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill
migrate-vstest-to-mtp Translate dotnet test VSTest arguments to MTP equivalents 3.0/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill
migrate-vstest-to-mtp Handle exit code 8 when migrating from VSTest to MTP 3.0/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill / ⚠️ NOT ACTIVATED
migrate-vstest-to-mtp Configure dotnet test MTP mode on .NET 10 SDK 2.0/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill
migrate-vstest-to-mtp Migrate xUnit.net VSTest filter syntax to MTP 2.0/5 → 4.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill
migrate-vstest-to-mtp Full VSTest to MTP migration plan for MSTest solution 1.0/5 ⏰ → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill
migrate-mstest-v1v2-to-v3 Migrate MSTest v1 project with assembly reference 3.0/5 → 5.0/5 🟢 ✅ migrate-mstest-v1v2-to-v3; tools: skill, edit, bash ✅ 0.04
migrate-mstest-v1v2-to-v3 Migrate MSTest v2 NuGet project to v3 4.0/5 → 3.0/5 🔴 ✅ migrate-mstest-v1v2-to-v3; tools: skill ✅ 0.04
migrate-mstest-v1v2-to-v3 Fix Assert.AreEqual object overload errors after v3 upgrade 3.0/5 → 5.0/5 🟢 ✅ migrate-mstest-v1v2-to-v3; tools: skill, edit ✅ 0.04
migrate-mstest-v1v2-to-v3 Migrate from .testsettings to .runsettings 3.0/5 → 4.0/5 🟢 ✅ migrate-mstest-v1v2-to-v3; tools: skill, bash / ✅ migrate-mstest-v1v2-to-v3; tools: skill ✅ 0.04
migrate-mstest-v1v2-to-v3 Fix DataRow type mismatch errors after v3 upgrade 4.0/5 → 3.0/5 🔴 ✅ migrate-mstest-v1v2-to-v3; tools: skill ✅ 0.04
migrate-mstest-v1v2-to-v3 Migrate to MSTest.Sdk project style 3.0/5 → 5.0/5 🟢 ✅ migrate-mstest-v1v2-to-v3; tools: skill, bash ✅ 0.04
migrate-mstest-v1v2-to-v3 Handle dropped target framework during v3 migration 5.0/5 → 5.0/5 ⚠️ NOT ACTIVATED ✅ 0.04 [1]
migrate-mstest-v1v2-to-v3 Migrate complex MSTest v2 project with testsettings, DataRow issues, and dropped TFM 4.0/5 → 5.0/5 🟢 ✅ migrate-mstest-v1v2-to-v3; tools: skill ✅ 0.04
migrate-mstest-v1v2-to-v3 Correctly identify MSTest v1 vs v2 and recommend different migration paths 3.0/5 → 5.0/5 🟢 ✅ migrate-mstest-v1v2-to-v3; tools: skill, task, glob, read_agent / ✅ migrate-mstest-v1v2-to-v3; tools: skill, task, glob, read_agent, bash ✅ 0.04
migrate-mstest-v3-to-v4 Migrate custom TestMethodAttribute from Execute to ExecuteAsync 2.0/5 → 3.0/5 🟢 ✅ migrate-mstest-v3-to-v4; tools: skill
migrate-mstest-v3-to-v4 Replace ExpectedExceptionAttribute with Assert.ThrowsExactly 1.0/5 ⏰ → 4.0/5 ⏰ 🟢 ✅ migrate-mstest-v3-to-v4; tools: skill / ⚠️ NOT ACTIVATED
migrate-mstest-v3-to-v4 Fix multiple v4 breaking changes: Assert, ClassCleanup, TestContext, Timeout 3.0/5 ⏰ → 4.0/5 ⏰ 🟢 ✅ migrate-mstest-v3-to-v4; tools: skill
migrate-mstest-v3-to-v4 Handle net6.0 target framework dropped in MSTest v4 3.0/5 → 5.0/5 🟢 ⚠️ NOT ACTIVATED
migrate-mstest-v3-to-v4 Fix TestMethodAttribute CallerInfo constructor breaking change 3.0/5 → 4.0/5 🟢 ✅ migrate-mstest-v3-to-v4; tools: skill
migrate-mstest-v3-to-v4 Understand behavioral changes after MSTest v4 upgrade 3.0/5 → 5.0/5 🟢 ✅ migrate-mstest-v3-to-v4; tools: skill
migrate-mstest-v3-to-v4 Handle MSTest.Sdk and MTP changes in v4 2.0/5 → 3.0/5 🟢 ✅ migrate-mstest-v3-to-v4; tools: skill
migrate-mstest-v3-to-v4 Full MSTest v3 to v4 migration with multiple breaking changes 3.0/5 → 5.0/5 🟢 ✅ migrate-mstest-v3-to-v4; tools: skill
migrate-mstest-v3-to-v4 Migrate MSTest.Sdk v3 project using ManagedType and TestTimeout 3.0/5 → 4.0/5 🟢 ✅ migrate-mstest-v3-to-v4; tools: skill
migrate-mstest-v3-to-v4 Correctly identify MSTest v3 project and recommend v4 migration 4.0/5 → 5.0/5 🟢 ✅ migrate-mstest-v3-to-v4; tools: skill
run-tests Run tests in a VSTest MSTest project 4.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill
run-tests Run tests with trx reporting on MTP project (SDK 9) 4.0/5 → 4.0/5 ✅ run-tests; tools: skill
run-tests Run tests with blame-hang on MTP project (SDK 10) 2.0/5 → 2.0/5 ⏰ ✅ run-tests; tools: skill / ⚠️ NOT ACTIVATED
run-tests Run tests in a multi-TFM project targeting a specific framework 2.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, bash / ✅ run-tests; tools: skill, bash, glob
run-tests Filter MSTest tests by category on VSTest 5.0/5 → 5.0/5 ✅ run-tests; tools: skill, bash / ✅ run-tests; tools: skill [2]
run-tests Filter NUnit tests by class name on VSTest 4.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, bash / ⚠️ NOT ACTIVATED
run-tests Filter xUnit v3 tests by class on MTP 1.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, bash / ✅ run-tests; tools: skill
run-tests Filter xUnit v3 tests by trait on MTP 1.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, view
run-tests Filter TUnit tests by class using treenode-filter 2.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, bash / ⚠️ NOT ACTIVATED
run-tests Combine multiple filter criteria on VSTest MSTest 5.0/5 → 5.0/5 ⚠️ NOT ACTIVATED / ✅ run-tests; tools: skill [3]
run-tests MTP project on SDK 9 must use -- separator for args 1.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill / ⚠️ NOT ACTIVATED
run-tests MTP project on SDK 10 passes args directly 2.0/5 → 4.0/5 🟢 ✅ run-tests; tools: skill / ✅ run-tests; tools: skill, create
run-tests Detect test platform from Directory.Build.props 1.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill
run-tests Negative test: do not use MTP syntax for a VSTest project 4.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, view / ⚠️ NOT ACTIVATED [4]
writing-mstest-tests Write unit tests for a service class 4.0/5 → 4.0/5 ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.29
writing-mstest-tests Write data-driven tests for a calculator 5.0/5 → 4.0/5 🔴 ✅ writing-mstest-tests; tools: skill 🟡 0.29 [5]
writing-mstest-tests Write async tests with cancellation 2.0/5 → 5.0/5 🟢 ✅ writing-mstest-tests; tools: skill 🟡 0.29
writing-mstest-tests Fix swapped Assert.AreEqual arguments 5.0/5 → 5.0/5 ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.29 [6]
writing-mstest-tests Modernize legacy test patterns 5.0/5 → 4.0/5 🔴 ✅ writing-mstest-tests; tools: skill 🟡 0.29
writing-mstest-tests Replace ExpectedException with Assert.Throws 3.0/5 → 5.0/5 🟢 ✅ writing-mstest-tests; tools: skill 🟡 0.29
writing-mstest-tests Use proper collection assertions 3.0/5 → 3.0/5 ✅ writing-mstest-tests; tools: skill 🟡 0.29 [7]
writing-mstest-tests Use proper type assertions instead of casts 3.0/5 → 5.0/5 🟢 ✅ writing-mstest-tests; tools: skill 🟡 0.29
writing-mstest-tests Set up test lifecycle correctly 3.0/5 → 4.0/5 🟢 ✅ writing-mstest-tests; tools: skill 🟡 0.29
writing-mstest-tests Use DynamicData with ValueTuples over object arrays 1.0/5 → 5.0/5 🟢 ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.29
crap-score Calculate CRAP score for a single method with partial coverage 4.0/5 → 5.0/5 🟢 ✅ crap-score; tools: skill ✅ 0.13
crap-score Identify riskiest methods across a file 4.0/5 → 5.0/5 🟢 ✅ crap-score; tools: skill, glob / ✅ crap-score; tools: skill ✅ 0.13
crap-score Generate coverage then compute CRAP score 4.0/5 → 4.0/5 ✅ crap-score; tools: skill ✅ 0.13 [8]
code-testing-agent Generate tests for ContosoUniversity ASP.NET Core MVC app 3.0/5 → 3.0/5 ✅ code-testing-agent; tools: skill, grep / ✅ code-testing-agent; tools: skill ✅ 0.02
test-anti-patterns Detect mixed severity anti-patterns in repository service tests 5.0/5 → 5.0/5 ⚠️ NOT ACTIVATED ✅ 0.07 [9]
test-anti-patterns Detect flakiness indicators and test coupling 3.0/5 → 4.0/5 🟢 ✅ test-anti-patterns; tools: report_intent, skill ✅ 0.07
test-anti-patterns Detect duplicated tests and magic values 3.0/5 → 4.0/5 ⏰ 🟢 ✅ test-anti-patterns; tools: report_intent, skill ✅ 0.07
test-anti-patterns Recognize well-written tests without inventing false positives 2.0/5 → 5.0/5 🟢 ✅ test-anti-patterns; tools: report_intent, skill ✅ 0.07
directory-build-organization Organize build infrastructure for a multi-project repo 3.0/5 → 5.0/5 🟢 ✅ directory-build-organization; tools: skill, read_agent / ⚠️ NOT ACTIVATED
msbuild-antipatterns Review MSBuild files for anti-patterns and style issues 3.0/5 ⏰ → 1.0/5 ⏰ 🔴 ✅ msbuild-antipatterns; tools: skill, glob / ⚠️ NOT ACTIVATED ✅ 0.06
binlog-generation Build project with /bl flag 1.0/5 → 5.0/5 🟢 ✅ binlog-generation; tools: skill [10]
binlog-generation Build with /bl in PowerShell 4.0/5 → 5.0/5 🟢 ✅ binlog-generation; tools: skill
binlog-generation Build multiple configurations with unique binlogs 5.0/5 → 5.0/5 ✅ binlog-generation; tools: skill / ⚠️ NOT ACTIVATED [11]
msbuild-server Recommend MSBuild Server for slow CLI incremental builds 3.0/5 → 5.0/5 🟢 ✅ msbuild-server; tools: skill / ✅ msbuild-server; tools: skill, bash 🟡 0.37
binlog-failure-analysis Diagnose build failures from binlog only (no source files) 1.0/5 ⏰ → 1.0/5 ⏰ ✅ binlog-failure-analysis; tools: skill ✅ 0.05
incremental-build Analyze incremental build issues 2.0/5 ⏰ → 1.0/5 ⏰ 🔴 ✅ incremental-build; tools: skill, bash ✅ 0.13
check-bin-obj-clash Diagnose bin/obj output path clashes 5.0/5 → 5.0/5 ✅ check-bin-obj-clash; tools: skill ✅ 0.14 [12]
build-perf-diagnostics Diagnose slow build for a small project 3.0/5 → 1.0/5 ⏰ 🔴 ⚠️ NOT ACTIVATED ✅ 0.16
eval-performance Analyze MSBuild evaluation performance issues 4.0/5 → 4.0/5 ✅ eval-performance; tools: skill, bash / ✅ eval-performance; tools: skill ✅ 0.15
resolve-project-references Explain misleading ResolveProjectReferences time 3.0/5 → 5.0/5 🟢 ✅ resolve-project-references; tools: skill ✅ 0.14
build-perf-baseline Establish build performance baseline and recommend optimizations 3.0/5 → 4.0/5 🟢 ✅ build-perf-baseline; tools: skill, glob / ⚠️ NOT ACTIVATED 🟡 0.26
build-parallelism Analyze build parallelism bottlenecks 4.0/5 → 1.0/5 ⏰ 🔴 ✅ build-parallelism; tools: skill, glob / ⚠️ NOT ACTIVATED ✅ 0.14
msbuild-modernization Modernize legacy project to SDK-style 5.0/5 → 5.0/5 ✅ msbuild-modernization; tools: skill ✅ 0.06 [13]
including-generated-files Diagnose generated file inclusion failure 3.0/5 → 5.0/5 🟢 ⚠️ NOT ACTIVATED / ✅ including-generated-files; tools: skill 🟡 0.26

[1] (Plugin) Quality unchanged but weighted score is -0.7% due to: tokens (24255 → 31041)
[2] (Plugin) Quality unchanged but weighted score is -3.4% due to: time (13.0s → 21.8s), tokens (36760 → 49120)
[3] (Isolated) Quality unchanged but weighted score is -3.5% due to: tokens (24791 → 39453), tool calls (3 → 4)
[4] (Plugin) Quality unchanged but weighted score is -0.5% due to: tokens (24237 → 30041)
[5] (Plugin) Quality unchanged but weighted score is -1.9% due to: tokens (246590 → 511374), tool calls (19 → 29), time (97.3s → 133.6s)
[6] (Plugin) Quality unchanged but weighted score is -0.8% due to: tokens (12068 → 15051)
[7] (Plugin) Quality unchanged but weighted score is -16.3% due to: quality, tokens (12370 → 33755), tool calls (0 → 1), time (11.6s → 15.5s)
[8] (Isolated) Quality unchanged but weighted score is -17.2% due to: judgment, quality, tokens (298616 → 403411)
[9] (Plugin) Quality unchanged but weighted score is -1.1% due to: tokens (13343 → 16287)
[10] (Isolated) Quality improved but weighted score is -2.6% due to: tokens (49026 → 68308), time (87.6s → 109.0s)
[11] (Plugin) Quality unchanged but weighted score is -1.4% due to: tokens (36532 → 47830)
[12] (Plugin) Quality unchanged but weighted score is -10.0% due to: tokens (69215 → 350689), tool calls (8 → 21), time (36.3s → 114.4s)
[13] (Plugin) Quality unchanged but weighted score is -2.7% due to: tokens (73707 → 126120)

⏰ timeout: run(s) hit the scenario timeout limit (120s, 160s, 180s, 240s, 300s, 360s); scoring may be impacted because model execution was aborted before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

@JanKrivanek JanKrivanek merged commit d861bbf into main Apr 2, 2026
31 checks passed
@JanKrivanek JanKrivanek deleted the dev/jankrivanek/agentviz-integration-poc branch April 2, 2026 15:40
sayedihashimi pushed a commit to sayedihashimi/skills that referenced this pull request Apr 20, 2026
* Add AGENTVIZ session replay integration

- Add workflow_dispatch trigger to evaluation.yml
- Add --keep-sessions to skill-validator evaluate step
- Add publish-session-data job (mirrors publish-token-data)
- Add replay link to PR comments (comment-on-pr)
- Add AGENTVIZ SPA build/deploy to deploy-dashboard job
- Add setup-node step to deploy-dashboard
- Add per-plugin Sessions Visualisation links to dashboard.js
- Create build-replay-sessions.ps1 (manifest generation from sessions.db)
- Create purge-replay-sessions.ps1 (7-day retention management)

* Address code review feedback on PR dotnet#494

- Use [IO.Path]::PathSeparator instead of hardcoded ';' for cross-platform PATH
- Compute cutoff date in UTC for correct retention comparisons
- Precompute ID HashSet before merge loop to avoid O(n^2)
- Pin actions/setup-node to commit SHA (49933ea5...#v4)
- Pin AGENTVIZ clone to commit SHA with verification
- Skip npm ci + build when deployed commit SHA matches pinned SHA

* Remove hardcoded AGENTVIZ SHA; use cache + zero-clone checks

- Resolve AGENTVIZ target SHA via git ls-remote (no clone)
- Read deployed SHA via curl from raw.githubusercontent.com (no clone)
- Cache build output keyed by commit SHA (skip npm ci+build on hit)
- Only clone AGENTVIZ repo on cache miss when SPA needs rebuild

* Address round 2 code review feedback on PR dotnet#494

- Compare scheduled dir dates at day granularity (dirDate.Date vs cutoffDate.Date)
- Fix useTempDir null-safe path comparison via GetFullPath + null guard
- Use UTC for dateTag in build-replay-sessions.ps1 (consistent with purge)
- Fix cache/deploy path mismatch: deploy from /tmp/agentviz-dist (cache path)
- Deterministic clone: full clone + git checkout TARGET_SHA (fail on mismatch)
- URL-encode manifest param in PR comment link (jq @uri)
- Derive manifest generated timestamp from newest session mtime to avoid churn