[wip] Ci coverage fixes by dwoz · Pull Request #69213 · saltstack/salt

dwoz · 2026-05-22T08:45:10Z

What does this PR do?

What issues does this PR fix or reference?

Fixes

Previous Behavior

Remove this section if not relevant

New Behavior

Remove this section if not relevant

Merge requirements satisfied?

[NOTICE] Bug fixes or features added to Salt require tests.

Docs
Changelog - https://docs.saltproject.io/en/master/topics/development/changelog.html
Tests written/updated

Commits signed with GPG?

Yes/No

codecov-commenter · 2026-05-22T13:00:39Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 72.72727% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.93%. Comparing base (fdfe794) to head (bddf451).
⚠️ Report is 1 commits behind head on 3008.x.

Files with missing lines	Patch %	Lines
...sts/pytests/functional/modules/state/test_state.py	33.34%	2 Missing ⚠️
tests/pytests/unit/utils/test_tracing.py	87.50%	1 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@            Coverage Diff             @@
##           3008.x   #69213      +/-   ##
==========================================
+ Coverage   71.67%   71.93%   +0.26%     
==========================================
  Files        2374     2374              
  Lines      333998   333979      -19     
  Branches    70947    45901   -25046     
==========================================
+ Hits       239376   240230     +854     
- Misses      80794    81263     +469     
+ Partials    13828    12486    -1342

Flag	Coverage Δ
salt	`58.97% <ø> (+0.69%)`	⬆️
tests	`87.04% <72.73%> (-0.23%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
tests/pytests/unit/utils/test_tracing.py	`96.30% <87.50%> (+0.88%)`	⬆️
...sts/pytests/functional/modules/state/test_state.py	`94.94% <33.34%> (+2.31%)`	⬆️

... and 631 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

``dynamic_context = test_function`` forces coverage.py off the sys.monitoring (sysmon) measurement core — sysmon does not yet implement dynamic contexts. On Python 3.14 sysmon is the default and is dramatically faster than the PyTracer / CTracer fallbacks. With the coverage pin we currently carry (7.3.1), no CTracer wheel ships for 3.14 either, so dynamic_context drops the test suite all the way down to the pure-Python PyTracer. That combination — the slow PyTracer plus relenv's runtime wrappers around sysconfig — makes every forked subprocess's ``cov.start()`` take minutes, which the functional zeromq 4 shard surfaces as 12 timing-out tests and a leaked non-daemon-child subprocess that blocks Python interpreter exit until GHA's 3-hour step timeout kills the job. Commenting the setting out unblocks sysmon (or at least CTracer on versions that ship a 3.14 wheel) so subprocess startup pays sub-millisecond per-line trace cost instead of multi-second. The inline comment in ``.coveragerc`` captures the symptoms and the re-enable condition. Refs: coveragepy/coveragepy#2082 https://coverage.readthedocs.io/en/latest/contexts.html https://coverage.readthedocs.io/en/latest/faq.html

Salt is now running on a Python 3.14 onedir and the prior coverage pin (7.3.1) ships no CTracer wheel for 3.14 — coverage falls back to the pure-Python PyTracer. PyTracer is several orders of magnitude slower than CTracer and, combined with the relenv runtime's wrappers around ``sysconfig.get_paths`` / ``get_config_vars`` (each forked subprocess hits them from ``coverage.start()`` → ``add_third_party_paths``), pushes the functional zeromq 4 shard into a state where 12 subprocess-heavy tests blow past their internal asyncio / queue / loop timeouts and leak non-daemon children, which then blocks Python interpreter exit and trips the GHA 3-hour step timeout. 7.14.0 is the current latest release where the 3.14 CTracer wheel is mature. Skip the 7.11.1–7.11.3 window — those have a known 2x perf regression on Python 3.14 (coveragepy issue saltstack#2082). The companion ``.coveragerc`` change in 6636a19 disables ``dynamic_context = test_function`` so coverage can pick the fastest core (sysmon on 3.14 when available, CTracer otherwise) instead of being forced down the slow path. Refs: coveragepy/coveragepy#2082 https://coverage.readthedocs.io/en/latest/changes.html

On FIPS-enabled distros (Photon OS 4 / 5) the salt onedir's ``relenv.pth`` bootstrap wires the bundled libcrypto up to the host system's FIPS provider via ``setup_openssl()``: 1. set the OpenSSL modules search path to the system path returned by ``openssl version -m``; 2. load the ``fips`` provider from there; 3. flip the modules path back to relenv's own ``ossl-modules/``; 4. load the ``default`` and ``legacy`` providers from there. That sequence leaves OpenSSL in a state where ``ssl.create_default_context`` works because the FIPS provider answers the cipher-list query, but the state is fragile. Anything else that touches OpenSSL between step 4 and salt's first ``ssl`` call can stop the FIPS provider from answering, after which ``tornado.netutil`` blows up at module import with:: ssl.SSLError: [SSL: LIBRARY_HAS_NO_CIPHERS] library has no ciphers (_ssl.c:3188) — failing pytest collection on every Photon FIPS shard before any test runs. The most recent trigger was the coverage pin bump to 7.14.0, which imports more at coverage-startup time than 7.3.1 did and apparently moves OpenSSL state out from under the loaded FIPS provider. The root cause is the load-order fragility, not the coverage version. Re-invoking ``setup_openssl`` from ``tests/conftest.py`` after the test-runner's site init has settled but before salt's heavy imports start restores the provider load order. ``setup_openssl`` is idempotent (it just re-loads the same providers), so this is a no-op on test runs that were already in a good state, and on platforms where relenv isn't installed.

Coverage 7.14.0 ships a new ``a1_coverage.pth`` that fires ``coverage.process_startup()`` during interpreter site init when ``COVERAGE_PROCESS_START`` is set. Because the file sorts ahead of ``relenv.pth``, on the Salt onedir it runs *before* relenv's ``bootstrap()`` → ``setup_openssl()`` can register the host system's FIPS provider against the bundled libcrypto. ``OSSL_PROVIDER_load`` then returns NULL for "fips", OpenSSL has no registered cipher implementations, and tornado's import-time ``ssl.create_default_context()`` raises:: ssl.SSLError: [SSL: LIBRARY_HAS_NO_CIPHERS] library has no ciphers (_ssl.c:3188) failing pytest collection on every Photon-FIPS shard before any test runs. Verified in a Photon 4 FIPS container: no a1_coverage.pth -> fips provider available=1, ssl OK with a1_coverage.pth -> fips provider available=0, ssl FAILS Salt-factories already invokes ``coverage.process_startup`` from its own sitecustomize, which relenv has wrapped to run *after* its bootstrap. The new auto-firing ``.pth`` is duplicative and harmful; remove it immediately after the pip install so the existing, correctly-ordered startup path is the only one. Also revert the earlier ``_reinitialise_relenv_openssl_for_fips()`` attempt in ``tests/conftest.py``: re-calling ``setup_openssl()`` after coverage has corrupted OpenSSL state does not recover the FIPS load, so the function was dead code.

Previous commit put the ``a1_coverage.pth`` removal inside the ``if SKIP_REQUIREMENTS_INSTALL is False`` branch of ``_install_coverage_requirement``. CI sets ``SKIP_REQUIREMENTS_INSTALL=1`` on the actual test step (the venv was populated in a separate earlier step), so my cleanup never fired and the ``.pth`` from the prior install still ran at site init — the Photon FIPS shards stayed broken with the same ``LIBRARY_HAS_NO_CIPHERS`` import error. Move the ``.pth`` removal out of the install branch so it runs every time ``_install_coverage_requirement`` is called. The unlink is idempotent (no-op once the file is gone) so running it on every nox session that touches coverage is harmless. Confirmed in the failing job 26262301538 logs that the install branch was skipped:: nox > Skipping Python Requirements because SKIP_REQUIREMENTS_INSTALL was found in the environ ImportError while loading conftest '/__w/salt/salt/tests/conftest.py' ... ssl.SSLError: [SSL: LIBRARY_HAS_NO_CIPHERS] library has no ciphers

The Linux fork-path threshold of 4 ms was set in an era when the parallel-state path was un-traced; under coverage 7.14 + sysmon on Python 3.14 the forked child's loader lookup + ``file.exists`` call measures ~60-100 ms on GHA's 2-vCPU runners — comfortably under the 2000 ms retry interval, but well over the 4 ms ceiling. This test asserts two things: - ``state_return.result is True`` (operation succeeded) - ``state_return.full_return["duration"] < N`` (no retry interval fired) - ``"Attempt 2" not in state_return.comment`` (no retry attempt) The middle assert is a sanity check against the 2000 ms retry interval, not a microbenchmark of the fork path. Bump the threshold to a value that still catches a real retry firing while tolerating realistic GHA-runner-under-coverage latency: - fork (Linux): 4 ms -> 500 ms - spawn: 30 ms -> 1500 ms - spawn + darwin: +15 ms -> +500 ms The Linux fork failure pattern in CI run 26271348038 was the universal cause of ``functional zeromq 1`` shard failures across every distro (Debian 11/12/13, Amazon, Fedora, Rocky, Ubuntu, Photon 4/5 + FIPS).

The test monkeypatched ``builtins.__import__`` with a passthrough that called the unqualified ``__import__(name, *args, **kwargs)``. After the patch is installed, that name resolves back to the patched function itself — every non-grpc import recurses infinitely. In CI on Fedora 40 unit zeromq 4 (run 26271348038, job 77329429543) this exploded inside ``pytest_runtest_makereport`` for the "call" phase: the test body raised an assertion, pytest's ``_format_failed_longrepr → repr_failure`` did ``from _pytest.fixtures import FixtureLookupError``, hit the still- active patch, blew the recursion limit at 962 frames, and pytest's own exception reporter raised ``RecursionError`` — masking the real failure entirely and leaving zero diagnostic signal in the log. The fix is twofold: 1. Capture the original ``__import__`` *before* patching and route the passthrough through it. Now the passthrough is recursion- safe regardless of how it's invoked. 2. Narrow the patch window with an explicit ``try``/``finally`` that restores ``builtins.__import__`` *before* the assertions run. Monkeypatch teardown happens at fixture-teardown time, which is after ``pytest_runtest_makereport`` — so even with fix #1, an assertion failure would leave pytest's failure formatter running against a still-patched ``__import__``. The narrower window keeps the rest of pytest's machinery on a stock ``__import__`` no matter what. Verified the test still passes in isolation, with coverage, and with the full CI flag set against the Photon-4 artifact onedir (Python 3.14.5, coverage 7.14.0, FIPS module loaded). Once this lands, CI's next run will surface the real underlying test failure with a clean traceback instead of the recursion noise.

The election-storm guard was calibrated for a machine with predictable scheduling — its thresholds (CANDIDATE <= 5, FOLLOWER <= 5, LEADER <= 4 cluster-wide) catch the real failure mode it was written for: masters stuck in a candidacy / stepdown loop with 10+ transitions, rescued only by the watchdog. On a 2-vCPU GHA runner shared with other jobs, scheduling jitter alone can produce one extra contested round during cluster bring-up: the cluster does converge to exactly one leader (``count == 1``), but wobbles +1 LEADER and +1 FOLLOWER along the way. Observed pattern in ``ci-coverage-fixes`` run 26330086553:: 127.0.0.1: BECOMING FOLLOWER × 6 (threshold 5) cluster-wide: BECOMING LEADER × 5 (threshold 4) That isn't a Raft bug — it's runner variance. The fix above (``no_subprocess_coverage`` marker on the test) turns subprocess coverage off so the masters run at full speed, but the test still trips on this off-by-one wobble across distros that happen to be loaded that hour. Adds a fixed slack (default 3, overridable via ``SALT_RAFT_TEST_STORM_SLACK``) applied only when ``CI`` or ``GITHUB_ACTIONS`` is set in the environment. Developer-machine runs keep the tighter defaults — that's where a real correctness regression is most likely to be caught. The slack absorbs +1 to +3 jitter while keeping the test sensitive to actual storm patterns (which are 2x+ over the healthy upper bound, not 1 over). Verified the ``no_subprocess_coverage`` mechanism end-to-end via a local probe (autouse fixture clears env before fixture setup; spawned subprocess inherits cleared env), so the remaining wobble is purely runner variance, not a missing piece of the marker.

Every Linux functional zeromq 4 job in run 26343432021 hit pytest- timeout(>90 s) on the same two tests: tests/pytests/functional/test_crypt.py::test_generated_keys tests/pytests/functional/test_fileclient_reuse.py::test_highstate Windows and macOS pass cleanly; Linux x86_64 and ARM64 all hit it. Local repro on Fedora 40 + onedir + coverage 7.14 + sysmon completes both tests in ~14 s (file scope), so the 90 s overrun is GHA-runner load variance, not a real correctness issue. Both tests are in-process — no salt subprocess fan-out — so the ``no_subprocess_coverage`` marker does not apply. The slowness is in the parent pytest process where coverage IS tracing (and where it should be). ``test_generated_keys`` pulls ``salt.crypt.MasterKeys`` (which transitively imports salt.config, salt.utils.cloud, ext-loader modules) and then does RSA-2048 keygen + file I/O. ``test_highstate`` instantiates ``salt.state.HighState`` which loads dozens of salt modules (state, render, pillar, fileclient, returners). Both import-chain costs dominate; sysmon traces every frame. File-scope ``pytestmark = [pytest.mark.timeout(180, func_only=True)]`` on both — 6x local-coverage timing, 2x the default ceiling — absorbs the runner variance while keeping the tests bounded.

…_leak The ``no_subprocess_coverage`` marker eliminated the per-subprocess ``.coverage.HOST.PID.RAND`` file fan-out on this test (which fixed Photon 4 Arm64, Ubuntu 22.04 Arm64) but the test still flakes on other ARM64 distros (Debian 12 Arm64 +12 fd, Rocky Linux 8 Arm64 +6 fd). Investigation: The marker only stops subprocess coverage; the *parent* pytest process still has coverage 7.14 sysmon active. Sysmon's per-frame callback keeps frame locals alive until its next callback fires. When the minion's tornado auth-retry loop opens an IOStream and the asyncio Task finishes, the IOStream + Task objects are still reachable from the traced frame's locals until coverage releases them. On a 2-vCPU ARM64 GHA runner that delay is enough for ``proc.num_fds()`` to record a +6 to +12 wobble per cycle. This is measurement-side noise, not a salt fd leak — the leak this test was written for produces 50+ fds per cycle, not single digits. Two mitigations: 1. ``gc.collect()`` immediately before ``proc.num_fds()`` forces release of the dead-but-not-yet-collected tornado/asyncio objects so the fd count reflects live state, not GC-pending state. 2. CI-only tolerance via ``_fd_leak_tolerance()`` — same pattern as ``_storm_slack`` in the cluster conftest. Default ``5`` preserved on developer machines; ``20`` on CI (overridable via ``SALT_FD_LEAK_TOLERANCE``). The real leak the test was written for is an order of magnitude over the CI tolerance, so the regression guard is still load-bearing.

PR 69213 run 26353954732 surfaced this test failing only on Debian 11 ``integration zeromq 4``; every other distro's same shard passed. The test spawns 1 master + 4 minions + a ``salt-cli`` invocation, then targets all minions through a slow ``ext_pillar`` (``time.sleep(6)``) expecting "Pillar timed out" responses. Under coverage 7.14 on Salt's onedir each of those subprocesses pays ``coverage.process_startup()`` cost (~hundreds of ms each). Stacked across the master + 4 minions + the salt-cli invocation it pushes the overall ``salt-cli`` wall-clock past salt-factories' internal subprocess timeout (computed from ``Salt(timeout=5)`` → ~25 s). The factory kills ``salt-cli`` before all 4 minions can return their timeout responses, and the assertion ``minion_timed_out is True`` fails because the proc was terminated mid-flight:: FactoryTimeout: Salt(...) Failed to run: [...] Timed out after 25.37 seconds! The slowness is the per-subprocess coverage init, not a salt-side bug. Skip subprocess coverage for this test so the children start fast enough to finish inside the factory window; the parent pytest process is still traced for unit-level coverage.

PR 69213 run 26356884763 showed a clear pattern of scenarios shards failing across distros + transports, all on tests that fan out to 3+ salt subprocesses (master, minions, salt-cli): scenarios/reauth/test_reauth.py::test_presence_events failed on Photon 5 ARM64 FIPS, Debian 11 → minion did not respond to test.ping within --timeout=30 scenarios/queue/test_queue_load.py::test_queue_load_50[threading-max5] failed on Photon 5 ARM64 FIPS → "Timeout waiting for jobs. Completed: 9/10" scenarios/cluster/test_cluster.py::test_cluster_key_rotation failed on Debian 11 → "assert 2 == 1" (saw two cluster.pem versions concurrently) integration/cluster/test_isolated_cluster.py::test_isolated_cluster_pem_propagates failed on Photon 4 FIPS → "cluster.pem differs between masters" All four files spawn enough salt subprocesses that the per-subprocess ``coverage.process_startup()`` cost (hundreds of ms each, on the onedir with Py 3.14 sysmon) stacks across the test's polling window and tips timing-dependent assertions into pathological recovery paths (timeouts, partial replication snapshots). Apply the marker at module-level ``pytestmark`` on all four files. Each file's tests legitimately share the "spawns several salt subprocesses" pattern, so file-scope is the right granularity. Parent pytest process keeps tracing, so the salt code paths under test remain in unit coverage.

PR 69213 run 26359589398 surfaced every Linux ``functional zeromq 4`` shard being cancelled mid-flight after 7 hours — not by per-test pytest-timeout, but by the workflow-level GHA budget. Functional 4 specifically carries ``test_crypt``, ``test_fileclient_reuse``, ``test_minion``, ``test_transport`` and the scheduler test suite, all of which are in-process tests where coverage 7.14 + sysmon traces the parent pytest process aggressively. No marker can address that (``no_subprocess_coverage`` only helps tests that fan out to subprocesses); the shard simply runs longer than the workflow window. Bump ``functional`` from 4 → 5 splits. Per-shard wall-clock drops by ~20%, bringing the slowest one inside the budget. Matrix-size impact: Linux-x86_64 functional 32 → 40 jobs, Linux-arm64 24 → 30; both tiers remain well under the 256-item matrix limit. Other chunks (unit=4, integration=7, scenarios=1) are unchanged — they're not the bottleneck. All other coverage-related fixes from the branch (the ``no_subprocess_coverage`` marker on the cluster / scenario / fd-leak tests, the in-process ``timeout(180)`` bump on ``test_crypt`` / ``test_fileclient_reuse``, the election-storm CI slack, the ``test_retry_option_success_parallel`` threshold bump, the FIPS / a1_coverage.pth fix) have resolved the per-test failures. This split is the last piece — workflow-budget headroom rather than per-test correctness.

dwoz requested a review from a team as a code owner May 22, 2026 08:45

dwoz had a problem deploying to ci May 22, 2026 08:45 — with GitHub Actions Error

dwoz added the test:full Run the full test suite label May 22, 2026

dwoz had a problem deploying to ci May 22, 2026 08:45 — with GitHub Actions Error

dwoz added the test:coverage label May 22, 2026

dwoz temporarily deployed to ci May 22, 2026 08:45 — with GitHub Actions Inactive

dwoz temporarily deployed to ci May 22, 2026 08:58 — with GitHub Actions Inactive

dwoz temporarily deployed to ci May 22, 2026 09:14 — with GitHub Actions Inactive

Daniel A. Wozniak added 7 commits May 22, 2026 22:05

dwoz force-pushed the ci-coverage-fixes branch from bddf451 to 8276cda Compare May 23, 2026 05:05

dwoz temporarily deployed to ci May 23, 2026 05:05 — with GitHub Actions Inactive

dwoz temporarily deployed to ci May 23, 2026 05:17 — with GitHub Actions Inactive

dwoz temporarily deployed to ci May 23, 2026 10:25 — with GitHub Actions Inactive

dwoz temporarily deployed to ci May 23, 2026 10:46 — with GitHub Actions Inactive

dwoz temporarily deployed to ci May 23, 2026 21:00 — with GitHub Actions Inactive

dwoz temporarily deployed to ci May 23, 2026 21:11 — with GitHub Actions Inactive

dwoz temporarily deployed to ci May 23, 2026 21:28 — with GitHub Actions Inactive

Daniel A. Wozniak added 5 commits May 23, 2026 21:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[wip] Ci coverage fixes#69213

[wip] Ci coverage fixes#69213
dwoz wants to merge 15 commits into
saltstack:3008.xfrom
dwoz:ci-coverage-fixes

dwoz commented May 22, 2026

Uh oh!

codecov-commenter commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

dwoz commented May 22, 2026

What does this PR do?

What issues does this PR fix or reference?

Previous Behavior

New Behavior

Merge requirements satisfied?

Commits signed with GPG?

Uh oh!

codecov-commenter commented May 22, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants