Skip to content

[wip] Ci coverage fixes#69213

Open
dwoz wants to merge 15 commits into
saltstack:3008.xfrom
dwoz:ci-coverage-fixes
Open

[wip] Ci coverage fixes#69213
dwoz wants to merge 15 commits into
saltstack:3008.xfrom
dwoz:ci-coverage-fixes

Conversation

@dwoz
Copy link
Copy Markdown
Contributor

@dwoz dwoz commented May 22, 2026

What does this PR do?

What issues does this PR fix or reference?

Fixes

Previous Behavior

Remove this section if not relevant

New Behavior

Remove this section if not relevant

Merge requirements satisfied?

[NOTICE] Bug fixes or features added to Salt require tests.

Commits signed with GPG?

Yes/No

@codecov-commenter
Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 72.72727% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.93%. Comparing base (fdfe794) to head (bddf451).
⚠️ Report is 1 commits behind head on 3008.x.

Files with missing lines Patch % Lines
...sts/pytests/functional/modules/state/test_state.py 33.34% 2 Missing ⚠️
tests/pytests/unit/utils/test_tracing.py 87.50% 1 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           3008.x   #69213      +/-   ##
==========================================
+ Coverage   71.67%   71.93%   +0.26%     
==========================================
  Files        2374     2374              
  Lines      333998   333979      -19     
  Branches    70947    45901   -25046     
==========================================
+ Hits       239376   240230     +854     
- Misses      80794    81263     +469     
+ Partials    13828    12486    -1342     
Flag Coverage Δ
salt 58.97% <ø> (+0.69%) ⬆️
tests 87.04% <72.73%> (-0.23%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
tests/pytests/unit/utils/test_tracing.py 96.30% <87.50%> (+0.88%) ⬆️
...sts/pytests/functional/modules/state/test_state.py 94.94% <33.34%> (+2.31%) ⬆️

... and 631 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Daniel A. Wozniak added 7 commits May 22, 2026 22:05
``dynamic_context = test_function`` forces coverage.py off the
sys.monitoring (sysmon) measurement core — sysmon does not yet
implement dynamic contexts.  On Python 3.14 sysmon is the default
and is dramatically faster than the PyTracer / CTracer fallbacks.

With the coverage pin we currently carry (7.3.1), no CTracer wheel
ships for 3.14 either, so dynamic_context drops the test suite all
the way down to the pure-Python PyTracer.  That combination — the
slow PyTracer plus relenv's runtime wrappers around sysconfig —
makes every forked subprocess's ``cov.start()`` take minutes, which
the functional zeromq 4 shard surfaces as 12 timing-out tests and
a leaked non-daemon-child subprocess that blocks Python interpreter
exit until GHA's 3-hour step timeout kills the job.

Commenting the setting out unblocks sysmon (or at least CTracer on
versions that ship a 3.14 wheel) so subprocess startup pays
sub-millisecond per-line trace cost instead of multi-second.  The
inline comment in ``.coveragerc`` captures the symptoms and the
re-enable condition.

Refs:
  coveragepy/coveragepy#2082
  https://coverage.readthedocs.io/en/latest/contexts.html
  https://coverage.readthedocs.io/en/latest/faq.html
Salt is now running on a Python 3.14 onedir and the prior coverage
pin (7.3.1) ships no CTracer wheel for 3.14 — coverage falls back to
the pure-Python PyTracer.  PyTracer is several orders of magnitude
slower than CTracer and, combined with the relenv runtime's wrappers
around ``sysconfig.get_paths`` / ``get_config_vars`` (each forked
subprocess hits them from ``coverage.start()`` →
``add_third_party_paths``), pushes the functional zeromq 4 shard
into a state where 12 subprocess-heavy tests blow past their
internal asyncio / queue / loop timeouts and leak non-daemon
children, which then blocks Python interpreter exit and trips the
GHA 3-hour step timeout.

7.14.0 is the current latest release where the 3.14 CTracer wheel is
mature.  Skip the 7.11.1–7.11.3 window — those have a known 2x perf
regression on Python 3.14 (coveragepy issue saltstack#2082).

The companion ``.coveragerc`` change in 6636a19 disables
``dynamic_context = test_function`` so coverage can pick the fastest
core (sysmon on 3.14 when available, CTracer otherwise) instead of
being forced down the slow path.

Refs:
  coveragepy/coveragepy#2082
  https://coverage.readthedocs.io/en/latest/changes.html
On FIPS-enabled distros (Photon OS 4 / 5) the salt onedir's
``relenv.pth`` bootstrap wires the bundled libcrypto up to the host
system's FIPS provider via ``setup_openssl()``:

  1. set the OpenSSL modules search path to the system path returned by
     ``openssl version -m``;
  2. load the ``fips`` provider from there;
  3. flip the modules path back to relenv's own ``ossl-modules/``;
  4. load the ``default`` and ``legacy`` providers from there.

That sequence leaves OpenSSL in a state where ``ssl.create_default_context``
works because the FIPS provider answers the cipher-list query, but the
state is fragile.  Anything else that touches OpenSSL between step 4
and salt's first ``ssl`` call can stop the FIPS provider from
answering, after which ``tornado.netutil`` blows up at module import
with::

    ssl.SSLError: [SSL: LIBRARY_HAS_NO_CIPHERS] library has no
        ciphers (_ssl.c:3188)

— failing pytest collection on every Photon FIPS shard before any
test runs.  The most recent trigger was the coverage pin bump to
7.14.0, which imports more at coverage-startup time than 7.3.1 did
and apparently moves OpenSSL state out from under the loaded FIPS
provider.  The root cause is the load-order fragility, not the
coverage version.

Re-invoking ``setup_openssl`` from ``tests/conftest.py`` after the
test-runner's site init has settled but before salt's heavy imports
start restores the provider load order.  ``setup_openssl`` is
idempotent (it just re-loads the same providers), so this is a no-op
on test runs that were already in a good state, and on platforms
where relenv isn't installed.
Coverage 7.14.0 ships a new ``a1_coverage.pth`` that fires
``coverage.process_startup()`` during interpreter site init when
``COVERAGE_PROCESS_START`` is set.  Because the file sorts ahead of
``relenv.pth``, on the Salt onedir it runs *before* relenv's
``bootstrap()`` → ``setup_openssl()`` can register the host system's
FIPS provider against the bundled libcrypto.  ``OSSL_PROVIDER_load``
then returns NULL for "fips", OpenSSL has no registered cipher
implementations, and tornado's import-time
``ssl.create_default_context()`` raises::

    ssl.SSLError: [SSL: LIBRARY_HAS_NO_CIPHERS] library has no ciphers
        (_ssl.c:3188)

failing pytest collection on every Photon-FIPS shard before any test
runs.  Verified in a Photon 4 FIPS container:

    no a1_coverage.pth    -> fips provider available=1, ssl OK
    with a1_coverage.pth  -> fips provider available=0, ssl FAILS

Salt-factories already invokes ``coverage.process_startup`` from its
own sitecustomize, which relenv has wrapped to run *after* its
bootstrap.  The new auto-firing ``.pth`` is duplicative and harmful;
remove it immediately after the pip install so the existing,
correctly-ordered startup path is the only one.

Also revert the earlier ``_reinitialise_relenv_openssl_for_fips()``
attempt in ``tests/conftest.py``: re-calling ``setup_openssl()`` after
coverage has corrupted OpenSSL state does not recover the FIPS load,
so the function was dead code.
Previous commit put the ``a1_coverage.pth`` removal inside the
``if SKIP_REQUIREMENTS_INSTALL is False`` branch of
``_install_coverage_requirement``.  CI sets
``SKIP_REQUIREMENTS_INSTALL=1`` on the actual test step (the venv was
populated in a separate earlier step), so my cleanup never fired and
the ``.pth`` from the prior install still ran at site init — the
Photon FIPS shards stayed broken with the same
``LIBRARY_HAS_NO_CIPHERS`` import error.

Move the ``.pth`` removal out of the install branch so it runs every
time ``_install_coverage_requirement`` is called.  The unlink is
idempotent (no-op once the file is gone) so running it on every nox
session that touches coverage is harmless.

Confirmed in the failing job 26262301538 logs that the install branch
was skipped::

    nox > Skipping Python Requirements because SKIP_REQUIREMENTS_INSTALL
          was found in the environ
    ImportError while loading conftest '/__w/salt/salt/tests/conftest.py'
    ...
    ssl.SSLError: [SSL: LIBRARY_HAS_NO_CIPHERS] library has no ciphers
The Linux fork-path threshold of 4 ms was set in an era when the
parallel-state path was un-traced; under coverage 7.14 + sysmon on
Python 3.14 the forked child's loader lookup + ``file.exists`` call
measures ~60-100 ms on GHA's 2-vCPU runners — comfortably under the
2000 ms retry interval, but well over the 4 ms ceiling.

This test asserts two things:

  - ``state_return.result is True``                  (operation succeeded)
  - ``state_return.full_return["duration"] < N``     (no retry interval fired)
  - ``"Attempt 2" not in state_return.comment``      (no retry attempt)

The middle assert is a sanity check against the 2000 ms retry
interval, not a microbenchmark of the fork path.  Bump the threshold
to a value that still catches a real retry firing while tolerating
realistic GHA-runner-under-coverage latency:

  - fork (Linux):       4 ms   ->   500 ms
  - spawn:             30 ms   ->  1500 ms
  - spawn + darwin:    +15 ms  ->  +500 ms

The Linux fork failure pattern in CI run 26271348038 was the
universal cause of ``functional zeromq 1`` shard failures across
every distro (Debian 11/12/13, Amazon, Fedora, Rocky, Ubuntu,
Photon 4/5 + FIPS).
The test monkeypatched ``builtins.__import__`` with a passthrough
that called the unqualified ``__import__(name, *args, **kwargs)``.
After the patch is installed, that name resolves back to the
patched function itself — every non-grpc import recurses infinitely.

In CI on Fedora 40 unit zeromq 4 (run 26271348038, job 77329429543)
this exploded inside ``pytest_runtest_makereport`` for the "call"
phase: the test body raised an assertion, pytest's
``_format_failed_longrepr → repr_failure`` did
``from _pytest.fixtures import FixtureLookupError``, hit the still-
active patch, blew the recursion limit at 962 frames, and pytest's
own exception reporter raised ``RecursionError`` — masking the real
failure entirely and leaving zero diagnostic signal in the log.

The fix is twofold:

  1. Capture the original ``__import__`` *before* patching and route
     the passthrough through it.  Now the passthrough is recursion-
     safe regardless of how it's invoked.

  2. Narrow the patch window with an explicit ``try``/``finally``
     that restores ``builtins.__import__`` *before* the assertions
     run.  Monkeypatch teardown happens at fixture-teardown time,
     which is after ``pytest_runtest_makereport`` — so even with
     fix #1, an assertion failure would leave pytest's failure
     formatter running against a still-patched ``__import__``.
     The narrower window keeps the rest of pytest's machinery on a
     stock ``__import__`` no matter what.

Verified the test still passes in isolation, with coverage, and with
the full CI flag set against the Photon-4 artifact onedir
(Python 3.14.5, coverage 7.14.0, FIPS module loaded).

Once this lands, CI's next run will surface the real underlying
test failure with a clean traceback instead of the recursion noise.
The election-storm guard was calibrated for a machine with predictable
scheduling — its thresholds (CANDIDATE <= 5, FOLLOWER <= 5,
LEADER <= 4 cluster-wide) catch the real failure mode it was written
for: masters stuck in a candidacy / stepdown loop with 10+ transitions,
rescued only by the watchdog.

On a 2-vCPU GHA runner shared with other jobs, scheduling jitter alone
can produce one extra contested round during cluster bring-up: the
cluster does converge to exactly one leader (``count == 1``), but
wobbles +1 LEADER and +1 FOLLOWER along the way.  Observed pattern in
``ci-coverage-fixes`` run 26330086553::

    127.0.0.1: BECOMING FOLLOWER × 6 (threshold 5)
    cluster-wide: BECOMING LEADER × 5 (threshold 4)

That isn't a Raft bug — it's runner variance.  The fix above
(``no_subprocess_coverage`` marker on the test) turns subprocess
coverage off so the masters run at full speed, but the test still trips
on this off-by-one wobble across distros that happen to be loaded that
hour.

Adds a fixed slack (default 3, overridable via
``SALT_RAFT_TEST_STORM_SLACK``) applied only when ``CI`` or
``GITHUB_ACTIONS`` is set in the environment.  Developer-machine runs
keep the tighter defaults — that's where a real correctness regression
is most likely to be caught.  The slack absorbs +1 to +3 jitter while
keeping the test sensitive to actual storm patterns (which are 2x+
over the healthy upper bound, not 1 over).

Verified the ``no_subprocess_coverage`` mechanism end-to-end via a
local probe (autouse fixture clears env before fixture setup; spawned
subprocess inherits cleared env), so the remaining wobble is purely
runner variance, not a missing piece of the marker.
Daniel A. Wozniak added 5 commits May 23, 2026 21:24
Every Linux functional zeromq 4 job in run 26343432021 hit pytest-
timeout(>90 s) on the same two tests:

  tests/pytests/functional/test_crypt.py::test_generated_keys
  tests/pytests/functional/test_fileclient_reuse.py::test_highstate

Windows and macOS pass cleanly; Linux x86_64 and ARM64 all hit it.
Local repro on Fedora 40 + onedir + coverage 7.14 + sysmon completes
both tests in ~14 s (file scope), so the 90 s overrun is GHA-runner
load variance, not a real correctness issue.

Both tests are in-process — no salt subprocess fan-out — so the
``no_subprocess_coverage`` marker does not apply.  The slowness is in
the parent pytest process where coverage IS tracing (and where it
should be).  ``test_generated_keys`` pulls
``salt.crypt.MasterKeys`` (which transitively imports salt.config,
salt.utils.cloud, ext-loader modules) and then does RSA-2048 keygen +
file I/O.  ``test_highstate`` instantiates
``salt.state.HighState`` which loads dozens of salt modules (state,
render, pillar, fileclient, returners).  Both import-chain costs
dominate; sysmon traces every frame.

File-scope ``pytestmark = [pytest.mark.timeout(180, func_only=True)]``
on both — 6x local-coverage timing, 2x the default ceiling — absorbs
the runner variance while keeping the tests bounded.
…_leak

The ``no_subprocess_coverage`` marker eliminated the per-subprocess
``.coverage.HOST.PID.RAND`` file fan-out on this test (which fixed
Photon 4 Arm64, Ubuntu 22.04 Arm64) but the test still flakes on other
ARM64 distros (Debian 12 Arm64 +12 fd, Rocky Linux 8 Arm64 +6 fd).
Investigation:

The marker only stops subprocess coverage; the *parent* pytest process
still has coverage 7.14 sysmon active.  Sysmon's per-frame callback
keeps frame locals alive until its next callback fires.  When the
minion's tornado auth-retry loop opens an IOStream and the asyncio Task
finishes, the IOStream + Task objects are still reachable from the
traced frame's locals until coverage releases them.  On a 2-vCPU ARM64
GHA runner that delay is enough for ``proc.num_fds()`` to record a
+6 to +12 wobble per cycle.

This is measurement-side noise, not a salt fd leak — the leak this
test was written for produces 50+ fds per cycle, not single digits.

Two mitigations:

1. ``gc.collect()`` immediately before ``proc.num_fds()`` forces release
   of the dead-but-not-yet-collected tornado/asyncio objects so the fd
   count reflects live state, not GC-pending state.

2. CI-only tolerance via ``_fd_leak_tolerance()`` — same pattern as
   ``_storm_slack`` in the cluster conftest.  Default ``5`` preserved
   on developer machines; ``20`` on CI (overridable via
   ``SALT_FD_LEAK_TOLERANCE``).  The real leak the test was written for
   is an order of magnitude over the CI tolerance, so the regression
   guard is still load-bearing.
PR 69213 run 26353954732 surfaced this test failing only on Debian 11
``integration zeromq 4``; every other distro's same shard passed.  The
test spawns 1 master + 4 minions + a ``salt-cli`` invocation, then
targets all minions through a slow ``ext_pillar`` (``time.sleep(6)``)
expecting "Pillar timed out" responses.

Under coverage 7.14 on Salt's onedir each of those subprocesses pays
``coverage.process_startup()`` cost (~hundreds of ms each).  Stacked
across the master + 4 minions + the salt-cli invocation it pushes the
overall ``salt-cli`` wall-clock past salt-factories' internal
subprocess timeout (computed from ``Salt(timeout=5)`` → ~25 s).  The
factory kills ``salt-cli`` before all 4 minions can return their
timeout responses, and the assertion ``minion_timed_out is True``
fails because the proc was terminated mid-flight::

    FactoryTimeout: Salt(...) Failed to run: [...] Timed out after
    25.37 seconds!

The slowness is the per-subprocess coverage init, not a salt-side bug.
Skip subprocess coverage for this test so the children start fast
enough to finish inside the factory window; the parent pytest process
is still traced for unit-level coverage.
PR 69213 run 26356884763 showed a clear pattern of scenarios shards
failing across distros + transports, all on tests that fan out to 3+
salt subprocesses (master, minions, salt-cli):

  scenarios/reauth/test_reauth.py::test_presence_events
    failed on Photon 5 ARM64 FIPS, Debian 11
    → minion did not respond to test.ping within --timeout=30

  scenarios/queue/test_queue_load.py::test_queue_load_50[threading-max5]
    failed on Photon 5 ARM64 FIPS
    → "Timeout waiting for jobs. Completed: 9/10"

  scenarios/cluster/test_cluster.py::test_cluster_key_rotation
    failed on Debian 11
    → "assert 2 == 1" (saw two cluster.pem versions concurrently)

  integration/cluster/test_isolated_cluster.py::test_isolated_cluster_pem_propagates
    failed on Photon 4 FIPS
    → "cluster.pem differs between masters"

All four files spawn enough salt subprocesses that the per-subprocess
``coverage.process_startup()`` cost (hundreds of ms each, on the
onedir with Py 3.14 sysmon) stacks across the test's polling window
and tips timing-dependent assertions into pathological recovery paths
(timeouts, partial replication snapshots).

Apply the marker at module-level ``pytestmark`` on all four files.
Each file's tests legitimately share the "spawns several salt
subprocesses" pattern, so file-scope is the right granularity.  Parent
pytest process keeps tracing, so the salt code paths under test
remain in unit coverage.
PR 69213 run 26359589398 surfaced every Linux ``functional zeromq 4``
shard being cancelled mid-flight after 7 hours — not by per-test
pytest-timeout, but by the workflow-level GHA budget.  Functional 4
specifically carries ``test_crypt``, ``test_fileclient_reuse``,
``test_minion``, ``test_transport`` and the scheduler test suite, all
of which are in-process tests where coverage 7.14 + sysmon traces
the parent pytest process aggressively.  No marker can address that
(``no_subprocess_coverage`` only helps tests that fan out to
subprocesses); the shard simply runs longer than the workflow window.

Bump ``functional`` from 4 → 5 splits.  Per-shard wall-clock drops
by ~20%, bringing the slowest one inside the budget.  Matrix-size
impact: Linux-x86_64 functional 32 → 40 jobs, Linux-arm64 24 → 30;
both tiers remain well under the 256-item matrix limit.  Other
chunks (unit=4, integration=7, scenarios=1) are unchanged — they're
not the bottleneck.

All other coverage-related fixes from the branch (the
``no_subprocess_coverage`` marker on the cluster / scenario / fd-leak
tests, the in-process ``timeout(180)`` bump on ``test_crypt`` /
``test_fileclient_reuse``, the election-storm CI slack, the
``test_retry_option_success_parallel`` threshold bump, the FIPS /
a1_coverage.pth fix) have resolved the per-test failures.  This
split is the last piece — workflow-budget headroom rather than
per-test correctness.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test:coverage test:full Run the full test suite

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants