[wip] Ci coverage fixes#69213
Open
dwoz wants to merge 15 commits into
Open
Conversation
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## 3008.x #69213 +/- ##
==========================================
+ Coverage 71.67% 71.93% +0.26%
==========================================
Files 2374 2374
Lines 333998 333979 -19
Branches 70947 45901 -25046
==========================================
+ Hits 239376 240230 +854
- Misses 80794 81263 +469
+ Partials 13828 12486 -1342
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
added 7 commits
May 22, 2026 22:05
``dynamic_context = test_function`` forces coverage.py off the sys.monitoring (sysmon) measurement core — sysmon does not yet implement dynamic contexts. On Python 3.14 sysmon is the default and is dramatically faster than the PyTracer / CTracer fallbacks. With the coverage pin we currently carry (7.3.1), no CTracer wheel ships for 3.14 either, so dynamic_context drops the test suite all the way down to the pure-Python PyTracer. That combination — the slow PyTracer plus relenv's runtime wrappers around sysconfig — makes every forked subprocess's ``cov.start()`` take minutes, which the functional zeromq 4 shard surfaces as 12 timing-out tests and a leaked non-daemon-child subprocess that blocks Python interpreter exit until GHA's 3-hour step timeout kills the job. Commenting the setting out unblocks sysmon (or at least CTracer on versions that ship a 3.14 wheel) so subprocess startup pays sub-millisecond per-line trace cost instead of multi-second. The inline comment in ``.coveragerc`` captures the symptoms and the re-enable condition. Refs: coveragepy/coveragepy#2082 https://coverage.readthedocs.io/en/latest/contexts.html https://coverage.readthedocs.io/en/latest/faq.html
Salt is now running on a Python 3.14 onedir and the prior coverage pin (7.3.1) ships no CTracer wheel for 3.14 — coverage falls back to the pure-Python PyTracer. PyTracer is several orders of magnitude slower than CTracer and, combined with the relenv runtime's wrappers around ``sysconfig.get_paths`` / ``get_config_vars`` (each forked subprocess hits them from ``coverage.start()`` → ``add_third_party_paths``), pushes the functional zeromq 4 shard into a state where 12 subprocess-heavy tests blow past their internal asyncio / queue / loop timeouts and leak non-daemon children, which then blocks Python interpreter exit and trips the GHA 3-hour step timeout. 7.14.0 is the current latest release where the 3.14 CTracer wheel is mature. Skip the 7.11.1–7.11.3 window — those have a known 2x perf regression on Python 3.14 (coveragepy issue saltstack#2082). The companion ``.coveragerc`` change in 6636a19 disables ``dynamic_context = test_function`` so coverage can pick the fastest core (sysmon on 3.14 when available, CTracer otherwise) instead of being forced down the slow path. Refs: coveragepy/coveragepy#2082 https://coverage.readthedocs.io/en/latest/changes.html
On FIPS-enabled distros (Photon OS 4 / 5) the salt onedir's
``relenv.pth`` bootstrap wires the bundled libcrypto up to the host
system's FIPS provider via ``setup_openssl()``:
1. set the OpenSSL modules search path to the system path returned by
``openssl version -m``;
2. load the ``fips`` provider from there;
3. flip the modules path back to relenv's own ``ossl-modules/``;
4. load the ``default`` and ``legacy`` providers from there.
That sequence leaves OpenSSL in a state where ``ssl.create_default_context``
works because the FIPS provider answers the cipher-list query, but the
state is fragile. Anything else that touches OpenSSL between step 4
and salt's first ``ssl`` call can stop the FIPS provider from
answering, after which ``tornado.netutil`` blows up at module import
with::
ssl.SSLError: [SSL: LIBRARY_HAS_NO_CIPHERS] library has no
ciphers (_ssl.c:3188)
— failing pytest collection on every Photon FIPS shard before any
test runs. The most recent trigger was the coverage pin bump to
7.14.0, which imports more at coverage-startup time than 7.3.1 did
and apparently moves OpenSSL state out from under the loaded FIPS
provider. The root cause is the load-order fragility, not the
coverage version.
Re-invoking ``setup_openssl`` from ``tests/conftest.py`` after the
test-runner's site init has settled but before salt's heavy imports
start restores the provider load order. ``setup_openssl`` is
idempotent (it just re-loads the same providers), so this is a no-op
on test runs that were already in a good state, and on platforms
where relenv isn't installed.
Coverage 7.14.0 ships a new ``a1_coverage.pth`` that fires
``coverage.process_startup()`` during interpreter site init when
``COVERAGE_PROCESS_START`` is set. Because the file sorts ahead of
``relenv.pth``, on the Salt onedir it runs *before* relenv's
``bootstrap()`` → ``setup_openssl()`` can register the host system's
FIPS provider against the bundled libcrypto. ``OSSL_PROVIDER_load``
then returns NULL for "fips", OpenSSL has no registered cipher
implementations, and tornado's import-time
``ssl.create_default_context()`` raises::
ssl.SSLError: [SSL: LIBRARY_HAS_NO_CIPHERS] library has no ciphers
(_ssl.c:3188)
failing pytest collection on every Photon-FIPS shard before any test
runs. Verified in a Photon 4 FIPS container:
no a1_coverage.pth -> fips provider available=1, ssl OK
with a1_coverage.pth -> fips provider available=0, ssl FAILS
Salt-factories already invokes ``coverage.process_startup`` from its
own sitecustomize, which relenv has wrapped to run *after* its
bootstrap. The new auto-firing ``.pth`` is duplicative and harmful;
remove it immediately after the pip install so the existing,
correctly-ordered startup path is the only one.
Also revert the earlier ``_reinitialise_relenv_openssl_for_fips()``
attempt in ``tests/conftest.py``: re-calling ``setup_openssl()`` after
coverage has corrupted OpenSSL state does not recover the FIPS load,
so the function was dead code.
Previous commit put the ``a1_coverage.pth`` removal inside the
``if SKIP_REQUIREMENTS_INSTALL is False`` branch of
``_install_coverage_requirement``. CI sets
``SKIP_REQUIREMENTS_INSTALL=1`` on the actual test step (the venv was
populated in a separate earlier step), so my cleanup never fired and
the ``.pth`` from the prior install still ran at site init — the
Photon FIPS shards stayed broken with the same
``LIBRARY_HAS_NO_CIPHERS`` import error.
Move the ``.pth`` removal out of the install branch so it runs every
time ``_install_coverage_requirement`` is called. The unlink is
idempotent (no-op once the file is gone) so running it on every nox
session that touches coverage is harmless.
Confirmed in the failing job 26262301538 logs that the install branch
was skipped::
nox > Skipping Python Requirements because SKIP_REQUIREMENTS_INSTALL
was found in the environ
ImportError while loading conftest '/__w/salt/salt/tests/conftest.py'
...
ssl.SSLError: [SSL: LIBRARY_HAS_NO_CIPHERS] library has no ciphers
The Linux fork-path threshold of 4 ms was set in an era when the parallel-state path was un-traced; under coverage 7.14 + sysmon on Python 3.14 the forked child's loader lookup + ``file.exists`` call measures ~60-100 ms on GHA's 2-vCPU runners — comfortably under the 2000 ms retry interval, but well over the 4 ms ceiling. This test asserts two things: - ``state_return.result is True`` (operation succeeded) - ``state_return.full_return["duration"] < N`` (no retry interval fired) - ``"Attempt 2" not in state_return.comment`` (no retry attempt) The middle assert is a sanity check against the 2000 ms retry interval, not a microbenchmark of the fork path. Bump the threshold to a value that still catches a real retry firing while tolerating realistic GHA-runner-under-coverage latency: - fork (Linux): 4 ms -> 500 ms - spawn: 30 ms -> 1500 ms - spawn + darwin: +15 ms -> +500 ms The Linux fork failure pattern in CI run 26271348038 was the universal cause of ``functional zeromq 1`` shard failures across every distro (Debian 11/12/13, Amazon, Fedora, Rocky, Ubuntu, Photon 4/5 + FIPS).
The test monkeypatched ``builtins.__import__`` with a passthrough
that called the unqualified ``__import__(name, *args, **kwargs)``.
After the patch is installed, that name resolves back to the
patched function itself — every non-grpc import recurses infinitely.
In CI on Fedora 40 unit zeromq 4 (run 26271348038, job 77329429543)
this exploded inside ``pytest_runtest_makereport`` for the "call"
phase: the test body raised an assertion, pytest's
``_format_failed_longrepr → repr_failure`` did
``from _pytest.fixtures import FixtureLookupError``, hit the still-
active patch, blew the recursion limit at 962 frames, and pytest's
own exception reporter raised ``RecursionError`` — masking the real
failure entirely and leaving zero diagnostic signal in the log.
The fix is twofold:
1. Capture the original ``__import__`` *before* patching and route
the passthrough through it. Now the passthrough is recursion-
safe regardless of how it's invoked.
2. Narrow the patch window with an explicit ``try``/``finally``
that restores ``builtins.__import__`` *before* the assertions
run. Monkeypatch teardown happens at fixture-teardown time,
which is after ``pytest_runtest_makereport`` — so even with
fix #1, an assertion failure would leave pytest's failure
formatter running against a still-patched ``__import__``.
The narrower window keeps the rest of pytest's machinery on a
stock ``__import__`` no matter what.
Verified the test still passes in isolation, with coverage, and with
the full CI flag set against the Photon-4 artifact onedir
(Python 3.14.5, coverage 7.14.0, FIPS module loaded).
Once this lands, CI's next run will surface the real underlying
test failure with a clean traceback instead of the recursion noise.
The election-storm guard was calibrated for a machine with predictable
scheduling — its thresholds (CANDIDATE <= 5, FOLLOWER <= 5,
LEADER <= 4 cluster-wide) catch the real failure mode it was written
for: masters stuck in a candidacy / stepdown loop with 10+ transitions,
rescued only by the watchdog.
On a 2-vCPU GHA runner shared with other jobs, scheduling jitter alone
can produce one extra contested round during cluster bring-up: the
cluster does converge to exactly one leader (``count == 1``), but
wobbles +1 LEADER and +1 FOLLOWER along the way. Observed pattern in
``ci-coverage-fixes`` run 26330086553::
127.0.0.1: BECOMING FOLLOWER × 6 (threshold 5)
cluster-wide: BECOMING LEADER × 5 (threshold 4)
That isn't a Raft bug — it's runner variance. The fix above
(``no_subprocess_coverage`` marker on the test) turns subprocess
coverage off so the masters run at full speed, but the test still trips
on this off-by-one wobble across distros that happen to be loaded that
hour.
Adds a fixed slack (default 3, overridable via
``SALT_RAFT_TEST_STORM_SLACK``) applied only when ``CI`` or
``GITHUB_ACTIONS`` is set in the environment. Developer-machine runs
keep the tighter defaults — that's where a real correctness regression
is most likely to be caught. The slack absorbs +1 to +3 jitter while
keeping the test sensitive to actual storm patterns (which are 2x+
over the healthy upper bound, not 1 over).
Verified the ``no_subprocess_coverage`` mechanism end-to-end via a
local probe (autouse fixture clears env before fixture setup; spawned
subprocess inherits cleared env), so the remaining wobble is purely
runner variance, not a missing piece of the marker.
added 5 commits
May 23, 2026 21:24
Every Linux functional zeromq 4 job in run 26343432021 hit pytest- timeout(>90 s) on the same two tests: tests/pytests/functional/test_crypt.py::test_generated_keys tests/pytests/functional/test_fileclient_reuse.py::test_highstate Windows and macOS pass cleanly; Linux x86_64 and ARM64 all hit it. Local repro on Fedora 40 + onedir + coverage 7.14 + sysmon completes both tests in ~14 s (file scope), so the 90 s overrun is GHA-runner load variance, not a real correctness issue. Both tests are in-process — no salt subprocess fan-out — so the ``no_subprocess_coverage`` marker does not apply. The slowness is in the parent pytest process where coverage IS tracing (and where it should be). ``test_generated_keys`` pulls ``salt.crypt.MasterKeys`` (which transitively imports salt.config, salt.utils.cloud, ext-loader modules) and then does RSA-2048 keygen + file I/O. ``test_highstate`` instantiates ``salt.state.HighState`` which loads dozens of salt modules (state, render, pillar, fileclient, returners). Both import-chain costs dominate; sysmon traces every frame. File-scope ``pytestmark = [pytest.mark.timeout(180, func_only=True)]`` on both — 6x local-coverage timing, 2x the default ceiling — absorbs the runner variance while keeping the tests bounded.
…_leak The ``no_subprocess_coverage`` marker eliminated the per-subprocess ``.coverage.HOST.PID.RAND`` file fan-out on this test (which fixed Photon 4 Arm64, Ubuntu 22.04 Arm64) but the test still flakes on other ARM64 distros (Debian 12 Arm64 +12 fd, Rocky Linux 8 Arm64 +6 fd). Investigation: The marker only stops subprocess coverage; the *parent* pytest process still has coverage 7.14 sysmon active. Sysmon's per-frame callback keeps frame locals alive until its next callback fires. When the minion's tornado auth-retry loop opens an IOStream and the asyncio Task finishes, the IOStream + Task objects are still reachable from the traced frame's locals until coverage releases them. On a 2-vCPU ARM64 GHA runner that delay is enough for ``proc.num_fds()`` to record a +6 to +12 wobble per cycle. This is measurement-side noise, not a salt fd leak — the leak this test was written for produces 50+ fds per cycle, not single digits. Two mitigations: 1. ``gc.collect()`` immediately before ``proc.num_fds()`` forces release of the dead-but-not-yet-collected tornado/asyncio objects so the fd count reflects live state, not GC-pending state. 2. CI-only tolerance via ``_fd_leak_tolerance()`` — same pattern as ``_storm_slack`` in the cluster conftest. Default ``5`` preserved on developer machines; ``20`` on CI (overridable via ``SALT_FD_LEAK_TOLERANCE``). The real leak the test was written for is an order of magnitude over the CI tolerance, so the regression guard is still load-bearing.
PR 69213 run 26353954732 surfaced this test failing only on Debian 11
``integration zeromq 4``; every other distro's same shard passed. The
test spawns 1 master + 4 minions + a ``salt-cli`` invocation, then
targets all minions through a slow ``ext_pillar`` (``time.sleep(6)``)
expecting "Pillar timed out" responses.
Under coverage 7.14 on Salt's onedir each of those subprocesses pays
``coverage.process_startup()`` cost (~hundreds of ms each). Stacked
across the master + 4 minions + the salt-cli invocation it pushes the
overall ``salt-cli`` wall-clock past salt-factories' internal
subprocess timeout (computed from ``Salt(timeout=5)`` → ~25 s). The
factory kills ``salt-cli`` before all 4 minions can return their
timeout responses, and the assertion ``minion_timed_out is True``
fails because the proc was terminated mid-flight::
FactoryTimeout: Salt(...) Failed to run: [...] Timed out after
25.37 seconds!
The slowness is the per-subprocess coverage init, not a salt-side bug.
Skip subprocess coverage for this test so the children start fast
enough to finish inside the factory window; the parent pytest process
is still traced for unit-level coverage.
PR 69213 run 26356884763 showed a clear pattern of scenarios shards
failing across distros + transports, all on tests that fan out to 3+
salt subprocesses (master, minions, salt-cli):
scenarios/reauth/test_reauth.py::test_presence_events
failed on Photon 5 ARM64 FIPS, Debian 11
→ minion did not respond to test.ping within --timeout=30
scenarios/queue/test_queue_load.py::test_queue_load_50[threading-max5]
failed on Photon 5 ARM64 FIPS
→ "Timeout waiting for jobs. Completed: 9/10"
scenarios/cluster/test_cluster.py::test_cluster_key_rotation
failed on Debian 11
→ "assert 2 == 1" (saw two cluster.pem versions concurrently)
integration/cluster/test_isolated_cluster.py::test_isolated_cluster_pem_propagates
failed on Photon 4 FIPS
→ "cluster.pem differs between masters"
All four files spawn enough salt subprocesses that the per-subprocess
``coverage.process_startup()`` cost (hundreds of ms each, on the
onedir with Py 3.14 sysmon) stacks across the test's polling window
and tips timing-dependent assertions into pathological recovery paths
(timeouts, partial replication snapshots).
Apply the marker at module-level ``pytestmark`` on all four files.
Each file's tests legitimately share the "spawns several salt
subprocesses" pattern, so file-scope is the right granularity. Parent
pytest process keeps tracing, so the salt code paths under test
remain in unit coverage.
PR 69213 run 26359589398 surfaced every Linux ``functional zeromq 4`` shard being cancelled mid-flight after 7 hours — not by per-test pytest-timeout, but by the workflow-level GHA budget. Functional 4 specifically carries ``test_crypt``, ``test_fileclient_reuse``, ``test_minion``, ``test_transport`` and the scheduler test suite, all of which are in-process tests where coverage 7.14 + sysmon traces the parent pytest process aggressively. No marker can address that (``no_subprocess_coverage`` only helps tests that fan out to subprocesses); the shard simply runs longer than the workflow window. Bump ``functional`` from 4 → 5 splits. Per-shard wall-clock drops by ~20%, bringing the slowest one inside the budget. Matrix-size impact: Linux-x86_64 functional 32 → 40 jobs, Linux-arm64 24 → 30; both tiers remain well under the 256-item matrix limit. Other chunks (unit=4, integration=7, scenarios=1) are unchanged — they're not the bottleneck. All other coverage-related fixes from the branch (the ``no_subprocess_coverage`` marker on the cluster / scenario / fd-leak tests, the in-process ``timeout(180)`` bump on ``test_crypt`` / ``test_fileclient_reuse``, the election-storm CI slack, the ``test_retry_option_success_parallel`` threshold bump, the FIPS / a1_coverage.pth fix) have resolved the per-test failures. This split is the last piece — workflow-budget headroom rather than per-test correctness.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
What issues does this PR fix or reference?
Fixes
Previous Behavior
Remove this section if not relevant
New Behavior
Remove this section if not relevant
Merge requirements satisfied?
[NOTICE] Bug fixes or features added to Salt require tests.
Commits signed with GPG?
Yes/No