[discrete diffusion] Add dflash pipeline#13699
Open
kashif wants to merge 23 commits into
Open
Conversation
Adds DFlashPipeline + DFlashTokenDiffusionScheduler for block-diffusion speculative decoding with a draft DFlash model and a target causal LM. Verified against the six bug patterns surfaced in the LLaDA2 review (huggingface#13598). DFlash sidesteps most of them by being batch_size=1 only and relying on the causal default for attention; the applicable patterns (huggingface#3 callback bindings, huggingface#4 EOS at first generated position, huggingface#6 inner progress-bar config preservation) are pinned by regression tests. Public surface mirrors the LLaDA2 / SDAR / IDLM conventions: lazy import, dummy objects, scheduler + output dataclass, pipeline + output dataclass, fast tests for both, scheduler doc page, pipeline doc page. Sample/train scripts under examples/discrete_diffusion/.
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
- Training: `position_ids` must span `[0, start + block_size)` so the draft's attention RoPE cos/sin covers both `k_ctx` (target_hidden, length `start`) and `k_noise` (noise_embedding, length `block_size`). Previously we passed only `arange(start, start + block_size)` which triggered a K-side broadcast mismatch on the very first batch. - Docs/examples: target loads as plain Qwen3 / Qwen3.5 (no remote code), but the draft's custom DFlashDraftModel class lives in the Hub repo's `auto_map`, so `trust_remote_code=True` is required for draft loads only. Updated the example docstring, pipeline doc page, sample script, train script, and the GPU verify script. Smoke-tested via srun on z-lab/Qwen3.5-4B-DFlash + Qwen/Qwen3.5-4B (H100): 3 steps complete, final checkpoint saved.
…rgets
The pipeline previously short-circuited to `draft.spec_generate(...)` when
the draft model exposed it (e.g. z-lab/Qwen3-8B-DFlash-b16). That path is
the upstream `dflash_generate` loop, which calls `past_key_values_target.crop()`
unconditionally — fine for full-attention targets, but on hybrid targets it
silently corrupts the linear-attention recurrent state.
Confirmed in transformers 5.8.0.dev0 at cache_utils.py:759-761:
def crop(self, max_length: int):
# We don't crop the linear attention cache, so simply do nothing here
pass
`LinearAttentionCacheLayerMixin.crop` is documented as a no-op, so any
verify loop that relies on `cache.crop()` for rollback is wrong on hybrid
attention targets. Our explicit loop already handles this via
`DFlashTokenDiffusionScheduler.snapshot_cache` / `restore_cache` plus an
accepted-prefix re-forward, and reduces to a plain `.crop()` on full-attn
targets.
Verified end-to-end on GPU after the removal:
- z-lab/Qwen3.5-4B-DFlash + Qwen/Qwen3.5-4B (hybrid attn): "2 + 2 equals 4."
- z-lab/Qwen3-8B-DFlash-b16 + Qwen/Qwen3-8B (full attn): "2 + 2 equals 4."
Fast tests: 43 passed.
e97f7ae to
a70e329
Compare
- Use `self._execution_device` instead of device detection via `parameters()` in `DFlashPipeline.__call__`; remove redundant draft-device check/warning - Remove `add_noise` from `DFlashTokenDiffusionScheduler` — it implemented MDLM-style uniform block masking (wrong algorithm for DFlash training) and was never called at inference; DFlash training uses anchor-block masking in the training recipe - Remove the four `add_noise` unit tests that covered the deleted method Co-Authored-By: Kashif Rasul <kashif@huggingface.co>
Speculative decoding with sliding-window and Mamba/linear-attention caches has no efficient general solution: snapshot/restore requires re-running the target on accepted tokens, and crop() silently no-ops on recurrent states. Removing the snapshot_cache / restore_cache / cache_has_linear_attention scheduler methods and the associated pipeline rollback logic; DFlash now requires a standard full-attention DynamicCache target model. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DynamicCache() without a config creates an empty shell with no layer objects. Models like Qwen3.5 that call has_previous_state() on the passed cache raise ValueError when they find no LinearAttentionLayer entries. Passing config=target_model.config (and config=draft_model.config) causes DynamicCache.__init__ to pre-build the correct layer types (LinearAttentionLayer for linear_attention, DynamicLayer for full_attention) from config.layer_types, matching what the model would create if it initialized the cache itself. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Collaborator
|
Hi, when I try out the example in import torch
from diffusers import DFlashPipeline
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer
draft = AutoModel.from_pretrained(
"z-lab/Qwen3-8B-DFlash-b16", trust_remote_code=True, dtype=torch.bfloat16
)
target = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
pipe = DFlashPipeline(draft_model=draft, target_model=target, tokenizer=tokenizer)
out = pipe(prompt="How many positive whole-number divisors does 196 have?")
print(out.texts[0])I get the following output: Is the output being cut off expected? |
Collaborator
|
Also, I saw that huggingface/transformers#45846 was closed, will the KV cache work correctly without it? |
Contributor
Author
|
thanks @dg845 checking |
…sers into add-dflash-pipeline
The single-anchor sampler caps training signal at ~1/512 of paper Appendix A.1,
which makes the resulting draft model accept far fewer tokens at inference than
the paper reports.
This change brings the training script in line with paper §4.2 / Appendix A.1:
- `--num_anchors` (default 512, paper) — N anchor blocks per sequence, processed
in a single forward via a sparse block-diagonal attention pattern.
- `--attention_backend {sdpa, flex_attention}` (default flex_attention) — flex
is required for N=512 (sdpa materialises a dense [B,1,N*bs,S+N*bs] mask that
OOMs even on 80GB H100s at N=512, seq=4096).
- `--no_overlap_anchors` to opt out of the paper's independent (overlapping)
anchor sampling and use stars-and-bars non-overlapping anchors instead.
- New inline helpers `sample_anchor_starts` and `build_dflash_mask`. The latter
returns a FlexAttention BlockMask or a dense additive mask depending on
backend; both encode "block b sees context < anchor[b] and its own noise only".
- `draft.config._attn_implementation` is set from the CLI flag so the draft's
per-layer attention routes through transformers' ALL_ATTENTION_FUNCTIONS.
- Loss decay weights are tiled across N blocks; the existing Eq. 4 weights
computation stays put.
Module docstring now points users at vLLM/SGLang for the §5.1 target-regenerated
training data step, which is a prerequisite for paper-comparable acceptance
length.
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
mask_mod closes over fresh anchor tensors every step but the create_block_mask machinery itself is the same, so wrap it in a module-level torch.compile once. End-to-end draft-model compile is left to Accelerate's --dynamo_backend so the user can opt in/out without script changes. Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Matches the pattern in examples/text_to_image/train_text_to_image_sdxl.py and other diffusers training examples: when a user enables Accelerate's --dynamo_backend, accelerator.unwrap_model returns the compiled wrapper and save_pretrained needs ._orig_mod instead. The helper handles both cases. Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Concrete pipeline (matching SpecForge's prepare_data.py -> regenerate_train_data.py workflow): standardise to JSONL Conversation Format, serve target via SGLang or vLLM (same OpenAI-compatible API), re-roll assistant turns with temperature 0.7-0.8 (NOT greedy — diversity helps the draft generalise), concurrency 64-128 per server. Lists three concrete tooling options users already have available. Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
… pipeline Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
Added the DFlash pipeline as a stanalone PR extracted from #12911
Fixes # (issue)
Before submitting
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.