Skip to content

cuda : fix KQ mask offset integer overflow in flash attention MMA kernel#23610

Open
fairydreaming wants to merge 1 commit into
ggml-org:masterfrom
fairydreaming:fattn-mma-kq-mask-integer-overflow
Open

cuda : fix KQ mask offset integer overflow in flash attention MMA kernel#23610
fairydreaming wants to merge 1 commit into
ggml-org:masterfrom
fairydreaming:fattn-mma-kq-mask-integer-overflow

Conversation

@fairydreaming
Copy link
Copy Markdown
Collaborator

Overview

When using long context and large ubatch size KQ mask becomes very large during prompt processing with the number of elements exceeding INT32_MAX (for example in GLM-5.1 max context size is 202752, so with 16k ubatch we have 3321888768 KQ mask elements). This results in negative j_vram*stride_mask mask_h array offset in flash attention MMA kernel as confirmed by printfs:

j_sram*(nbatch_fa + 8) = 1600, j_vram*stride_mask = -1976369152
j_sram*(nbatch_fa + 8) = 1600, j_vram*stride_mask = -1976369152
j_sram*(nbatch_fa + 8) = 1600, j_vram*stride_mask = -1976369152
j_sram*(nbatch_fa + 8) = 1600, j_vram*stride_mask = -1976369152
j_sram*(nbatch_fa + 8) = 1640, j_vram*stride_mask = -1976221696
j_sram*(nbatch_fa + 8) = 1640, j_vram*stride_mask = -1976221696
j_sram*(nbatch_fa + 8) = 1640, j_vram*stride_mask = -1976221696
j_sram*(nbatch_fa + 8) = 1640, j_vram*stride_mask = -1976221696

Casting j_vram to int64_t so that the calculated offset is also int64_t resolves this problem.

Fixes #23574.

Requirements

@fairydreaming fairydreaming requested a review from a team as a code owner May 24, 2026 12:18
@github-actions github-actions Bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels May 24, 2026
#pragma unroll
for (int j1 = 0; j1 < ncols1; j1 += stride_j) {
const int j_sram = j1 + threadIdx.y*cols_per_warp + threadIdx.x / (warp_size/cols_per_warp);
const int j_vram = fastmodulo(j0 + j_sram, ne01);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const int64_t j_vram = fastmodulo(j0 + j_sram, ne01);

I would say it's better to do the upcast on this line instead because an overflow of a 32 bit signed integer here would be upcast to a negative 64 bit integer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ggml changes relating to the ggml tensor library for machine learning Nvidia GPU Issues specific to Nvidia GPUs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Eval bug: GLM 5.x crashes/gibberish with high ubatch and long prompts

3 participants