cuda : fix KQ mask offset integer overflow in flash attention MMA kernel by fairydreaming · Pull Request #23610 · ggml-org/llama.cpp

fairydreaming · 2026-05-24T12:18:40Z

Overview

When using long context and large ubatch size KQ mask becomes very large during prompt processing with the number of elements exceeding INT32_MAX (for example in GLM-5.1 max context size is 202752, so with 16k ubatch we have 3321888768 KQ mask elements). This results in negative j_vram*stride_mask mask_h array offset in flash attention MMA kernel as confirmed by printfs:

j_sram*(nbatch_fa + 8) = 1600, j_vram*stride_mask = -1976369152
j_sram*(nbatch_fa + 8) = 1600, j_vram*stride_mask = -1976369152
j_sram*(nbatch_fa + 8) = 1600, j_vram*stride_mask = -1976369152
j_sram*(nbatch_fa + 8) = 1600, j_vram*stride_mask = -1976369152
j_sram*(nbatch_fa + 8) = 1640, j_vram*stride_mask = -1976221696
j_sram*(nbatch_fa + 8) = 1640, j_vram*stride_mask = -1976221696
j_sram*(nbatch_fa + 8) = 1640, j_vram*stride_mask = -1976221696
j_sram*(nbatch_fa + 8) = 1640, j_vram*stride_mask = -1976221696

Casting j_vram to int64_t so that the calculated offset is also int64_t resolves this problem.

Fixes #23574.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: NO

JohannesGaessler · 2026-05-24T21:37:58Z

 #pragma unroll
        for (int j1 = 0; j1 < ncols1; j1 += stride_j) {
            const int j_sram = j1 + threadIdx.y*cols_per_warp + threadIdx.x / (warp_size/cols_per_warp);
            const int j_vram = fastmodulo(j0 + j_sram, ne01);


Suggested change

const int64_t j_vram = fastmodulo(j0 + j_sram, ne01);

I would say it's better to do the upcast on this line instead because an overflow of a 32 bit signed integer here would be upcast to a negative 64 bit integer.

cuda : fix KQ mask offset integer overflow in fattn MMA kernel

cd2d821

fairydreaming requested a review from a team as a code owner May 24, 2026 12:18

github-actions Bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels May 24, 2026

JohannesGaessler reviewed May 24, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cuda : fix KQ mask offset integer overflow in flash attention MMA kernel#23610

cuda : fix KQ mask offset integer overflow in flash attention MMA kernel#23610
fairydreaming wants to merge 1 commit into
ggml-org:masterfrom
fairydreaming:fattn-mma-kq-mask-integer-overflow

fairydreaming commented May 24, 2026

Uh oh!

JohannesGaessler May 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

fairydreaming commented May 24, 2026

Overview

Requirements

Uh oh!

JohannesGaessler May 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants