[LTX-2] Run Gemma-3 Text Encoder natively in JAX via TorchAX by mbohlool · Pull Request #398 · AI-Hypercomputer/maxdiffusion

mbohlool · 2026-05-04T20:08:36Z

Description

This PR moves the LTX-2 pipeline's text encoding to TorchAX, which runs the Gemma-3 model natively in JAX, and optimizes memory usage during VAE decoding to prevent TPU out-of-memory errors. Minor PyLint warnings in the pipeline and the text encoder wrapper were also resolved during the refactor.

Key changes include:

TorchAX Integration: Replaced the eager PyTorch-based text encoder execution with the JAX-native TorchaxGemma3TextEncoder wrapping the HuggingFace Gemma3ForConditionalGeneration model. This allows full compiler optimization via JAX tracing.
VAE Memory Optimization: Updated the VAE decoding loop to conditionally apply sharding constraints depending on batch size. For batch_size <= 2, it utilizes standard VAE replication and slicing. For batch_size > 2, it dynamically disables sequential slicing and skips replication, applying NamedSharding constraints on the batch dimension of the latents across the mesh axes. This prevents JAX from trying to concatenate massive arrays on the TPU, avoiding HBM out-of-memory crashes.
Lint & Test Quality Cleanup: Addressed PyLint warnings across the pipeline, mock-patched the smoke tests to bypass loading the full 4B parameter text encoder when unnecessary, and ensured clean end-to-end execution.
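
The batch-size rule in the VAE item above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the mesh axis name `data`, the helper names `should_shard_batch` and `decode_latents`, the toy decoder, and the divisibility check are all assumptions made for the example, and the small-batch branch here simply runs the decoder without reproducing the pipeline's replication and slicing path.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

BATCH_SHARDING_THRESHOLD = 2  # from the description: batch_size > 2 switches strategy


def should_shard_batch(batch_size: int, num_devices: int) -> bool:
    """True when the batch is large and divides evenly across devices (assumed rule)."""
    return batch_size > BATCH_SHARDING_THRESHOLD and batch_size % num_devices == 0


def decode_latents(latents, decode_fn, mesh):
    """Runs decode_fn, constraining the batch axis to the mesh for large batches."""
    if not should_shard_batch(latents.shape[0], mesh.devices.size):
        # Small batch: no explicit constraint in this sketch.
        return jax.jit(decode_fn)(latents)

    # Shard dimension 0 (batch) over the "data" axis; other dimensions are replicated.
    batch_sharding = NamedSharding(mesh, P("data"))

    @jax.jit
    def sharded_decode(x):
        x = jax.lax.with_sharding_constraint(x, batch_sharding)
        return decode_fn(x)

    return sharded_decode(latents)


def toy_decode(x):
    """Hypothetical stand-in for the VAE decoder: 2x spatial upsampling."""
    return jnp.repeat(jnp.repeat(x, 2, axis=-1), 2, axis=-2)


# One-dimensional mesh over whatever devices are available (1 on a plain CPU host).
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))
latents = jnp.ones((8, 4, 3, 16, 16))  # (batch, channels, frames, height, width)
video = decode_latents(latents, toy_decode, mesh)
print(video.shape, video.sharding)
```

Keeping each batch element on its own shard means no device has to hold the full decoded batch, which is the effect the description attributes to the constraint.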

Benchmarks

The table below compares text encoding time and total generation time with the text encoder on CPU versus TorchAX. Each value is an average over repeated runs, with the furthest outlier removed for each configuration.

| Configuration | Text Encoding (CPU) | Text Encoding (TorchAX) | Text Encoding Impr. | Total Time (TE on CPU) | Total Time (TE on TorchAX) | Generation Impr. |
|---|---|---|---|---|---|---|
| Batch Size 1 (Latency Optimized) | 3.55s | 2.20s | +38.0% | 12.77s | 11.43s | +10.5% |
| Batch Size 1 (w/ Upsampler) | 3.35s | 2.60s | +22.4% | 16.04s | 15.47s | +3.6% |
| Batch Size 8 (Throughput Optimized) | 23.50s | 4.40s | +81.3% | 80.81s | 58.78s | +27.3% |
| Batch Size 8 (w/ Upsampler) | 23.60s | 4.65s | +80.3% | 113.38s | 85.61s | +24.5% |
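
The improvement columns appear consistent with a relative time reduction, (baseline - new) / baseline. The PR does not state the formula, so the sketch below is an assumption; the helper name `improvement_pct` is hypothetical and the timings are copied from the table.

```python
def improvement_pct(baseline_s: float, new_s: float) -> float:
    """Relative time reduction as a percentage of the baseline time."""
    return (baseline_s - new_s) / baseline_s * 100.0


# (configuration, CPU seconds, TorchAX seconds) for the text encoding columns.
text_encoding = [
    ("Batch Size 1 (Latency Optimized)", 3.55, 2.20),
    ("Batch Size 1 (w/ Upsampler)", 3.35, 2.60),
    ("Batch Size 8 (Throughput Optimized)", 23.50, 4.40),
    ("Batch Size 8 (w/ Upsampler)", 23.60, 4.65),
]

for name, cpu_s, torchax_s in text_encoding:
    print(f"{name}: {improvement_pct(cpu_s, torchax_s):+.1f}%")
```

Under this reading, the +81.3% text encoding figure at batch size 8 corresponds to a speedup of roughly 5.3x (23.50s / 4.40s).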

Note

Crucial VAE Memory Optimization Impact:
For Batch Size 8 (w/ Upsampler), running without the conditional VAE batch sharding constraints (enable_dynamic_vae_sharding=False) causes the generation to immediately fail with a TPU HBM Out-of-Memory (OOM) crash during VAE decoding.

By conditionally enabling enable_dynamic_vae_sharding=True for larger batches, the pipeline avoids the OOM completely and finishes in 85.61s (a +24.5% net generation speedup).

github-actions · 2026-05-04T20:08:45Z

e2e testgrid: https://8bcf50593faf4ea38060e236169827e5-dot-us-central1.composer.googleusercontent.com/dags/maxdiffusion_tpu_e2e/grid

github-actions · 2026-05-11T05:18:27Z

🤖 Hi @Perseus14, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

github-actions

## 📋 Review Summary

This PR successfully integrates TorchAX for the LTX-2 pipeline's text encoder, bringing significant performance improvements and memory optimizations on TPU. The transition from eager PyTorch to JAX-native execution is well-implemented, and the additional sharding constraints for both the text encoder and VAE are effective strategies for preventing OOM crashes.

🔍 General Feedback

TorchAX Integration: The use of TorchaxGemma3TextEncoder and the manual batch sharding logic is a great addition for efficiency.
Memory Management: Conditionally applying sharding and disabling slicing in the VAE decoding loop correctly addresses HBM issues for larger batches.
Distributed Performance: One critical observation is the explicit un-sharding of text encoder hidden states to a single device, which should be avoided to ensure optimal performance in multi-host environments.
Code Cleanliness: Small refactors to use getattr instead of broad try/except blocks will improve maintainability.
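
To illustrate the Code Cleanliness point, here is a small sketch of why `getattr` with a default can be preferable to a broad `try`/`except`. The classes and the attribute name `num_hidden_layers` are hypothetical and are not taken from the PR's code.

```python
class EncoderConfig:
    """Hypothetical config that does not define num_hidden_layers."""
    hidden_size = 1024


class BrokenConfig:
    """Hypothetical config whose attribute lookup fails for an unrelated reason."""

    @property
    def num_hidden_layers(self):
        raise ValueError("corrupt checkpoint metadata")


def layers_via_try_except(config, default=0):
    try:
        return config.num_hidden_layers
    except Exception:  # broad: also swallows the unrelated ValueError
        return default


def layers_via_getattr(config, default=0):
    # getattr with a default only suppresses AttributeError.
    return getattr(config, "num_hidden_layers", default)


# Both handle a genuinely missing attribute the same way.
assert layers_via_try_except(EncoderConfig()) == 0
assert layers_via_getattr(EncoderConfig()) == 0

# The broad except hides the real failure and returns the default.
assert layers_via_try_except(BrokenConfig()) == 0

# getattr lets the real failure surface.
try:
    layers_via_getattr(BrokenConfig())
except ValueError as err:
    print("surfaced:", err)
```

The `getattr` form states the intent (an optional attribute) in one line and avoids masking errors that should be investigated.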

mbohlool requested a review from entrpn as a code owner May 4, 2026 20:08

mbohlool force-pushed the text_encoder_tpu3 branch 5 times, most recently from a449a5c to d681d61 Compare May 6, 2026 07:44

Perseus14 reviewed May 7, 2026

Comment thread src/maxdiffusion/pipelines/ltx2/ltx2_pipeline.py Outdated

Comment thread src/maxdiffusion/pipelines/ltx2/ltx2_pipeline.py Outdated

Perseus14 added the gemini-review label May 11, 2026

github-actions Bot reviewed May 11, 2026

mbohlool force-pushed the text_encoder_tpu3 branch from d681d61 to 8caeb1c Compare May 27, 2026 21:58

Offload LTX-2 text encoder to TorchAX and resolve lint issues

7b28885

mbohlool force-pushed the text_encoder_tpu3 branch from 8caeb1c to 7b28885 Compare May 27, 2026 22:01

mbohlool wants to merge 1 commit into main from text_encoder_tpu3

2 participants
