Together AI announces FlashAttention-4, up to 1.3× faster than cuDNN on Blackwell
TL;DR
Together AI has announced FlashAttention-4, the latest version of the FlashAttention kernel, delivering up to 1.3× faster attention performance than NVIDIA's cuDNN on Blackwell GPUs.
What changed
Together AI released FlashAttention-4, a new version of the FlashAttention kernel that optimizes the attention computation at the heart of transformer models. On NVIDIA Blackwell GPUs, it runs up to 1.3× faster than cuDNN's attention implementation.
Why it matters
Attention is one of the dominant costs in transformer training, so FlashAttention-4's 1.3× speedup over the cuDNN baseline on Blackwell translates into quicker training cycles for large models, especially in workloads like LLM fine-tuning where attention takes a sizable share of each step. A rough estimate of the end-to-end effect is sketched below.
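As a back-of-the-envelope check on what a kernel-level win buys end to end, here is a quick Amdahl's-law estimate. The 40% attention share is a hypothetical figure for illustration; substitute a number from profiling your own workload:

```python
# Amdahl's-law estimate of end-to-end gain from a 1.3x attention kernel.
# The 0.40 attention fraction is a hypothetical; profile your own workload.
attn_fraction = 0.40
kernel_speedup = 1.3
step_speedup = 1 / ((1 - attn_fraction) + attn_fraction / kernel_speedup)
print(f"End-to-end step speedup: {step_speedup:.2f}x")  # ~1.10x
```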
What to watch for
Benchmark FlashAttention-4 against cuDNN on your own Blackwell hardware rather than taking the headline number at face value, and test the integration in your PyTorch training scripts using Together AI's repository. A minimal benchmark sketch follows.
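A minimal sketch along those lines, assuming FlashAttention-4 keeps an interface similar to earlier flash-attn releases (the `flash_attn_func` import is an assumption to verify against the actual release). PyTorch's `scaled_dot_product_attention`, which dispatches to cuDNN on recent builds, stands in as the baseline:

```python
import torch
import torch.nn.functional as F

# Assumption: FlashAttention-4 exposes the same Python entry point as
# earlier flash-attn releases; verify this import against the real package.
from flash_attn import flash_attn_func

def bench(fn, iters=50):
    """Average wall time per call in ms, using CUDA events after warmup."""
    for _ in range(5):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Shapes loosely typical of LLM fine-tuning; adjust to your workload.
b, s, h, d = 4, 4096, 32, 128
q = torch.randn(b, s, h, d, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Baseline: PyTorch SDPA expects (batch, heads, seqlen, headdim),
# hence the transposes; flash-attn takes (batch, seqlen, heads, headdim).
qt, kt, vt = (x.transpose(1, 2) for x in (q, k, v))
base_ms = bench(lambda: F.scaled_dot_product_attention(qt, kt, vt, is_causal=True))
fa_ms = bench(lambda: flash_attn_func(q, k, v, causal=True))

print(f"SDPA/cuDNN: {base_ms:.3f} ms | flash-attn: {fa_ms:.3f} ms | "
      f"speedup: {base_ms / fa_ms:.2f}x")
```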
Who this matters for
- Vibe Builders: Use FlashAttention-4 to reduce latency in your custom model inference pipelines.
Harsh’s take
FlashAttention-4 is a meaningful incremental gain for infrastructure efficiency. By outperforming cuDNN on Blackwell hardware, it offers a clear path to lower compute costs and faster iteration cycles for anyone training large transformer models. The performance delta is large enough to justify immediate testing in production environments where attention bottlenecks currently limit throughput.
Operators should prioritize integrating this update into existing PyTorch workflows to capture the 1.3× speedup; a drop-in integration sketch follows. While the gains are specific to the Blackwell architecture, faster attention is a fundamental improvement for model scalability. Benchmark your own workloads against cuDNN to verify the results before committing to a full migration of your training stack.
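For the integration step, here is a sketch of a drop-in attention module that prefers the flash-attn kernel when it is available and falls back to PyTorch SDPA otherwise. The `flash_attn` import is again an assumption about FlashAttention-4's packaging, not a confirmed API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

try:
    # Hypothetical import: assumes FlashAttention-4 keeps the flash-attn
    # package's entry point; verify against the actual release.
    from flash_attn import flash_attn_func
    HAS_FLASH = True
except ImportError:
    HAS_FLASH = False

class Attention(nn.Module):
    """Causal self-attention that prefers flash-attn, falling back to SDPA."""

    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.heads, self.headdim = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        if HAS_FLASH and x.is_cuda and x.dtype in (torch.float16, torch.bfloat16):
            # flash-attn expects (batch, seqlen, heads, headdim)
            shape = (b, s, self.heads, self.headdim)
            out = flash_attn_func(q.view(shape), k.view(shape), v.view(shape),
                                  causal=True)
            out = out.reshape(b, s, d)
        else:
            # Baseline path: PyTorch SDPA, (batch, heads, seqlen, headdim)
            q, k, v = [t.view(b, s, self.heads, self.headdim).transpose(1, 2)
                       for t in (q, k, v)]
            out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
            out = out.transpose(1, 2).reshape(b, s, d)
        return self.proj(out)
```

Keeping the fallback path means the same module runs on non-Blackwell hardware or CPU, so you can A/B the two paths on identical inputs before migrating.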
by Harsh Desai
More AI news
- Higgsfield Launches Supercomputer for Creative Pipelines
Higgsfield released Supercomputer, which runs entire creative pipelines from a single chat agent.