Transform AI with the Together AI Kernels Team


The breakthrough came on a holiday weekend, Memorial Day 2022. While most of Silicon Valley was enjoying barbecues, Dan Fu, Tri Dao, and their colleagues set out to prove the AI establishment wrong. The prevailing belief was that transformer attention had already been optimized, that GPU experts had squeezed every drop of performance from the hardware. The release of FlashAttention changed that perception.

Andrej Karpathy, then Senior Director of AI at Tesla, tweeted about it, and soon the news spread through AI research channels. Dan recalls that they didn't expect such a reaction when they released their work, but Karpathy's tweet made them realize it had captured attention.

Previous research on sparsity and low-rank methods had promised large theoretical speedups but delivered only around 10% gains in practice. The FlashAttention team took a different approach, focusing on how data actually moves through the GPU memory hierarchy and how compute is scheduled around it. By applying tiling principles familiar from classic database systems, they achieved 2-3x wall-clock speedups.
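The core of that tiling idea can be sketched in plain NumPy. The snippet below is a simplified illustration, not the actual CUDA implementation: it processes K and V in fixed-size tiles and maintains a running ("online") softmax, so the full N x N score matrix never has to exist in memory at once — the property that lets the real kernel keep working data in fast on-chip SRAM.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Reference attention: materializes the full N x N score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=32):
    """FlashAttention-style pass: visit K/V one tile at a time, carrying a
    running row-wise max and softmax denominator instead of the full matrix."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)
    m = np.full(N, -np.inf)   # running row-wise max (for numerical stability)
    l = np.zeros(N)           # running softmax denominator
    for j in range(0, N, block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = (Q @ Kj.T) * scale                # scores for this tile only
        m_new = np.maximum(m, S.max(axis=-1))
        alpha = np.exp(m - m_new)             # rescale previous accumulators
        P = np.exp(S - m_new[:, None])
        l = l * alpha + P.sum(axis=-1)
        O = O * alpha[:, None] + P @ Vj
        m = m_new
    return O / l[:, None]
```

Both functions compute the same result; the tiled version simply never touches more than one `block`-sized slice of K and V at a time.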

For the researchers, the implications were clear: there remained enormous untapped potential in GPU optimization. This single paper became the foundation for what is now one of the most impactful kernel research teams in AI and a critical building block of the AI Native Cloud.

What many people miss is that having the best models and hardware is not enough. The bottleneck lies in the software layer that translates mathematical operations into instructions the silicon can execute. Kernels are that layer: written well, they extract the hardware's full throughput; written poorly, they leave the hardware idle, waiting on memory.
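A back-of-the-envelope roofline calculation shows why a kernel can leave hardware idle. Whether an operation is compute-bound or memory-bound depends on its arithmetic intensity (FLOPs per byte moved) relative to the hardware's ratio of peak compute to memory bandwidth. The figures below are hypothetical, chosen only for illustration, not those of any specific GPU:

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

# Hypothetical GPU for illustration (not a specific chip):
peak_flops = 100e12          # 100 TFLOP/s of compute
mem_bw = 2e12                # 2 TB/s of memory bandwidth
ridge = peak_flops / mem_bw  # ~50 FLOPs/byte needed to saturate compute

# An unfused elementwise add of two fp16 tensors of n elements
# reads 2*n*2 bytes, writes n*2 bytes, and performs n FLOPs:
n = 1 << 20
ai_add = arithmetic_intensity(n, 6 * n)  # ~0.17 FLOPs/byte: deeply memory-bound
```

At ~0.17 FLOPs/byte against a ridge point near 50, such a kernel uses well under 1% of peak compute — the arithmetic units sit idle while memory traffic dominates, which is exactly the gap that kernel fusion and tiling attack.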

By March 2025, our kernels team had grown to about 15 people, a mix of ML researchers and GPU veterans. We had just gained access to NVIDIA's new Blackwell GPUs, and the task was clear: produce optimized kernels in one week for hardware that NVIDIA's own engineers had spent roughly a year preparing for.

To tackle this challenge, we developed the ThunderKittens library, which significantly simplified the process of creating kernels for the new hardware generation. ThunderKittens leverages NVIDIA's tensor cores, reducing what once took over 1000 lines of code to just 100-200. As a result, within one week, we produced some of the fastest kernels for Blackwell, achieving up to 2x speedups over cuBLAS.
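The line-count reduction comes from the tile-programming model. As a rough conceptual sketch — in plain NumPy, not the actual ThunderKittens C++/CUDA API — a kernel is expressed as loads, stores, and matrix ops on small fixed-size tiles, and the library takes care of mapping those tile operations onto tensor-core instructions and memory movement:

```python
import numpy as np

TILE = 16  # fixed tile edge, analogous to a hardware-friendly tile size

def load_tile(M, i, j):
    """Fetch tile (i, j) of matrix M (stands in for a shared-memory load)."""
    return M[i * TILE:(i + 1) * TILE, j * TILE:(j + 1) * TILE]

def tiled_matmul(A, B):
    """Matmul written purely as operations on TILE x TILE tiles."""
    C = np.zeros((A.shape[0], B.shape[1]))
    for i in range(A.shape[0] // TILE):
        for j in range(B.shape[1] // TILE):
            acc = np.zeros((TILE, TILE))   # accumulator tile (registers)
            for k in range(A.shape[1] // TILE):
                # One tile-level multiply-accumulate per step,
                # the role tensor cores play in the real kernel.
                acc += load_tile(A, i, k) @ load_tile(B, k, j)
            C[i * TILE:(i + 1) * TILE, j * TILE:(j + 1) * TILE] = acc  # store
    return C
```

Everything the programmer writes is at tile granularity; the per-thread index arithmetic that fills out a 1000-line hand-written kernel is pushed into the abstraction.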
