ThunderKittens: A Simple Embedded DSL for AI Kernels
The development team is proud of its open-source contributions to artificial intelligence, yet it notes that many people still perceive FlashAttention as something alien. At a recent NeurIPS conference, the group tried to convey the key ideas to the audience, but the gap between appealing visuals and actual CUDA programming remained too wide. In response, they built a simple library called ThunderKittens, aimed at making those key technical ideas easier to express.
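To make the FlashAttention idea concrete, here is a minimal NumPy sketch of its core trick: processing key/value blocks one at a time with an online softmax, so the full attention matrix is never materialized. The function name, block size, and structure are illustrative, not the library's actual API.

```python
import numpy as np

def flash_attention(Q, K, V, block=16):
    """softmax(Q K^T / sqrt(d)) @ V, computed one K/V block at a time
    with a running max and normalizer (the online-softmax trick)."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q, dtype=np.float64)
    running_max = np.full(n, -np.inf)
    running_sum = np.zeros(n)
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = Q @ Kb.T * scale                 # one block of logits
        new_max = np.maximum(running_max, scores.max(axis=1))
        # rescale what was accumulated under the old running max
        correction = np.exp(running_max - new_max)
        p = np.exp(scores - new_max[:, None])     # block of exp-weights
        out = out * correction[:, None] + p @ Vb
        running_sum = running_sum * correction + p.sum(axis=1)
        running_max = new_max
    return out / running_sum[:, None]
```

The real kernels add tiling, shared memory, and tensor-core matmuls on top of this recurrence, but the arithmetic is the same.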
The name ThunderKittens was chosen deliberately: its cute, approachable identity contrasts with the many serious-sounding titles in the field. The central design idea is to keep the tensor cores busy, since they account for roughly 94% of the computational power of an H100. Accordingly, the fundamental object in ThunderKittens is a small matrix tile sized to fit the tensor core.
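The tile-centric view can be sketched in plain NumPy: a matmul decomposed into 16x16 tiles, the shape that maps onto a tensor-core multiply-accumulate. This is an illustration of the decomposition, not ThunderKittens code; each inner accumulation stands in for one hardware MMA instruction.

```python
import numpy as np

TILE = 16  # tensor cores consume fixed-size matrix fragments

def tiled_matmul(A, B):
    """C = A @ B, computed tile by tile. Dimensions are assumed to be
    multiples of TILE for simplicity."""
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    for i in range(0, m, TILE):
        for j in range(0, n, TILE):
            acc = np.zeros((TILE, TILE))  # register-tile accumulator
            for p in range(0, k, TILE):
                # one tensor-core-sized multiply-accumulate
                acc += A[i:i + TILE, p:p + TILE] @ B[p:p + TILE, j:j + TILE]
            C[i:i + TILE, j:j + TILE] = acc
    return C
```

Making the tile the fundamental type means the programmer thinks in exactly these units, and the library can lay them out to feed the tensor cores continuously.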
The ThunderKittens API is designed to feel PyTorch-like, making it more accessible to AI specialists. At the same time, the developers aim to expose the full power of the underlying platform (CUDA, and potentially HIP) without hiding how the accelerators work. This matters because AI processors are constantly evolving, and getting the most out of them requires avoiding software abstractions that wall off the hardware.
ThunderKittens is an embedded DSL that sits between the generality of raw CUDA and Triton's higher-level focus on AI kernels. If you are familiar with CUDA, you can easily "compile" ThunderKittens in your head. The developers have already used ThunderKittens in their previous work and decided to share it with the community.
On the 4090 and A100 GPUs, ThunderKittens matches FlashAttention-2 (FA2) performance in just a few lines of code. On the H100, ThunderKittens runs faster than FA2, demonstrating that clean code need not compromise speed. The team has built several other kernels with ThunderKittens, achieving results they could not reach with Triton.
The developers believe that anyone who has sat through a couple of hours of CUDA instruction should be able to write ThunderKittens code, which they see as a step toward making the library easy to adopt. They emphasize that this is an art project and do not guarantee that they will address all user feedback. For now, ThunderKittens exists for their own enjoyment and is useful to the team, and they hope it clarifies the key ideas.
To raise its visibility, the team released a version integrated with the NanoGPT project, which is also used in teaching. This collaboration with one of the best communicators in AI lets the developers expand the project's reach.