RightNow AI Launches AutoKernel for GPU Code Optimization

Optimizing GPU code is one of the most challenging tasks in machine learning. Researchers from RightNow AI aim to fully automate this process. They have introduced AutoKernel, an open-source framework that uses an autonomous LLM agent loop to optimize GPU kernels for arbitrary PyTorch models. The core pitch is simple: upload a model before going to bed and wake up to faster Triton kernels, no GPU expertise required.

A GPU kernel is a function that runs in parallel across thousands of GPU cores. When running transformer models such as LLaMA or GPT-2, most compute time is spent in kernels for operations such as matrix multiplication, softmax, layer normalization, and attention. These kernels reside in libraries like cuBLAS and cuDNN, or are generated automatically by PyTorch's compilation pipeline. The challenge is that squeezing maximum performance out of them requires reasoning simultaneously about arithmetic intensity, memory coalescing, and tensor core instruction selection: skills that take years to develop.
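To make the execution model concrete, here is a pure-Python sketch of the Triton-style blocked kernel model. This is illustrative only: on a real GPU the program instances run concurrently across cores, and the kernel would be written with Triton primitives rather than Python loops.

```python
# Pure-Python sketch of the Triton-style execution model: the same
# "program" runs once per block of the input, and all instances are
# logically parallel (a GPU runs them concurrently; we loop instead).
BLOCK = 4

def add_kernel(pid, x, y, out):
    # Each program instance handles one contiguous block of elements,
    # analogous to using tl.program_id(0) to pick an offset in Triton.
    start = pid * BLOCK
    for i in range(start, min(start + BLOCK, len(x))):
        out[i] = x[i] + y[i]

def launch(x, y):
    out = [0] * len(x)
    grid = (len(x) + BLOCK - 1) // BLOCK  # number of program instances
    for pid in range(grid):               # sequential stand-in for the GPU grid
        add_kernel(pid, x, y, out)
    return out

print(launch([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))  # → [11, 22, 33, 44, 55]
```

The performance questions the article mentions (coalescing, arithmetic intensity, instruction selection) all concern how each such instance touches memory and compute, which is exactly what the agent iterates on.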

AutoKernel was built in response to the scarcity of expertise in this area. The core insight is that an expert kernel engineer's workflow is itself a simple loop: write a candidate, benchmark it, keep improvements, discard regressions, and repeat. The framework mechanizes this loop. An LLM agent modifies a single file — kernel.py — a fixed benchmark harness verifies correctness and measures throughput, and the result determines whether the change persists.
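A minimal sketch of that mechanized loop, with hypothetical stand-in functions in place of the real agent, correctness verifier, and benchmark harness (these names are illustrative, not AutoKernel's actual API):

```python
import random

def optimize(propose, is_correct, measure_ms, baseline_ms, iterations=20):
    # Sketch of the expert's loop the framework mechanizes: propose an
    # edit, verify it, benchmark it, and keep it only if it is both
    # correct and faster than the best result so far.
    best_ms, kept = baseline_ms, []
    for _ in range(iterations):
        edit = propose()                 # agent rewrites kernel.py
        if not is_correct(edit):
            continue                     # failed verification: discard
        t = measure_ms(edit)
        if t < best_ms:
            best_ms, kept = t, kept + [edit]  # improvement: keep it
    return best_ms, kept

# Toy demo: "edits" are random integers; odd ones count as incorrect,
# and larger even ones make the fake benchmark faster.
rng = random.Random(0)
best, history = optimize(
    propose=lambda: rng.randint(1, 100),
    is_correct=lambda e: e % 2 == 0,
    measure_ms=lambda e: 100.0 / e,
    baseline_ms=100.0,
)
print(best, len(history))
```

The key property of the loop is that regressions never persist: an edit either beats the current best under the fixed harness or it is thrown away.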

Each experiment maps to a git commit. Kept experiments advance the branch, while reverted experiments are cleanly erased with git reset. The entire history is browsable with standard git tools, and experiment results are logged to a plain results.tsv file, which is human-readable and easily parsed by the agent. Each iteration takes approximately 90 seconds: 30 seconds for correctness checking, 30 seconds for performance benchmarking, and 30 seconds for agent reasoning and code modification.
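As an illustration of why a plain TSV log suits both readers, here is a sketch of writing and re-parsing such a file in memory. The column names are assumptions for illustration, not the project's actual schema.

```python
import csv
import io

# Write a few experiment rows to a tab-separated log (an in-memory
# stand-in for results.tsv; column names are hypothetical).
log = io.StringIO()
writer = csv.writer(log, delimiter="\t", lineterminator="\n")
writer.writerow(["commit", "correct", "tokens_per_s", "kept"])
writer.writerow(["a1b2c3d", "yes", "1250.4", "yes"])
writer.writerow(["d4e5f6a", "no", "-", "no"])

# Tab-separated text is trivial to parse back, whether the reader is a
# human, a script, or the agent planning its next experiment.
rows = list(csv.reader(io.StringIO(log.getvalue()), delimiter="\t"))
print(rows[0])
```

The same simplicity argument applies to the git side: one experiment per commit means ordinary tools (`git log`, `git diff`, `git reset`) double as the experiment browser.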

AutoKernel also employs torch.profiler to capture per-kernel GPU time and ranks optimization targets using Amdahl's law: the overall speedup from optimizing a component is bounded by the fraction of total runtime that component represents. This lets the system focus its effort on the kernels where optimization pays off most.
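A sketch of this ranking with made-up per-kernel timings (illustrative numbers, not real profiler output): speeding up a kernel that takes fraction p of total time by a factor s yields an overall speedup of 1 / ((1 - p) + p / s), which approaches 1 / (1 - p) as s grows.

```python
# Hypothetical per-kernel GPU times (ms) for one forward pass; in the
# real system these would come from torch.profiler.
profile_ms = {"matmul": 620.0, "attention": 240.0, "softmax": 90.0, "layernorm": 50.0}
total = sum(profile_ms.values())

def amdahl_ceiling(p):
    # Limit of 1 / ((1 - p) + p / s) as the kernel speedup s -> infinity.
    return 1.0 / (1.0 - p)

# Rank kernels by their best-case overall impact.
ranked = sorted(
    ((name, ms / total, amdahl_ceiling(ms / total)) for name, ms in profile_ms.items()),
    key=lambda r: r[2],
    reverse=True,
)
for name, p, ceiling in ranked:
    print(f"{name}: {p:.0%} of runtime, best-case overall speedup {ceiling:.2f}x")
```

With these numbers, even an infinitely fast softmax kernel could improve end-to-end time by at most about 10%, while matmul dominates; that asymmetry is what the ranking exploits.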
