Japanese artificial intelligence startup Sakana AI claims it has developed an "AI CUDA Engineer," which can convert PyTorch workloads into kernels that run on NVIDIA's GPU hardware with a one- to two-order-of-magnitude speed-up.
"The AI CUDA Engineer [is] the first comprehensive agentic framework for fully automated CUDA kernel discovery and optimization," the company says of its creation. "The AI CUDA Engineer is an agentic framework that leverages frontier LLMs with the goal of automating the conversion of standard PyTorch code into highly optimized CUDA kernels. Through the use of evolutionary optimization, and leveraging concepts in evolutionary computation, such as 'crossover' operations and an 'innovation archive' to find promising 'stepping stone' kernels, our proposed framework is able to not only automate the process of converting PyTorch modules to CUDA kernels, but our highly optimized CUDA kernels often achieve speedups with significantly faster runtime."
Sakana AI says it has developed an LLM-based "AI CUDA Engineer" to dramatically improve PyTorch performance. (📷: Sakana AI)
The company claims that the CUDA kernels generated by the "engineer" run between 10 and 100 times faster than the original PyTorch for "common PyTorch operations." For projects that already use CUDA kernels, the gains are smaller, but Sakana AI still claims its software can deliver a highly optimized kernel offering up to a fivefold speed boost.
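For a sense of the kind of artifact such a pipeline produces, here is a minimal, hypothetical sketch (ours, not one of Sakana AI's generated kernels) of a hand-written CUDA kernel compiled and checked against its PyTorch reference using PyTorch's torch.utils.cpp_extension.load_inline helper:

```python
import torch
from torch.utils.cpp_extension import load_inline

# A trivial elementwise kernel, standing in for the kind of
# generated CUDA code the article describes.
cuda_src = r"""
#include <torch/extension.h>

__global__ void square_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

torch::Tensor square(torch::Tensor x) {
    auto out = torch::empty_like(x);
    int n = x.numel();
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    square_kernel<<<blocks, threads>>>(
        x.data_ptr<float>(), out.data_ptr<float>(), n);
    return out;
}
"""

cpp_src = "torch::Tensor square(torch::Tensor x);"

# Compile the kernel at runtime and expose it as a Python module.
mod = load_inline(name="square_ext", cpp_sources=cpp_src,
                  cuda_sources=cuda_src, functions=["square"])

# Correctness check against the native PyTorch result.
x = torch.randn(1_000_000, device="cuda")
assert torch.allclose(mod.square(x), x * x)
```

A generated kernel that fails this kind of numerical check against the PyTorch reference would simply be discarded as a candidate.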
These performance gains come through a three-step process: in the first step the tool translates PyTorch code into working CUDA kernels; in the second step those kernels go through a "survival of the fittest" evolutionary optimization process, which includes a kernel crossover prompting strategy capable of combining separate optimized kernels to improve performance further; finally, the system uses a long-term memory dubbed the "innovation archive" to deliver additional performance improvements.
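As a rough mental model only, the loop might look something like the Python sketch below; the llm and benchmark objects and their methods (translate_to_cuda, crossover, is_correct) are hypothetical stand-ins, not Sakana AI's actual API.

```python
import random

def evolve_kernels(pytorch_module, llm, benchmark, generations=10, population=8):
    """Hypothetical sketch of the three-step loop described above."""
    # Step 1: translate the PyTorch module into initial CUDA kernel candidates.
    candidates = [llm.translate_to_cuda(pytorch_module) for _ in range(population)]
    archive = []  # Step 3: the long-term "innovation archive" of strong kernels.
    for _ in range(generations):
        # Step 2: score candidates on runtime; drop any that fail correctness checks.
        scored = [(benchmark(k), k) for k in candidates if benchmark.is_correct(k)]
        scored.sort(reverse=True, key=lambda sk: sk[0])
        survivors = [k for _, k in scored[: population // 2]]
        archive.extend(survivors[:2])  # keep "stepping stone" kernels around
        # Crossover prompting: ask the LLM to combine two strong kernels,
        # sometimes seeding the prompt with an archived stepping stone.
        children = [
            llm.crossover(random.choice(survivors), random.choice(survivors + archive))
            for _ in range(population - len(survivors))
        ]
        candidates = survivors + children
    return max(archive + candidates, key=benchmark)
```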
The most impressive gains come from specialized workloads, though smaller performance boosts are available for full ML architectures too. (📷: Sakana AI)
The biggest gains come from relatively specific workloads, such as diagonal matrix multiplication operations, which run around 57 times faster, though the company claims the approach can also be used to optimize the performance of entire machine learning architectures: VanillaRNNHidden showed a 7.02× performance gain over native operation, Sakana AI claims, while the EfficientNetB2 vision architecture ran 1.24× faster and the LeNet5 vision architecture ran 1.4× faster. Some early test results had to be thrown out, however, with Sakana AI finding the large language model (LLM)-based engineer had found ways to "cheat" the benchmarks by reusing results from an earlier PyTorch run.
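To see why a diagonal matrix multiply can admit gains of that magnitude, consider a PyTorch-level illustration (ours, not Sakana AI's actual kernel): the naive form materializes an n×n diagonal matrix and pays for a full O(n³) matmul, while an algebraically equivalent broadcast does an O(n²) elementwise multiply.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
n = 4096
d = torch.randn(n, device=device)      # the diagonal entries
A = torch.randn(n, n, device=device)

# Naive form: builds an n-by-n matrix that is almost entirely zeros,
# then runs a full matrix multiplication over it.
naive = torch.diag(d) @ A

# Equivalent form: scale row i of A by d[i] via broadcasting, with
# no work wasted on the zero entries.
fast = d.unsqueeze(1) * A

assert torch.allclose(naive, fast, rtol=1e-3, atol=1e-3)
```

Speedups of this flavor come from algebraic restructuring rather than low-level tuning, which is why they can be so much larger than the gains seen on already-optimized whole-network workloads.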
More information on Sakana AI's work, including a paper on the project, is available on the company's website; it has also released a dataset of more than 17,000 CUDA kernels covering a wide variety of PyTorch operations.