Sakana AI & NVIDIA: TwELL Boosts Inference 20.5% with CUDA

You painstakingly prune your state-of-the-art LLM, achieving an astonishing 95% activation sparsity. The theoretical promise of “doing less” computation whispers of lightning-fast inference and dramatically reduced energy bills. Yet, when you deploy this leaner model to production, the stark reality hits: inference times actually increase. Profilers reveal an insidious overhead from sparse matrix operations, a frustrating paradox where reducing computation leads to slower execution. This isn’t an isolated incident; it’s a recurring nightmare for AI engineers chasing efficiency on modern hardware.

The culprit often lies in the fundamental mismatch between the concept of sparsity and the architecture of GPUs. GPUs, with their massively parallel, tiled matrix multiplication engines, thrive on dense, regular computations. Traditional methods for handling unstructured sparsity — where non-zero elements appear unpredictably — introduce scatter-gather operations and conditional logic that disrupt these efficient pipelines, negating theoretical FLOP reductions with costly memory access and kernel overhead.
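To see the paradox concretely, the short PyTorch sketch below compares a dense matmul against the same weights stored in a generic CSR sparse format at 95% sparsity. This is an illustrative benchmark, not the TwELL implementation; exact timings depend on your GPU and library versions, but on many devices the generic sparse path loses despite doing roughly 20x fewer multiply-adds.

```python
import torch

def bench(fn, iters=50):
    """Average CUDA kernel time in milliseconds, with proper synchronization."""
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

n = 4096
w = torch.randn(n, n, device="cuda")
w = w * (torch.rand(n, n, device="cuda") > 0.95)  # zero out ~95% of the weights
w_csr = w.to_sparse_csr()                         # generic sparse format (not TwELL)
x = torch.randn(n, 512, device="cuda")

print(f"dense : {bench(lambda: w @ x):.3f} ms")      # ignores the zeros, stays fast
print(f"sparse: {bench(lambda: w_csr @ x):.3f} ms")  # often slower despite fewer FLOPs
```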

Sakana AI and NVIDIA researchers, however, have engineered a solution designed to finally bridge this gap. Their work, targeted for publication at ICML 2026, introduces TwELL (Tile-wise ELLPACK), a novel sparse data format coupled with custom CUDA kernels. This co-designed approach aims to unlock the long-promised benefits of unstructured sparsity by making it hardware-aware. This isn’t about reinventing sparsity; it’s about making sparsity fit into the powerful, yet opinionated, execution models of modern accelerators like NVIDIA’s H100 GPUs.

Reshaping Sparsity for Tiled Matrix Multiplications

The core of the challenge lies in how modern GPUs perform matrix multiplications. Architectures like the NVIDIA H100 are built around highly optimized tiled matrix multiplication, executed largely on specialized Tensor Cores. These engines process data in dense blocks, maximizing throughput by keeping their computational units saturated and minimizing data movement. When you introduce unstructured sparsity, you break this density. Instead of operating on contiguous blocks of data, the GPU has to perform conditional checks and gather non-zero elements from scattered memory locations. This leads to underutilization of the compute cores and increased latency.
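The toy NumPy sketch below illustrates the tiled access pattern in question: the output matrix is assembled from dense TILE x TILE sub-products, which is (very roughly) how Tensor Core pipelines consume operands. It is a CPU-side illustration only; real GPU tiling involves shared memory, warps, and register fragments.

```python
import numpy as np

TILE = 16  # toy tile size; real hardware tile shapes are fixed by the architecture

def tiled_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Dense matmul assembled from TILE x TILE sub-products."""
    m, k = A.shape
    _, n = B.shape
    assert m % TILE == 0 and k % TILE == 0 and n % TILE == 0
    C = np.zeros((m, n), dtype=A.dtype)
    for i in range(0, m, TILE):          # row tiles of C
        for j in range(0, n, TILE):      # column tiles of C
            for p in range(0, k, TILE):  # reduction across K tiles
                # each step is a fully dense block product: the shape hardware loves
                C[i:i+TILE, j:j+TILE] += A[i:i+TILE, p:p+TILE] @ B[p:p+TILE, j:j+TILE]
    return C

A = np.random.randn(64, 64).astype(np.float32)
B = np.random.randn(64, 64).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)
```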

TwELL directly confronts this by proposing a new way to pack sparse data. Instead of storing individual non-zero values and their indices in a way that’s convenient for general-purpose CPUs or traditional sparse libraries, TwELL restructures the data. It aims to organize the non-zero elements such that they can be efficiently loaded into the dense blocks expected by tiled matrix multiplication engines. Think of it as carefully arranging irregularly shaped pieces of a puzzle so they can be neatly placed into square boxes for faster processing, rather than trying to fit them individually into arbitrary slots.

The “Tile-wise” aspect is crucial here. TwELL organizes the sparse matrix such that its non-zero elements, when projected onto the tiled structure of the GPU’s compute units, can still form dense sub-regions. This allows the existing, highly optimized matrix multiplication kernels to operate on these dense sub-regions, achieving near-peak performance. The sparsity is compressed into a representation that is efficient for both storage and hardware processing, a critical departure from previous approaches that often optimized for one at the expense of the other.
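Since the exact TwELL layout is not spelled out here, the following NumPy sketch shows the general ELLPACK-family idea at tile granularity: within each small tile of rows, non-zeros are packed left and padded only to that tile’s maximum row occupancy, so each tile becomes a small dense block. Treat the function name and layout details as illustrative assumptions, not the published format.

```python
import numpy as np

TILE = 4  # rows per tile (toy size)

def pack_tilewise_ell(W: np.ndarray):
    """Per tile of rows, left-pack non-zeros and pad to that tile's max row count.

    Returns a list of (values, col_indices) dense blocks, one per row tile.
    Illustrative layout only, not the published TwELL specification.
    """
    tiles = []
    for r0 in range(0, W.shape[0], TILE):
        rows = W[r0:r0 + TILE]
        nnz_cols = [np.nonzero(row)[0] for row in rows]
        width = max((len(c) for c in nnz_cols), default=0)  # pad within this tile only
        vals = np.zeros((rows.shape[0], width), dtype=W.dtype)
        cols = np.zeros((rows.shape[0], width), dtype=np.int32)
        for i, c in enumerate(nnz_cols):
            vals[i, :len(c)] = rows[i, c]
            cols[i, :len(c)] = c
        tiles.append((vals, cols))
    return tiles

W = np.where(np.random.rand(8, 16) > 0.9, np.random.randn(8, 16), 0.0)
for t, (vals, cols) in enumerate(pack_tilewise_ell(W)):
    print(f"tile {t}: dense block of shape {vals.shape}")
```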

Furthermore, Sakana AI and NVIDIA have developed custom CUDA kernels. These aren’t just wrappers around existing libraries; they are tailored specifically to ingest data in the TwELL format. These kernels are designed for high throughput, fusing multiple matrix multiplication operations that are common in LLMs (like those in the feedforward layers, which typically account for the majority of the FLOPs) into a single, efficient execution stream. By compressing sparse weights into the TwELL format and feeding them directly into these fused kernels, the overhead associated with sparse computations is drastically reduced. The goal is to make sparsity transparent to the GPU’s core execution pipeline, so that “doing less computation” actually translates to “doing it faster.”
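A CPU reference for the kind of consumer such kernels implement is sketched below, reusing pack_tilewise_ell and W from the packing example above. A real CUDA kernel would stage each tile’s dense (values, columns) block through shared memory and Tensor-Core-shaped fragments; here we only verify that the packed blocks suffice to reproduce W @ X, with an optional fused activation standing in for FFN-style kernel fusion.

```python
import numpy as np

def twell_like_matmul(tiles, X: np.ndarray, act=None) -> np.ndarray:
    """CPU reference: consume (values, col_indices) tile blocks to compute W @ X."""
    rows_per_tile = tiles[0][0].shape[0]
    out = np.zeros((rows_per_tile * len(tiles), X.shape[1]), dtype=X.dtype)
    for t, (vals, cols) in enumerate(tiles):
        gathered = X[cols]  # one gather per tile: (rows, width, n)
        # dense tile-local contraction; padded slots have vals == 0 and add nothing
        block = np.einsum("rw,rwn->rn", vals, gathered)
        out[t * rows_per_tile:(t + 1) * rows_per_tile] = block
    return act(out) if act is not None else out  # optional fused activation (FFN-style)

X = np.random.randn(16, 5)
assert np.allclose(twell_like_matmul(pack_tilewise_ell(W), X), W @ X)
```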

This open-source initiative provides the data format specifications and the acceleration kernels, allowing the community to integrate these optimizations into their existing LLM training and inference pipelines. While the public API is still stabilizing ahead of the formal publication, the foundational concept emphasizes a co-design philosophy in which data representation and hardware execution are developed in tandem.

The H100 Imprint: Where Gains Materialize

The significant performance improvements reported by Sakana AI and NVIDIA are not hypothetical; they are measured, substantial, and crucially, tied to specific hardware. On NVIDIA H100 GPUs, the TwELL approach has demonstrated impressive results:

  • Batched Inference Speedup: Over 20.5% improvement in inference throughput for batched workloads.
  • Billion-Parameter Models: Up to 30% speedup in inference and 24% speedup in training.

These figures are compelling because they represent a tangible leap in efficiency. For real-time applications, an extra 20.5% inference speedup can mean the difference between a responsive user experience and frustrating lag. For large-scale training, it translates directly to reduced time-to-model and lower operational costs. The gains stem from the efficient utilization of the H100’s powerful Tensor Cores and its advanced memory hierarchy, which are adept at handling the structured sparsity that TwELL enables.
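Some quick, back-of-envelope arithmetic makes these percentages tangible. Only the speedup factors below come from the reported figures; the baseline throughput and training duration are hypothetical.

```python
inference_speedup = 1.205  # 20.5% higher batched inference throughput
training_speedup = 1.24    # 24% faster training

baseline_tokens_per_s = 10_000
print(baseline_tokens_per_s * inference_speedup)  # -> 12,050 tokens/s

baseline_train_days = 10.0
print(baseline_train_days / training_speedup)     # -> ~8.06 days
```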

The focus on H100 is not accidental. Modern GPUs like the H100 have increasingly sophisticated architectures designed to extract maximum performance from dense, regular operations. They achieve this through deep pipelines, massive parallelism, and specialized execution units. TwELL is engineered to exploit these specific architectural features. It transforms the potentially disruptive nature of unstructured sparsity into a format that can be processed almost as efficiently as dense matrices by these specialized units. This hardware-centric optimization is what allows TwELL to overcome the “sparsity paradox” that has plagued previous attempts.

However, this hardware specificity is also the most critical constraint. The reported benefits were observed, and are only validated, on NVIDIA H100 GPUs. While the underlying principles might offer some advantages on other modern NVIDIA architectures with similar tiled matrix multiplication capabilities, the gains there are expected to be significantly smaller. This means that if your production environment does not leverage NVIDIA H100 GPUs, the advertised performance improvements are unlikely to materialize. The efficiency gains are concentrated within a specific, high-end segment of the NVIDIA ecosystem.

The announcement of TwELL has generated positive buzz, with early coverage on technical news outlets highlighting the significant efficiency gains. This suggests that the research community recognizes the fundamental problem TwELL aims to solve and the elegance of its proposed solution. However, broader industry-wide adoption and discussion are still nascent. This initial limited traction could be attributed to the research being targeted for a 2026 publication, or it might indicate a cautious approach from industry practitioners awaiting more mature tooling and broader hardware support.

It’s essential to contextualize TwELL within the landscape of existing LLM optimization techniques. Quantization, for instance, which reduces the precision of model weights and activations (e.g., to 3-4 bits), is a popular method for reducing memory footprint and speeding up inference. SparseGPT and Wanda are prominent one-shot post-training pruning methods that induce unstructured or semi-structured (e.g., 2:4) sparsity in pretrained models. Learning-based global pruning methods, such as Lua-LLM, offer more sophisticated approaches to identifying and removing redundant parameters.
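For reference, the snippet below sketches the simplest form of the low-bit quantization mentioned above: symmetric, per-tensor, round-to-nearest int4. Production methods such as GPTQ are considerably more careful about where the rounding error lands.

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    """Symmetric per-tensor round-to-nearest quantization to the int4 range."""
    scale = np.abs(w).max() / 7.0  # map the largest-magnitude weight to 7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

w = np.random.randn(256).astype(np.float32)
q, scale = quantize_int4(w)
w_hat = q.astype(np.float32) * scale  # dequantize for use in matmuls
print("max abs error:", np.abs(w - w_hat).max())
```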

What sets TwELL apart is that it directly addresses the hardware paradox of unstructured sparsity. While quantization and structured pruning modify the model itself to be more hardware-friendly, TwELL reinterprets unstructured sparsity to fit existing hardware paradigms. It doesn’t require drastic changes to the model’s architecture beyond applying standard pruning techniques; instead, it offers a new way to represent and process the resulting sparse weights. By co-designing a sparse data format (TwELL) with custom CUDA kernels that map efficiently onto tiled matrix multiplication, it aims to achieve speedups without the significant overhead typically associated with unstructured sparsity, cutting energy consumption and memory requirements alongside the inference gains.
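Putting the pieces together, a workflow in the spirit of what the article describes might look like the sketch below: prune with a standard magnitude criterion, then re-pack the survivors tile-wise (reusing the conceptual pack_tilewise_ell from earlier). A real pipeline would substitute the released TwELL format and kernels for the toy packer.

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights until `sparsity` fraction is zero."""
    k = int(w.size * sparsity)
    threshold = np.partition(np.abs(w).ravel(), k)[k]
    return np.where(np.abs(w) >= threshold, w, 0.0).astype(w.dtype)

w = np.random.randn(64, 64).astype(np.float32)
w_sparse = magnitude_prune(w, 0.95)  # ~95% unstructured weight sparsity
tiles = pack_tilewise_ell(w_sparse)  # repack for tile-friendly consumption
print("fraction zero:", np.mean(w_sparse == 0))
```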

When This Innovation Falls Short

Despite its promising advancements, TwELL is not a universal panacea. Its primary limitation, as discussed, is its tight coupling with specific hardware.

The Hard Limit: NVIDIA H100 Dependency

If your deployment infrastructure is not built around NVIDIA H100 GPUs, the substantial performance benefits reported are unlikely to be realized. Attempts to apply the TwELL format and kernels to GPUs with different architectures, particularly those lacking highly optimized tiled matrix multiplication engines or with significantly different memory access patterns, will likely yield diminished or even negative results. This creates a strong dependency on a single hardware vendor and model within the NVIDIA ecosystem.

The Implicit “When to Avoid”: Heterogeneous or Older Deployments

When considering TwELL, explicitly ask:

  • What is my target inference hardware? If it’s not H100, reconsider. Even high-end GPUs from previous generations might not fully exploit TwELL’s advantages.
  • Is my deployment environment diverse? If you need to support a range of hardware, including older NVIDIA cards, AMD GPUs, or even CPUs, TwELL’s benefits will be diluted or nonexistent. The custom CUDA kernels are tied to the CUDA ecosystem.
  • Am I comfortable with hardware-specific optimizations? While powerful, optimizations tied to specific silicon generations can lead to faster obsolescence or migration challenges.

The “Gotcha” of the Sparsity Paradox Revisited

The primary “gotcha” that TwELL aims to solve is the very paradox where sparsity slows down GPUs. Prior to TwELL, if you applied unstructured sparsity algorithms without careful consideration of hardware execution, you would almost certainly encounter increased overhead. TwELL’s design is predicated on overcoming this. Therefore, the failure scenario here is clear: if you fail to adopt TwELL’s specific data format and its associated custom CUDA kernels, you revert to the original problem. This means merely pruning your model and attempting to use standard sparse matrix libraries or generic kernels will likely result in slower performance, not faster. TwELL is not a passive optimization; it requires the active adoption of its co-designed format and acceleration primitives.

Ultimately, TwELL represents a significant stride in translating theoretical computational gains from sparsity into tangible performance improvements for LLMs. By meticulously co-designing a hardware-aware sparse data format with custom CUDA kernels, Sakana AI and NVIDIA are carving a path to unlock the full potential of unstructured sparsity on cutting-edge accelerators. However, its power is inextricably linked to the NVIDIA H100, making it a specialized tool for those operating at the forefront of GPU-accelerated AI, and requiring strict adherence to its specific implementation for any benefits to be seen.
