Swapping iron: making AI code designed for NVIDIA run on Intel Gaudi


There's a growing trend toward deploying LLM solutions on premises, using open-source models and existing hardware. Sometimes, however, the model to be used doesn't match the bare metal it is meant to run on. In the case study below, the Vstorm engineering team was tasked with swapping iron. Without naming our customer, we'll illustrate what that entails by showing how we ported Llama.cpp, built with NVIDIA hardware in mind, to run locally on the Intel Gaudi architecture.

What is Llama.cpp?

Llama.cpp is a groundbreaking open-source project that has revolutionized how we run Large Language Models on personal computers and various hardware setups. At its core, it is a lightweight C/C++ implementation designed to run LLMs with remarkable efficiency and minimal complexity.

The popularity of Llama.cpp is evident in its impressive GitHub metrics, with over 74,500 stars and 10,800 forks, making it one of the most prominent AI infrastructure projects in the open-source community. Its influence extends far beyond direct usage: its core technology has become a fundamental building block for numerous other AI projects, including popular tools like Ollama, effectively establishing Llama.cpp as a cornerstone of local LLM deployment.

What can you run Llama.cpp on?

At the heart of Llama.cpp lies ggml, a specialized C/C++ library designed specifically for Transformer model inference. While ggml started as part of the Llama.cpp project, it has evolved into a powerful foundation that handles the core tensor operations and hardware-acceleration features. Think of ggml as the engine room of Llama.cpp, providing a unified interface for different hardware backends while focusing on efficient Transformer computations.
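
To give a feel for what that unified interface looks like in practice, below is a rough sketch of building and running a tiny compute graph with ggml's C API on the CPU backend. Treat it as an illustration of the programming model rather than exact, current code: function names and signatures have changed between ggml versions, so the calls shown here may differ from what the library exposes today.

```cpp
// Rough sketch (not verbatim ggml sample code): create a context, record a
// small tensor graph, and let the backend execute it. In Llama.cpp the graph
// holds an entire Transformer forward pass instead of a single add.
#include <cstdio>
#include "ggml.h"

int main() {
    // Memory arena for tensor metadata and data.
    struct ggml_init_params params = {};
    params.mem_size   = 16 * 1024 * 1024;   // 16 MiB
    params.mem_buffer = NULL;               // let ggml allocate the arena
    params.no_alloc   = false;

    struct ggml_context * ctx = ggml_init(params);

    // Declare tensors and an operation. Nothing is computed yet; ggml only
    // records the graph of tensor operations.
    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 8);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 8);
    ggml_set_f32(a, 1.5f);
    ggml_set_f32(b, 2.0f);
    struct ggml_tensor * c = ggml_add(ctx, a, b);

    // Build the forward graph and compute it. The backend in use (CPU here;
    // CUDA, Metal, etc. elsewhere) supplies the kernels for each graph node.
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/1);

    printf("c[0] = %f\n", ggml_get_f32_1d(c, 0));  // 3.5
    ggml_free(ctx);
    return 0;
}
```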

Llama.cpp supports an impressive array of backends for different hardware configurations:

  1. CPU backends (standard computations & optimized BLAS operations)
  2. NVIDIA and AMD GPU support (both CUDA & HIP)
  3. Apple-specific optimization using Metal framework
  4. Cross-platform GPU acceleration using Vulkan, OpenCL & Kompute
  5. Specialized hardware support for Intel GPUs, Moore Threads GPUs, and Huawei’s CANN 

One platform was missing at the time of our project, and that was the Intel Gaudi AI accelerators our customer needed to run Llama.cpp on. All we needed to do was roll up our sleeves and make Llama.cpp work on Intel Gaudi, too.

What is Intel Gaudi?

Gaudi accelerators are new specialized chips from Intel designed to speed up deep learning tasks such as training and running large AI models. Although Intel chips were used for neural network acceleration as early as 1993, the market has shifted massively toward NVIDIA since 2012, when its graphics cards were first successfully used to train a deep learning network (AlexNet) for a championship-winning image recognition solution. While NVIDIA is the market leader, Intel has its stake in the game with increasingly powerful Gaudi product offerings.

How Vstorm helps customers leverage AI accelerators

Over the course of the last quarter, we helped customers port their solutions to the Intel architecture. From Stable Diffusion to BERTopic, we worked with ML-based and LLM-based solutions to make them take full advantage of Intel hardware.

Differences between NVIDIA and Intel

Although on the surface both solutions serve the same purpose, their internal architectures are fundamentally different. In principle, both support the same types of operations: matrix and tensor computations, media encoding and decoding, network communication, and native memory. The way data is processed, however, differs. It is like having two people with different educational backgrounds approach the same task: their ways of dealing with it will differ because of their backgrounds and habits of thinking. Similarly, between the Intel Gaudi and NVIDIA CUDA platforms, the differences show up in how memory is managed, how calculations are scheduled, how work is distributed internally, how internal synchronization is organized, and much more.

The two solutions follow different architectural designs. NVIDIA uses the thread as its unit of computation. This makes engineers' lives easier, as it offers a relatively comfortable and efficient environment in which many threads execute the same instruction at the same time, each on its own data. That is what is called SIMT (Single Instruction, Multiple Threads).

  • The NVIDIA platform (based on SIMT) hides a lot of GPU hardware complexity (scheduling, pipelining, synchronization)
  • Engineers don’t need to manage individual cores or vector registers 
  • Developers can think in terms of threads rather than low-level GPU instructions (illustrated in the sketch below).
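
To illustrate, here is a minimal CUDA sketch of the SIMT model (a generic example, not code from the ggml backend): every thread runs the same kernel body, and the only thing distinguishing threads is the index each one derives from its block and thread IDs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// SIMT: one logical thread per output element. The hardware groups threads
// into warps that execute the same instruction in lockstep.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // each thread picks its own element
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Launch enough threads to cover every element; the runtime handles the
    // scheduling, pipelining, and synchronization of the warps for us.
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    vec_add<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);  // 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```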

Intel, on the other hand, approached the matter differently, opting for a programming model that maps more directly onto the hardware: SIMD (Single Instruction, Multiple Data). In Gaudi, SIMD is the core execution model, and it works across vector processing units (VPUs) that operate on wide data types: think 256-bit or 512-bit registers handling 8, 16, or 32 elements at once.
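
The same vector addition seen through the SIMD lens looks less like "many threads" and more like one instruction stream marching over wide registers. The sketch below is plain C++ used purely as an analogy (the chunk width and loop structure are illustrative, not actual TPC-C): on Gaudi, the inner per-lane loop would be a single wide VPU instruction.

```cpp
#include <cstddef>

// SIMD analogy: one instruction stream, and each "step" operates on a whole
// chunk of data at once. WIDTH stands in for the lane count of a wide vector
// register; the inner loop is what one wide instruction would do in one go.
constexpr std::size_t WIDTH = 32;

void vec_add_simd_style(const float* a, const float* b, float* c, std::size_t n) {
    std::size_t i = 0;
    for (; i + WIDTH <= n; i += WIDTH) {
        for (std::size_t lane = 0; lane < WIDTH; ++lane) {  // one wide add
            c[i + lane] = a[i + lane] + b[i + lane];
        }
    }
    for (; i < n; ++i) {                                     // scalar tail
        c[i] = a[i] + b[i];
    }
}
```

The catch is visible right away: the data must be packed so that every chunk is full and uniform, and anything that does not fit the width needs special handling in the tail.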

                   SIMT (NVIDIA)                           SIMD (Intel Gaudi)
  Model            Threads running the same instruction    One instruction runs over wide data
  Abstraction      Thread-centric                          Data-centric
  Divergence       Handled, but costly                     Must be avoided (harder)
  Programmer view  Feels like multithreading               Feels like vector math
  Branching        Flexible (but slows warps)              Limited (must be uniform)

Key Conceptual Differences between SIMT & SIMD

In theory, SIMD (used by Intel Gaudi) is better suited than SIMT (NVIDIA) to AI and dense math workloads, for the following reasons:

  • Higher instruction efficiency: SIMD uses one instruction to operate on a wide data vector, while SIMT must still issue instructions per thread, even if they are all the same.
  • Clearer execution flow: SIMD keeps matrix multiplies and tensor ops on a single, predictable path. SIMT pays a penalty when threads within a warp take different branches, while SIMD does not allow such divergence in the first place (see the sketch after this list).
  • Better dense linear algebra: SIMD shines for the matrix- and tensor-heavy operations found in Transformers, the building blocks of LLMs.
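
To make the divergence point concrete, here is a small, generic CUDA sketch: the first kernel contains a data-dependent branch that forces a warp to walk through both paths whenever its threads disagree, while the second expresses the same logic as a branch-free select, which is the shape a SIMD target such as Gaudi expects.

```cuda
// Divergent SIMT code: threads in the same warp take different paths, so the
// warp serializes both branches and idles the inactive lanes each time.
__global__ void relu_branchy(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (x[i] > 0.0f) {          // data-dependent branch: divergence inside a warp
        y[i] = x[i];
    } else {
        y[i] = 0.0f;
    }
}

// Branch-free formulation: every lane executes the same instruction and the
// condition becomes a select. This is the style SIMD hardware requires, and
// it is also faster on SIMT hardware.
__global__ void relu_uniform(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    y[i] = fmaxf(x[i], 0.0f);   // one uniform instruction, no divergence
}
```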

In reality, though, SIMD is harder to code: it is less flexible and requires packing data into vectors and avoiding branches. NVIDIA's SIMT is more programmer-friendly out of the box. That is why porting Llama.cpp was a worthy challenge for our team!

What’s it like switching iron for LLMs?

Switching the hardware that runs an LLM solution is similar to building one LEGO model using elements from another set. There might not be assembly instructions, and some custom parts might be missing, but with a bit of persistence, it's totally doable.


To make Gaudi run Llama.cpp, we had to create backend support in the ggml library. We identified 37 kernels in ggml that required our team's attention, all of them originally written for NVIDIA CUDA. Porting each kernel to the Intel architecture required writing dedicated TPC-C kernel code for the Gaudi 3 device (HPU), glue code for the host (CPU), and a series of actions to ensure the consistency and correctness of input/output data. At a high level, the work required our engineers to focus on:

  • Working out functional differences between platforms
    While porting the code, our team needed to closely inspect how the existing ggml CUDA implementations work and how each CUDA kernel is structured internally. Because of the different architectural approaches taken by NVIDIA and Intel (SIMT vs. SIMD), some functionality had to be rewritten to fit SIMD constraints. One challenging issue was related to data dependencies, and another to internal kernel synchronization: the CUDA kernel implementations rely on synchronizing blocks of threads that perform various tasks, and such a scheme could not easily be mapped 1:1 onto the Gaudi SIMD architecture.

  • Moving memory structures and memory transfers over to a different architecture
    During our work, the team had to address the way memory is transferred. Because Gaudi and CUDA handle memory differently, transfers had to be reshaped to perform the same operations. The size and structure of the exchanged data are strictly bound to what each kernel demands; the kernels rewritten from CUDA to Gaudi required a different data layout, so the memory exchange had to be adjusted, too.

  • Changing workload distribution across platforms
    Each kernel handles some parallel operations, and executing that code relies on mapping the workload onto physical compute units so the pieces can be processed independently. Every computational problem has to be adjusted to fit the specific constraints of the target architecture. The challenge of adapting CUDA solutions to the Gaudi architecture was to maintain the same level of parallelism, which meant rewriting or restructuring kernel mappings so that the same degree of instruction independence could be achieved. A simplified, hypothetical sketch of this host-side glue follows below.
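
The sketch below gives a heavily simplified, hypothetical picture of that host-side glue for a single operation. The HpuBuffer type and hpu_* helpers are invented placeholders simulated on the CPU, not real Gaudi runtime calls, and the matrix multiply stands in for any of the 37 ported kernels; the point is the recurring pattern of staging memory, launching the device kernel with an explicit work split, and copying results back.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-ins for the Gaudi host runtime (names invented for this
// sketch). In the real glue code these would be vendor API calls for device
// allocation, DMA transfers, and kernel launches; here they are simulated on
// the host so the overall pattern is runnable.
struct HpuBuffer { float* dev; std::size_t count; };

static HpuBuffer hpu_alloc(std::size_t count) {
    return { static_cast<float*>(std::malloc(count * sizeof(float))), count };
}
static void hpu_free(HpuBuffer b)                    { std::free(b.dev); }
static void hpu_h2d(HpuBuffer dst, const float* src) { std::memcpy(dst.dev, src, dst.count * sizeof(float)); }
static void hpu_d2h(float* dst, HpuBuffer src)       { std::memcpy(dst, src.dev, src.count * sizeof(float)); }

// Simulated device kernel. In the real port this is TPC-C code on the Gaudi
// vector engines; the CUDA original expressed its work split as a grid of
// thread blocks, which has to be re-thought as independent, uniform chunks
// for the SIMD hardware.
static void hpu_launch_matmul(HpuBuffer a, HpuBuffer b, HpuBuffer c,
                              int rows, int cols, int inner) {
    for (int r = 0; r < rows; ++r)
        for (int j = 0; j < cols; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < inner; ++k)
                acc += a.dev[r * inner + k] * b.dev[k * cols + j];
            c.dev[r * cols + j] = acc;
        }
}

// Host-side glue for one ported operation: stage the inputs in the layout the
// device kernel expects, launch it, and bring the results back.
void matmul_on_gaudi(const float* a, const float* b, float* out,
                     int rows, int cols, int inner) {
    HpuBuffer da = hpu_alloc((std::size_t)rows * inner);
    HpuBuffer db = hpu_alloc((std::size_t)inner * cols);
    HpuBuffer dc = hpu_alloc((std::size_t)rows * cols);

    hpu_h2d(da, a);                                    // 1. reshaped memory transfers
    hpu_h2d(db, b);
    hpu_launch_matmul(da, db, dc, rows, cols, inner);  // 2. kernel launch / work split
    hpu_d2h(out, dc);                                  // 3. results back to the host
    hpu_free(da); hpu_free(db); hpu_free(dc);
}
```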

Contact us to learn more about the challenges of moving solutions between different AI architectures, and have the Vstorm team help design your on-premise solution on the hardware of your choice!
