Running Llama 3 on Intel AI PCs
How to use Ollama to chat with the Llama 3 8B model on an ASUS Zenbook 14
Introduction
In this blog post, we'll explore how to leverage the power of Intel AI PCs, specifically an ASUS Zenbook with an Intel Core Ultra 7 155H processor and integrated Intel Arc graphics (Xe-LPG architecture), to run Meta's Llama 3 model. Our focus will be on how to set up Ollama to leverage Intel iGPUs.
Specifications of the ASUS Zenbook
Processor: Intel Core Ultra 7 155H
Graphics: integrated Intel Arc GPU (Xe-LPG architecture, 8 Xe-cores), up to 2.25 GHz
NPU: Intel AI Boost
RAM: 32 GB LPDDR5X (shared system memory; up to 16 GB available to the GPU and NPU)
Display: 14" OLED, 2880 x 1800, 120 Hz refresh rate
What is an AI PC, you ask?
Here is an explanation from Intel:
“An AI PC has a CPU, a GPU and an NPU, each with specific AI acceleration capabilities. An NPU, or neural processing unit, is a specialized accelerator that handles artificial intelligence (AI) and machine learning (ML) tasks right on your PC instead of sending data to be processed in the cloud. The GPU and CPU can also process these workloads, but the NPU is especially good at low-power AI calculations. The AI PC represents a fundamental shift in how our computers operate. It is not a solution for a problem that didn’t exist before. Instead, it promises to be a huge improvement for everyday PC usages.”
Why Llama 3?
Llama 3 stands out in the LLM landscape for its robust training on a dataset of 15 trillion tokens and the ability to handle a context length of up to 8K tokens. This makes it particularly effective for complex AI tasks including deep reasoning, extensive dialogue handling, and intricate code generation.
Meta Llama 3 Instruct model benchmark performance, from the official Meta site:
The upper-bound rank for Llama-3-8b-Instruct on the Chatbot Arena leaderboard, based on user votes, is 9 as of today (May 7, 2024):
Setting Up Your Environment
Step 1: System Preparation
To set up your ASUS Zenbook for running ollama with Intel iGPUs, follow these essential steps:
1. Update Intel GPU Drivers: Ensure your system has the latest Intel GPU drivers, which are crucial for optimal performance and compatibility. You can download these directly from Intel's official website. Once you have installed the official drivers, you can also install Intel Arc Control to monitor the GPU.
2. Install Visual Studio 2022 Community edition with C++: Visual Studio 2022, along with the “Desktop development with C++” workload, is required. This prepares your environment for the C++-based extensions used by the Intel SYCL backend that powers accelerated Ollama. You can download VS 2022 Community edition from the official site.
3. Install Miniconda: Miniconda will manage your Python environments and dependencies efficiently, providing a clean, minimal base for your Python setup. Visit Miniconda’s installation page to install Miniconda for Windows.
4. Install Intel oneAPI Base Toolkit components: The oneAPI runtime libraries (specifically Intel’s SYCL runtime, MKL, and oneDNN) are essential for leveraging the performance enhancements offered by Intel's libraries and for ensuring that Ollama can fully utilize the GPU. Install them into a fresh conda environment:
conda create -n ollama_env python=3.11 -y
conda activate ollama_env
conda install libuv -y
pip install dpcpp-cpp-rt==2024.0.2 mkl-dpcpp==2024.0.0 onednn==2024.0.0
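As an optional sanity check, you can confirm the runtime packages landed in the new environment (the filter tokens below simply match the three package names installed above):

pip list | findstr /i "dpcpp mkl onednn"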
By following these steps, your ASUS Zenbook will be primed for running Ollama on the Intel iGPU.
Install Ollama with Intel GPU support
Now that we have set up the environment, Intel GPU drivers, and runtime libraries, we can configure Ollama to leverage the on-chip GPU.
conda activate ollama_env
pip install --pre --upgrade ipex-llm[cpp]
init_ollama # if init_ollama.bat is not available in your environment, restart your terminal
Now that we have installed Ollama, let’s see how to run Llama 3 on your AI PC!
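First, start the Ollama service in its own terminal. The environment variables below follow the ipex-llm quickstart (an assumption here; check the ipex-llm docs for your version): OLLAMA_NUM_GPU=999 requests that all layers be offloaded to the GPU, and ZES_ENABLE_SYSMAN=1 enables device memory monitoring.

conda activate ollama_env
set OLLAMA_NUM_GPU=999
set no_proxy=localhost,127.0.0.1
set ZES_ENABLE_SYSMAN=1
ollama serve

Keep this terminal open; run the commands below in a second terminal (with ollama_env activated there too).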
Pull the Llama 3 8B instruct model from the Ollama repo:
ollama pull llama3:instruct
Now, let’s create a custom llama 3 model and also configure all layers to be offloaded to the GPU.
Here is my Modelfile. The main settings include num_gpu, which is set to 999 to ensure all layers are offloaded to the GPU, and num_ctx, which I set to 8192, the maximum context length supported by Llama 3. Additionally, I've customized the system prompt to add a more playful touch to the assistant (Pika :)). You can find the Modelfile in this repository.
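Here is a minimal sketch of what such a Modelfile looks like (the system prompt below is illustrative; the original is in the repository):

FROM llama3:instruct
PARAMETER num_gpu 999
PARAMETER num_ctx 8192
SYSTEM """You are Pika, a playful and helpful assistant. Keep your answers accurate, but don't be afraid to have fun with them."""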
Now that we have created a custom Modelfile, let’s create a custom model:
ollama create llama3-gpu -f Modelfile.llama3
Let’s see if the model got created:
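Listing the locally available models should now show the new llama3-gpu entry:

ollama list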
Yay! The new model is ready to run.
Finally, let’s run the model.
ollama run llama3-gpu
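If everything is wired up correctly, you’ll get an interactive chat prompt and generation will run on the iGPU. One way to verify the offload (assuming the log format of the llama.cpp backend underneath Ollama; the exact wording varies by version) is to look for the layer-offload line in the terminal running ollama serve, something like:

llm_load_tensors: offloaded 33/33 layers to GPU

You can also keep Task Manager or Intel Arc Control open and watch GPU utilization climb while the model generates.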
Thank you for reading, and as you can see, Llama 3 is running on the iGPU of an Intel AI PC.