Running Llama 3 on Intel AI PCs
How to use Ollama to chat with the Llama 3 8B model on an ASUS Zenbook 14
Introduction
In this blog post, we'll explore how to leverage the power of Intel AI PCs, specifically an ASUS Zenbook with an Intel Core Ultra 7 155H processor and integrated Intel Arc graphics (Xe-LPG architecture), to run Meta's Llama 3 model. Our focus will be on how to set up Ollama to leverage Intel iGPUs.
Specifications of the ASUS Zenbook
Processor: Intel Core Ultra 7 155H
Graphics: integrated Intel Arc GPU (Xe-LPG architecture, 8 Xe-cores), up to 2.25 GHz
NPU: Intel AI Boost
RAM: 32 GB LPDDR5X (shared system memory; up to 16 GB available to the GPU and NPU)
Display: 14" OLED, 2880 x 1800, 120 Hz refresh rate
What is an AI PC, you ask?
Here is an explanation from Intel:
“An AI PC has a CPU, a GPU and an NPU, each with specific AI acceleration capabilities. An NPU, or neural processing unit, is a specialized accelerator that handles artificial intelligence (AI) and machine learning (ML) tasks right on your PC instead of sending data to be processed in the cloud. The GPU and CPU can also process these workloads, but the NPU is especially good at low-power AI calculations. The AI PC represents a fundamental shift in how our computers operate. It is not a solution for a problem that didn’t exist before. Instead, it promises to be a huge improvement for everyday PC usages.”
Why Llama 3?
Llama 3 stands out in the LLM landscape for its robust training on a dataset of 15 trillion tokens and the ability to handle a context length of up to 8K tokens. This makes it particularly effective for complex AI tasks including deep reasoning, extensive dialogue handling, and intricate code generation.
Meta Llama 3 Instruct model benchmark performance, from the official Meta site:
The upper-bound rank for Llama-3-8b-Instruct on the Chatbot Arena leaderboard, based on user votes, is 9 as of today (May 7, 2024):
Setting Up Your Environment
Step 1: System Preparation
To set up your ASUS Zenbook for running ollama with Intel iGPUs, follow these essential steps:
1. Update Intel GPU Drivers: Ensure your system has the latest Intel GPU drivers, which are crucial for optimal performance and compatibility. You can download these directly from Intel's official website. Once you have installed the official drivers, you can also install Intel Arc Control to monitor the GPU.
2. Install Visual Studio 2022 Community edition with C++: Visual Studio 2022, along with the “Desktop development with C++” workload, is required. This prepares your environment for the C++-based extensions used by the Intel SYCL backend that powers accelerated Ollama. You can download VS 2022 Community edition from the official site.
3. Install Miniconda: Miniconda will manage your Python environments and dependencies efficiently, providing a clean, minimal base for your Python setup. Visit Miniconda’s installation page to install Miniconda for Windows.
4. Install Intel oneAPI Base Toolkit components: The oneAPI runtime libraries (specifically Intel’s SYCL runtime, MKL, and oneDNN) are essential for leveraging the performance enhancements offered by Intel's libraries and for ensuring that Ollama can fully utilize the GPU. Install them into a fresh conda environment:
conda create -n ollama_env python=3.11 -y
conda activate ollama_env
conda install libuv -y
pip install dpcpp-cpp-rt==2024.0.2 mkl-dpcpp==2024.0.0 onednn==2024.0.0
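As an optional sanity check, you can confirm the runtime packages landed in the new environment (the filter tokens below simply match the three package names installed above):

pip list | findstr /i "dpcpp mkl onednn"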
By following these steps, your ASUS Zenbook will be primed for running Ollama on the Intel iGPU.
Install Ollama with Intel GPU support
Now that we have set up the environment, Intel GPU drivers, and runtime libraries, we can configure Ollama to leverage the on-chip GPU.
conda activate ollama_env
pip install --pre --upgrade ipex-llm[cpp]
init_ollama # if init_ollama.bat is not available in your environment, restart your terminal
Now that we have installed Ollama, let’s see how to run Llama 3 on your AI PC!
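First, start the Ollama service in its own terminal. The environment variables below follow the ipex-llm quickstart (an assumption here; check the ipex-llm docs for your version): OLLAMA_NUM_GPU=999 requests that all layers be offloaded to the GPU, and ZES_ENABLE_SYSMAN=1 enables device memory monitoring.

conda activate ollama_env
set OLLAMA_NUM_GPU=999
set no_proxy=localhost,127.0.0.1
set ZES_ENABLE_SYSMAN=1
ollama serve

Keep this terminal open; run the commands below in a second terminal (with ollama_env activated there too).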
Pull the Llama 3 8B instruct model from the Ollama repo:
ollama pull llama3:instruct
Now, let’s create a custom llama 3 model and also configure all layers to be offloaded to the GPU.
Here is my Modelfile. The main settings include num_gpu, which is set to 999 to ensure all layers are offloaded to the GPU, and num_ctx, which I set to 8192, the maximum context length supported by Llama 3. Additionally, I've customized the system prompt to add a more playful touch to the assistant (Pika :)). You can find the Modelfile in this repository.
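Here is a minimal sketch of what such a Modelfile looks like (the system prompt below is illustrative; the original is in the repository):

FROM llama3:instruct
PARAMETER num_gpu 999
PARAMETER num_ctx 8192
SYSTEM """You are Pika, a playful and helpful assistant. Keep your answers accurate, but don't be afraid to have fun with them."""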
Now that we have created a custom Modelfile, let’s create a custom model:
ollama create llama3-gpu -f Modelfile.llama3
Let’s see if the model got created:
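Listing the locally available models should now show the new llama3-gpu entry:

ollama list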
Yay! The new model is ready to run.
Finally, let’s run the model.
ollama run llama3-gpu
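If everything is wired up correctly, you’ll get an interactive chat prompt and generation will run on the iGPU. One way to verify the offload (assuming the log format of the llama.cpp backend underneath Ollama; the exact wording varies by version) is to look for the layer-offload line in the terminal running ollama serve, something like:

llm_load_tensors: offloaded 33/33 layers to GPU

You can also keep Task Manager or Intel Arc Control open and watch GPU utilization climb while the model generates.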
Thank you for reading, and as you can see, Llama 3 is running on the iGPU of an Intel AI PC.