12-10-2024
Want to run your own large language model (LLM) on your PC without relying on the cloud? Whether you're after more privacy and control, or just want to mess around with AI, setting up a local LLM is a pretty sweet option. And it's easier than you might think.
In this post, I'll walk you through how to get an LLM up and running on Windows using Ollama.
Ollama is an open-source project that serves as a powerful and user-friendly platform for running LLMs on your local machine. It supports Linux (Systemd-powered distros), Windows, and macOS (Apple Silicon).
It's a command-line interface (CLI) tool that makes it easy to download and run LLMs locally and privately. With just a few commands, you can grab models like Llama 3, Mixtral, and the other models listed in its library.
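For example, once Ollama is installed (we'll get to that below), a typical terminal session looks something like this; the model name is just an example from the library:

ollama pull llama3    # download the Llama 3 model to your machine
ollama run llama3     # start an interactive chat with it
ollama list           # show which models you have downloaded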
Before diving in, make sure your PC can handle running an LLM. Here’s what you’ll need:
You can actually run these models without a dedicated GPU, relying solely on the CPU, but it will be noticeably slower. To make things more manageable, it's a good idea to use smaller models with fewer parameters or more heavily quantized versions, which reduce the load and speed up inference on a CPU.
In this case, I will be using my laptop with an RTX 3050 Ti, 16GB of system RAM, and 4GB of GPU VRAM.
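If you're not sure what your machine has, you can check from a terminal. The commands below assume an NVIDIA GPU and PowerShell:

nvidia-smi    # shows your GPU model, driver version, and how much VRAM it has
Get-CimInstance Win32_ComputerSystem | Select-Object TotalPhysicalMemory    # total system RAM in bytes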
If you want to take advantage of your NVIDIA GPU for running local LLMs faster, installing CUDA is a must. CUDA allows your GPU to do the heavy lifting during inference, speeding things up significantly compared to using a CPU.
Here’s how you can get CUDA installed and set up on Windows:
First, make sure your NVIDIA GPU supports CUDA. Most modern NVIDIA GPUs do, but it’s always good to double-check.
Once you've confirmed your GPU is compatible, head over to the official NVIDIA website and download the CUDA Toolkit.
After downloading the installer, run it and follow the on-screen prompts; the Express installation option is fine for most setups.
cuDNN (CUDA Deep Neural Network library) is a GPU-accelerated library optimized for deep neural network tasks. It works on top of the CUDA platform, leveraging the powerful parallel processing capabilities of NVIDIA GPUs to speed up various deep learning operations like convolutions, pooling, normalization, activation functions, and tensor manipulations. While it's not strictly necessary for running LLMs, using cuDNN can significantly enhance performance and efficiency in deep learning tasks.
Here are the steps to install cuDNN on your system:

1. Download the cuDNN package that matches your CUDA version from the NVIDIA website and extract the archive.
2. Copy the contents of its bin, include, and lib directories into the corresponding CUDA Toolkit folders (C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\vX.X).

After installation, you'll want to verify that CUDA is set up correctly:
nvcc --version
You should see details about the CUDA version installed, confirming it’s ready to go.
Information about the installed CUDA version
Sometimes the CUDA installer doesn't automatically add the path to your system. If nvcc --version doesn't work, add the following directories to your Path environment variable (replace vX.X with your CUDA version number):

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\vX.X\bin
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\vX.X\libnvvp
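If you'd rather do this from the terminal than the Environment Variables dialog, here's a small PowerShell sketch that appends both directories to your user-level Path. The v12.4 below is just an example version; adjust it to match your install:

# Check whether a CUDA directory is already on the Path
$env:Path -split ';' | Select-String 'CUDA'

# Append the CUDA directories to the user-level Path
$cudaBin  = 'C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\bin'
$cudaNvvp = 'C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\libnvvp'
$userPath = [Environment]::GetEnvironmentVariable('Path', 'User')
[Environment]::SetEnvironmentVariable('Path', "$userPath;$cudaBin;$cudaNvvp", 'User')

Open a new terminal afterwards so the updated Path is picked up, then re-run nvcc --version.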
To install Ollama on Windows, first open up its download page.
Ollama download page
Then select Windows and click on Download for Windows (Preview). This should start the download; wait until it's finished, then open the file. You will be greeted with the installer; click on Install and wait for it to finish extracting the files.
After installation, the installer should close itself automatically. You can then open Windows Terminal and verify the installation by running this command:
ollama --version
You should see this output:
Successful Ollama installation
When it comes to selecting a model for running local LLMs, it's essential to consider your hardware configuration. If you have limited resources, such as a lower-end CPU or less RAM, opting for models with fewer parameters is a smart move.
Larger models, while more powerful and capable of producing complex outputs, can be demanding on your system. They require more memory and processing power, which can lead to slower inference times or even crashes if your hardware can't handle the load.
Instead, look for smaller models specifically designed for efficiency. Models like GPT-2 or the smaller versions of Llama are great choices, as they strike a balance between performance and resource consumption. Additionally, more heavily quantized models, which reduce the precision of the calculations, can also be beneficial; they typically consume less memory and run faster, making them more suitable for limited hardware setups. Refer to the Ollama model library for a list of available models. In this case I will be using the 1B (1 billion parameter) version of Llama 3.2.
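As a rough sketch of what that means in practice, picking a smaller or more heavily quantized variant is just a matter of choosing a different tag when pulling the model. The tags below are illustrative; check each model's page in the library for the tags that actually exist:

ollama pull llama3.2:1b                  # small 1B-parameter model, light on RAM and VRAM
ollama pull llama3.2:3b                  # larger 3B variant, needs more memory
ollama pull llama3.1:8b-instruct-q4_0    # a tag that pins an explicit quantization level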
After you have chosen a model you want to use, we can proceed to download and run it. You can also check the details about the model and how to run it by visiting its Ollama page, but to keep things short we can just run this command:
ollama run llama3.2:1b
After running this command, Ollama will download the model you specified and prepare it for inference; wait until this process finishes.
Downloading the model
After it's finished, you should be greeted with an input prompt; at this point your model has been successfully deployed. You can start typing prompts and the model should return a response.
Chatting with the model
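Inside the chat, typing /? lists the built-in commands and /bye exits back to the terminal. A minimal exchange looks roughly like this (the model's reply is left out):

>>> Hi, introduce yourself in one sentence.
...model response...
>>> /bye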
You can monitor the hardware usage by asking the model to generate a longer text, for example:
Generate me a 1000 words story
Then right-click on your taskbar and open up Task Manager.
Monitor system usage while running the model
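You can also check from the terminal whether the model is actually running on your GPU. Recent Ollama versions include a ps subcommand for this, and nvidia-smi shows live VRAM usage:

ollama ps     # lists loaded models and whether they run on the CPU, the GPU, or a mix of both
nvidia-smi    # shows current VRAM usage and GPU utilization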
And there you go—you just set up and ran a local LLM on Windows using Ollama. Whether you’re experimenting, building a cool project, or just curious about how AI works, having your own LLM on your machine gives you tons of freedom to do whatever you want.
While using Ollama from the terminal is a great starting point, there's a whole lot more you can do with it. Ollama can be paired with a user-friendly web interface that lets you interact with your models visually, so you can enter prompts, adjust settings, and view outputs without diving into the command line each time. A web UI like this is perfect for those who prefer a graphical approach or want to quickly experiment with various prompts and models without getting bogged down in terminal commands.
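One popular companion is Open WebUI, which connects to your local Ollama instance. As a sketch, if you have Docker Desktop installed, the launch command from its documentation looks roughly like this (check the Open WebUI docs for the current version):

# Run Open WebUI in Docker and let it reach the Ollama server on the host
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main

Once the container is up, open http://localhost:3000 in your browser and it should pick up your local models.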
Additionally, if you're looking to integrate Ollama into your applications, you can use its API to make calls to your local models, connecting your Ollama setup with web apps, mobile apps, or any other software that can send HTTP requests. This opens up the possibility of creating a chatbot on your website that pulls responses directly from your local model, letting you build all sorts of applications powered by LLMs without needing to host anything in the cloud. Make sure to check its documentation for more details!
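As a rough sketch, here's what a call to the local API looks like from PowerShell. Ollama listens on http://localhost:11434 by default, and the model name and prompt below are just examples:

# Ask the local Llama 3.2 1B model a question via the /api/generate endpoint
$body = '{"model": "llama3.2:1b", "prompt": "Why is the sky blue?", "stream": false}'
Invoke-RestMethod -Uri 'http://localhost:11434/api/generate' -Method Post -ContentType 'application/json' -Body $body

That's all for this post, bye bye.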