Unlike OpenAI and Google, Meta is taking a very welcomed open approach to Large Language Models (LLMs). Similarly to Stability AI’s now ubiquitous diffusion models, Meta has released their newest LLM, Llama 2, under a new permissive license. This license allow for commercial use of their new model, unlike the previous research-only license of Llama 1. This means that anyone, anywhere can use Llama 2 to do whatever they want (provided that its legal in your jurisdiction).
Great! How do I get it?
You do have to fill out a form with Meta to get access, but once that’s done you have a license to use Llama 2 for whatever you want! Once that’s done you can also sign up on HuggingFace for access so you don’t have to re-request a link every 24 hours.
How do I run it?
The official way to run Llama 2 is via their example repo and in their recipes repo, however this version is developed in Python. While I love Python, its slow to run on CPU and can eat RAM faster than Google Chrome.
My preferred method to run Llama is via ggerganov’s llama.cpp. This pure-C/C++ implementation is faster and more efficient than its official Python counterpart, and supports GPU acceleration via CUDA and Apple’s Metal. This significantly speeds up inference on CPU, and makes GPU inference more efficient. For example, here is Llama 2 13b Chat HF running on my M1 Pro Macbook in realtime.
It can even be built with MPI support for running massive models across multiple computers in a cluster!
Prerequisites
- Make
- A C Compiler
That’s it! Llama.cpp was designed to be a zero dependency way to run AI models, so you don’t need a lot to get it working on most systems!
Building
First, open a terminal, then clone and change directory into the repo.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
Once that is done, you can build with make:
make
This builds the version for CPU inference only. I can’t find any information on running with GPU acceleration on Windows, so for now its probably faster to run the original Python version with GPU acceleration enabled for those who prefer Windows. However, if you lack a good enough GPU or don’t want to deal with the hassle of setting up all the Python dependencies, this is the fastest to set up and run option.
Building with GPU Acceleration
MacOS via Metal
If you are on MacOS, to build with Metal support, run the following.
make clean # if you already built it
LLAMA_METAL=1 make
Linux via CUDA
First, verify your GPU is on the list of supported CUDA GPUs.
Then, install the CUDA Toolkit for your appropriate distro. Once that is done, you can build llama.cpp with the following:
make clean # if you already built it
make LLAMA_CUBLAS=1
Linux via OpenCL
If you aren’t running a Nvidia GPU, fear not! GGML (the library behind llama.cpp) has support for acceleration via CLBlast, meaning that any GPU that supports OpenCL will also work (this includes most AMD GPUs and some Intel integrated graphics chips). See the OpenCL GPU database for a full list.
First, install the OpenCL SDK and CLBlast. Once that is done, you can build with the following command:
make clean # if you already built it
make LLAMA_CLBLAST=1
Getting Weights
Meta did not officially release GGML weights for Llama 2, however a community member, TheBlokeAI, released GGML formatted weights on his HuggingFace page. Here is all the ones he released.
No 70B parameter GGML model weights are available yet, however 7B and 13B are more than enough to experiment with!
Weight Types
You’ll notice that the files for those models have a lot of options, all ending in .bin
with things like .q4_0
and q3_K_M
thrown in. Those are the different quantization methods available for the models. Quantization is the process of reducing the number of bits used by the models, reducing size and memory use. You should experiment with each one and figure out which fits your use case the best, but for my demo above I used llama-2-13b-chat.ggmlv3.q4_1.bin
.
Running the Models
Once you have the weights downloaded, you should move them near the llama.cpp directory. I used a models
folder within the llama.cpp repo. For example, assuming you are already in the llama.cpp repo:
mkdir models
cd models
wget https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q4_1.bin
cd ..
Once that is complete, you can run the model on CPU with the following:
./main -t 10 -m ./models/llama-2-13b-chat.ggmlv3.q4_1.bin --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story about llamas\n### Response:"
You should change 10
to the number of physical cores you system has. For example, if you have a 8 core system with 16 threads, you should set the number to 8.
There will be a warning that pops up saying that the model doesn’t support more than 2048 tokens, however that is incorrect and will probably be fixed in a future version of llama.cpp. Llama 2 supports contexts of up to 4096 tokens, the same as GPT-3 and GPT-3.5.
Running with GPU Acceleration
MacOS via Metal
./main -ngl 1 -n 128 -m ./models/llama-2-13b-chat.ggmlv3.q4_1.bin --color -c 500 -b 192 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write me a Python program that takes in user input and greets the user with their name.\n### Response:"
Notice that I changed the number after -c
from 4096
to 500
. I was running out of memory running on my Mac’s GPU, decreasing context size is the easiest way to decrease memory use.
Linux via CUDA
If you want to fully offload to GPU, set the -ngl
value to an extremely high number.
./main -ngl 15000 -n 128 -m ./models/llama-2-13b-chat.ggmlv3.q4_1.bin --color -c 500 -b 192 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write me a Python program that takes in user input and greets the user with their name.\n### Response:"
You can experiment with much lower numbers and increase until your GPU runs out of VRAM.
Linux via OpenCL
The only difference between running the CUDA and OpenCL versions is that when using the OpenCL versions you have to set platform and/or devices at runtime. Here are some examples.
GGML_OPENCL_PLATFORM=1 ./main ... # everything after ./main is the same as CUDA
GGML_OPENCL_DEVICE=2 ./main ...
GGML_OPENCL_PLATFORM=Intel ./main ...
GGML_OPENCL_PLATFORM=AMD GGML_OPENCL_DEVICE=1 ./main ...
Running Interactively
You can run any models show in a ChatGPT-like interactive mode right from within your terminal! Here is how to do it.
Windows
# assuming you are in the llama.cpp repo
set MODEL="path/to/model.bin"
.\examples\chat-13B.bat
Linux/MacOS
MODEL="path/to/model.bin" ./examples/chat-13B.sh
Conclusion
Llama 2 is an exciting step forward in the world of open source AI and LLMs. We've covered everything from obtaining the model, building the engine with or without GPU acceleration, to running the models interactively. This guide should provide you with a solid foundation to explore and experiment with Llama 2, whether you're a hobbyist, a researcher, or a business looking to leverage the power of AI.
However, we understand that implementing AI solutions can be a complex task, especially when it comes to integrating them into existing workflows or products. That's where we, at TimeSurge Labs, come in. We specialize in AI consulting, development, internal tooling, and LLM hosting. Our mission is to handle AI so you can focus on your business. We offer bespoke integration services, working with you to integrate our AI into your existing workflow or products. Whether you prefer fully local, hybrid, or cloud-based AI solutions, we've got you covered.
Our products, such as Searchbase and OttoDocs, are designed to increase productivity and customer satisfaction. We also offer paid support plans for custom integrations, additional features, and support. Our team of passionate AI experts is dedicated to building the future of AI and helping your business thrive in this rapidly changing industry.
If your company needs AI consulting, contracting, or education, don't hesitate to reach out to us. Let's explore how we can help you find your AI workflow. Contact us today at TimeSurge Labs!
Top comments (2)
The 70B model is available now, but you need 8 GPU cards to run it. On an AWS g4dn.metal instance you have that (plus 96 CPU cores). However, with monitoring nvidia-smi I see that my GPUs are only 35% utilized with the 70B model (and less with the smaller models). I also notice that if leaving -t unspecified it uses 96 threads and this actually slows things down drastically. I found that -t 4 is about as good as it gets, leaving me with 92 CPU cores that I can't use but pay dearly for! Any idea how we can use the resources more fully or what causes the apparent contention with the high CPU thread count? And why we can't fully use the GPUs to at least 80%?
If you are running at that scale, it may be better to just use HuggingFace Transformers with all the optimizations (cross GPU inference via Optimum) or host an OpenAI Compatible server across all 8 GPUs with vLLM.
Llama.cpp is more about running LLMs on machines that otherwise couldn't due to CPU limitations, lack of memory, GPU limitations, or a combination of any limitations. If you are able to afford a machine with 8 GPUs and are going to be running it at scale, using vLLM or cross GPU inference via Transformers and Optimum are your best options.