Hello, today we are going to learn how to deploy Llama 2 on a server with an ARM64 environment, such as the Ampere A1 instances offered by Oracle Cloud. It’s worth noting that everything here also works on AMD64.
The program in question is called llama.cpp, and it has bindings for multiple environments:
- Python: abetlen/llama-cpp-python
- Go: go-skynet/go-llama.cpp
- Node.js: withcatai/node-llama-cpp, hlhr202/llama-node
- Ruby: yoshoku/llama_cpp.rb
- Rust: mdrokz/rust-llama.cpp
- C#/.NET: SciSharp/LLamaSharp
- Scala 3: donderom/llm4s
- Clojure: phronmophobic/llama.clj
- React Native: mybigday/llama.rn
- Java: kherud/java-llama.cpp
In our case, we are going to use the Python one: https://github.com/abetlen/llama-cpp-python
First, we set up the project for Docker Compose with the following folder structure (note that `models` lives inside `config`, and `Dockerfile` here is a folder, not a file):
- app
- config
  - models
- Dockerfile
With this folder structure created, let’s add a file called `docker-compose.yml` in the root with the following content:
version: "3.1" services: llama2: build: context: ./Dockerfile dockerfile: container_python restart: unless-stopped container_name: llama2 working_dir: /app volumes: - ./app:/app/code - ./config/models:/app/models
Once created, we will add the file named `container_python` to the `Dockerfile` folder with the following content:
```dockerfile
# syntax=docker/dockerfile:1
FROM python:3.11.3

RUN apt-get update -y
RUN apt-get install -y python3-pip
RUN pip install --upgrade pip
RUN pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir

# Run a command to keep the container running
CMD tail -f /dev/null
```
This gives us a Python 3.11.3 environment and installs the `llama-cpp-python` library, using `--force-reinstall`, `--upgrade`, and `--no-cache-dir` so that the library is built fresh when the image is built.
Now, let’s create the Python script in the `app` directory and name it `code.py`:
```python
from llama_cpp import Llama

llm = Llama(model_path="./../models/7b.Q4_K_S.gguf")
output = llm("Q: What is a dog? A: ", max_tokens=32, stop=["Q:", "\n"], echo=True)
print(output)
```
In my case, I am pointing to the model `7b.Q4_K_S.gguf`, which we will download in the next step, and asking it, “What is a dog?”
You can also simplify the call like this: `output = llm("What is a dog?")`
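If you only want the generated text instead of the whole response object, note that `llama-cpp-python` returns an OpenAI-style completion dict, so the answer lives under `choices[0]["text"]`. A minimal sketch, using the same model path and prompt as above:

```python
from llama_cpp import Llama

# Same model path as in code.py (run from the /app/code directory inside the container)
llm = Llama(model_path="./../models/7b.Q4_K_S.gguf")

output = llm("Q: What is a dog? A: ", max_tokens=32, stop=["Q:", "\n"])

# The generated text is in choices[0]["text"]; usage holds the token counts
print(output["choices"][0]["text"])
print(output["usage"])
```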
Now, let’s download this model. Note that recent versions of `llama.cpp` (and therefore `llama-cpp-python`) only support models in the `.gguf` format. You can download compatible models from Hugging Face: https://huggingface.co
Here’s the model I used: https://huggingface.co/TheBloke/Llama-2-Coder-7B-GGUF/tree/main
Specifically, the model is `llama-2-coder-7b.Q4_K_S.gguf`. Download it and add it to the `config/models` folder. Make sure to rename it to `7b.Q4_K_S.gguf` so that it loads correctly.
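If you prefer to download it from the command line on the server, something like this should work (the exact `resolve` URL is an example and may change if the repository is reorganized):
wget https://huggingface.co/TheBloke/Llama-2-Coder-7B-GGUF/resolve/main/llama-2-coder-7b.Q4_K_S.gguf -O config/models/7b.Q4_K_S.gguf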
Finally, the folder structure looks like this:

```
├── app
│   └── code.py
├── config
│   └── models
│       └── 7b.Q4_K_S.gguf
├── Dockerfile
│   └── container_python
└── docker-compose.yml
```
Now, you can run your container:
docker-compose up -d
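The first build can take a while, since pip compiles llama.cpp from source inside the image. You can check that the container is up with:
docker-compose ps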
Once it’s up, the container won’t do anything. You have to access it and run the code manually.
To do that, you can use the following commands:
docker exec -it llama2 /bin/bash
cd code
python code.py
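Alternatively, you can run the script in a single command without opening a shell (this assumes the container name and paths from the compose file above):
docker exec -it -w /app/code llama2 python code.py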
It takes some time. In my case, the model responded with a dog emoji.
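If the answer is too short for your taste, you can also play with the generation parameters. A minimal sketch, with parameter names as exposed by `llama-cpp-python` and values that are just examples:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./../models/7b.Q4_K_S.gguf",
    n_ctx=2048,  # context window size
)

output = llm(
    "Q: Explain in one sentence what a dog is. A: ",
    max_tokens=128,   # allow a longer answer than the 32 tokens used earlier
    temperature=0.7,  # higher values give more varied answers
    top_p=0.9,        # nucleus sampling
    stop=["Q:"],
)
print(output["choices"][0]["text"])
```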
If you don’t like this model, you can search for other compatible models and try them.
Here’s a list of all compatible models: https://github.com/ggerganov/llama.cpp