vLLM is an inference server optimized with PagedAttention that supports models such as Llama 3, Mistral, Gemma, Phi, Qwen, and more.
It exposes an OpenAI-compatible API, which makes integration straightforward.
We will create the Docker Compose file that lets us deploy it:
File: docker-compose.yml
version: "3.9" services: vllm: image: vllm/vllm-openai:latest container_name: vllm restart: unless-stopped ports: - "8000:8000" # OpenAI API port environment: - MODEL_NAME=meta-llama/Meta-Llama-3-8B-Instruct # Change to any available model on Hugging Face, for example: # - mistralai/Mistral-7B-Instruct-v0.3 # - meta-llama/Meta-Llama-3-8B # - TheBloke/Llama-2-7B-Chat-GGUF (if you use quantized variants) # - huggingface_token=your_token (optional if the model is private) volumes: - ./models:/root/.cache/huggingface/hub # local cache of models command: > --model $(MODEL_NAME) --port 8000 --host 0.0.0.0 --max-num-batched-tokens 4096 --tensor-parallel-size 1 --gpu-memory-utilization 0.90
How to use it
- Save the file as docker-compose.yml.
- Make sure Docker and Docker Compose are installed.
- Run: docker compose up -d
- The container will download the model automatically from Hugging Face (the first run may take a while); the snippet after this list shows how to follow its progress.
- Once started, you will have an API at http://localhost:8000/v1, compatible with the OpenAI format (you can use curl, Postman, LangChain, n8n, OpenWebUI…).
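To follow the download and confirm the server is ready, you can tail the container logs and query the model list endpoint; both commands assume the service name and port from the file above:
docker compose logs -f vllm           # watch the model download and server startup
curl http://localhost:8000/v1/models  # lists the loaded model once the server is up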
Test it
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello, who are you?"}]
  }'
You should receive a JSON response from the model.
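If you only want the assistant's text, you can pipe the response through jq (assumed to be installed), exactly as you would with the OpenAI API:
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello, who are you?"}]
  }' | jq -r '.choices[0].message.content'   # extract just the generated text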
If we want the container to access the GPU, we need to use the following Docker Compose file instead:
services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm
    restart: unless-stopped
    # Important: configure the NVIDIA runtime for GPU access
    runtime: nvidia
    ports:
      - "8000:8000"  # OpenAI API port
    environment:
      - MODEL_NAME=meta-llama/Llama-3.2-3B
      # Change to any available model on Hugging Face, for example:
      # - mistralai/Mistral-7B-Instruct-v0.3
      # - meta-llama/Meta-Llama-3-8B
      # - meta-llama/Meta-Llama-3-70B
      # - meta-llama/Llama-3.2-3B
      # - Qwen/Qwen3-0.6B
      # - TheBloke/Llama-2-7B-Chat-GGUF (if using quantized variants)
      # - HUGGING_FACE_HUB_TOKEN=your_token (optional if the model is private)
      # - OPENAI_API_KEY=${OPENAI_API_KEY}
      # - HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}
    volumes:
      - ./model:/root/.cache/huggingface
    command: >
      --model meta-llama/Llama-3.2-3B
      --port 8000
      --host 0.0.0.0
      --max-num-batched-tokens 4096
      --tensor-parallel-size 1
      --gpu-memory-utilization 0.80
      --dtype float16
      --max-model-len 4096
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    ipc: host
    networks:
      - docker-network

networks:
  docker-network:
    driver: bridge
    external: true
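This file assumes the NVIDIA Container Toolkit is installed on the host and that the external network docker-network already exists. A quick pre-flight check could look like this (the CUDA image tag is only an example; use any tag available on Docker Hub):
docker network create docker-network   # create the external network once, if it does not exist yet
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi   # is the GPU visible to Docker?
docker compose up -d                   # then start vLLM with GPU access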
