Using vLLM with Docker for Deploying Our LLM Models in Production

Reading time: 2 minutes

vLLM is an optimized inference server (it uses PagedAttention) that supports models such as Llama 3, Mistral, Gemma, Phi, Qwen, and more.


It exposes an OpenAI-compatible API, which makes integration straightforward.

We will create the Docker Compose file that lets us deploy it:

File: docker-compose.yml

version: "3.9" services: vllm: image: vllm/vllm-openai:latest container_name: vllm restart: unless-stopped ports: - "8000:8000" # OpenAI API port environment: - MODEL_NAME=meta-llama/Meta-Llama-3-8B-Instruct # Change to any available model on Hugging Face, for example: # - mistralai/Mistral-7B-Instruct-v0.3 # - meta-llama/Meta-Llama-3-8B # - TheBloke/Llama-2-7B-Chat-GGUF (if you use quantized variants) # - huggingface_token=your_token (optional if the model is private) volumes: - ./models:/root/.cache/huggingface/hub # local cache of models command: > --model $(MODEL_NAME) --port 8000 --host 0.0.0.0 --max-num-batched-tokens 4096 --tensor-parallel-size 1 --gpu-memory-utilization 0.90 

How to use it

  1. Save the file as docker-compose.yml.
  2. Make sure Docker and Docker Compose are installed.
  3. Run: docker compose up -d
  4. The container will download the model automatically from Hugging Face (the first start may take a while; see the log command after this list).
  5. Once started, you will have an API at http://localhost:8000/v1, compatible with the OpenAI format (you can use curl, Postman, LangChain, n8n, OpenWebUI…).
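To follow the model download and confirm the server is up, you can tail the container logs and query the models endpoint; a quick sketch:

# Follow the model download and server startup
docker compose logs -f vllm

# Once the server is ready, list the models being served
curl http://localhost:8000/v1/models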

Test it

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello, who are you?"}]
  }'

You should receive a JSON response from the model.
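Because the API follows the OpenAI format, the official openai Python package can talk to it directly as well. A minimal sketch, assuming the default setup above (the api_key is a dummy value: vLLM only checks it if you start the server with --api-key):

from openai import OpenAI

# Point the official OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello, who are you?"}],
)
print(response.choices[0].message.content)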

If we want the container to access the GPU, we have to use the following Docker Compose file:

services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm
    restart: unless-stopped
    # Important: configure the NVIDIA runtime for GPU access
    runtime: nvidia
    ports:
      - "8000:8000"   # OpenAI API port
    environment:
      - MODEL_NAME=meta-llama/Llama-3.2-3B
      # Change to any model available on Hugging Face, for example:
      # - mistralai/Mistral-7B-Instruct-v0.3
      # - meta-llama/Meta-Llama-3-8B
      # - meta-llama/Meta-Llama-3-70B
      # - meta-llama/Llama-3.2-3B
      # - Qwen/Qwen3-0.6B
      # - TheBloke/Llama-2-7B-Chat-GGUF (if using quantized variants)
      # - HUGGING_FACE_HUB_TOKEN=your_token   # optional, if the model is private
      # - OPENAI_API_KEY=${OPENAI_API_KEY}
      # - HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}
    volumes:
      - ./model:/root/.cache/huggingface
    command: >
      --model meta-llama/Llama-3.2-3B
      --port 8000
      --host 0.0.0.0
      --max-num-batched-tokens 4096
      --tensor-parallel-size 1
      --gpu-memory-utilization 0.80
      --dtype float16
      --max-model-len 4096
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    ipc: host
    networks:
      - docker-network

networks:
  docker-network:
    driver: bridge
    external: true
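Two quick notes on this setup: the host needs the NVIDIA Container Toolkit installed, and because docker-network is declared as external, it must already exist before you bring the stack up. A short sketch of the extra commands (if the runtime is configured correctly, nvidia-smi inside the container should list your GPU):

# Create the external network once (the compose file expects it to exist)
docker network create docker-network

# Start the stack and check that the container actually sees the GPU
docker compose up -d
docker compose exec vllm nvidia-smi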
