Large Language Models (LLMs) have revolutionized AI applications, but deploying them efficiently for inference remains challenging. This guide demonstrates how to use vLLM, an open-source library for high-throughput LLM inference, on cloud GPU servers to dramatically improve inference performance and resource utilization.
What is vLLM?
vLLM is a high-performance library for LLM inference and serving that offers:
- Efficient Memory Management: Uses PagedAttention for optimized KV cache management
- Distributed Inference: Supports tensor parallelism and pipeline parallelism across multiple GPUs and nodes
- Continuous Batching: Processes requests dynamically for higher throughput
- OpenAI-Compatible API: Offers drop-in replacement for OpenAI API users
- Multiple Model Formats: Supports a wide range of model architectures and quantization methods
In this guide, we’ll explore how to deploy vLLM on GPU servers for optimal performance.
Getting Started
Step 1: Provision a GPU Server
Before diving into vLLM, we need to provision a suitable GPU server and confirm the NVIDIA stack is working:
- Provision a GPU server with NVIDIA drivers pre-installed.
- Connect to your server:
$ ssh -i ~/.ssh/your_key.pem user@your_server_ip
- Verify your GPU is correctly recognized:
$ nvidia-smi
This command should display your GPU model, memory capacity, and driver version.
Step 2: Choose Your Deployment Method
You can deploy vLLM using either Docker (recommended for simplicity) or a direct installation. Let’s explore both options:
Option A: Using Docker (Recommended)
Docker provides a simple way to deploy vLLM without worrying about dependencies or environment setup. First, install and configure the NVIDIA Container Toolkit (the commands below assume NVIDIA's apt repository is already configured on the system):
# Install NVIDIA Container Toolkit
sudo apt-get install -y nvidia-container-toolkit
# Configure Docker to use NVIDIA runtime
sudo nvidia-ctk runtime configure --runtime=docker
# Restart Docker service
sudo systemctl restart docker
Now you can run vLLM using the official Docker image that comes with everything pre-installed:
# Run vLLM with a Qwen model using Docker
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model Qwen/Qwen2.5-14B-Instruct
This command:
- Mounts your local HuggingFace cache to avoid redundant downloads
- Exposes port 8000 for the API server
- Uses --ipc=host to share memory between processes (required for tensor parallelism)
- Loads the Qwen2.5-14B-Instruct model
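After launching the container, downloading and loading the model weights can take several minutes. A small readiness probe against the OpenAI-compatible /v1/models endpoint lets scripts block until the server accepts requests. This is a minimal sketch; the URL, timeout values, and function names are our own assumptions for a local deployment:

```python
import json
import time
import urllib.error
import urllib.request

def wait_for(predicate, timeout=300.0, interval=2.0):
    """Poll `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

def server_ready(base_url="http://localhost:8000"):
    """True if the vLLM server answers /v1/models with a non-empty model list."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
            return bool(json.load(resp).get("data"))
    except (urllib.error.URLError, OSError, ValueError):
        return False
```

Calling wait_for(server_ready) returns True once the endpoint lists a model, or False after the timeout expires.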
Option B: Direct Installation
Alternatively, you can install vLLM directly:
# Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create a new virtual environment with Python 3.12
uv venv myenv --python 3.12 --seed
source myenv/bin/activate
# Install vLLM
uv pip install vllm
Running Inference with vLLM
vLLM offers two primary modes of operation:
- Offline Inference: For batch processing of prompts
- Online Serving: For real-time API serving
Let’s explore both approaches.
Offline Batch Inference
For batch processing tasks, we can use vLLM’s Python API:
from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(model="Qwen/Qwen2.5-14B-Instruct")

# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512
)

# List of prompts to process
prompts = [
    "Write a short poem about artificial intelligence.",
    "Explain quantum computing in simple terms.",
    "What are three ways to improve cloud infrastructure?",
    "Summarize the history of machine learning in two paragraphs."
]

# Generate outputs
outputs = llm.generate(prompts, sampling_params)

# Process results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt}")
    print(f"Generated text: {generated_text}")
    print("-" * 50)
Online Serving with OpenAI-Compatible API
vLLM includes an OpenAI-compatible API server, making it a drop-in replacement for applications using OpenAI’s API:
# If using direct installation:
vllm serve Qwen/Qwen2.5-14B-Instruct --dtype auto
# If using Docker:
docker run --runtime nvidia --gpus all -p 8000:8000 --ipc=host \
vllm/vllm-openai:latest --model Qwen/Qwen2.5-14B-Instruct --dtype auto
Once the server is running, you can interact with it using any HTTP client:
import requests
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "Qwen/Qwen2.5-14B-Instruct",
        "prompt": "Write a function in Python to check if a string is a palindrome.",
        "max_tokens": 250,
        "temperature": 0.7
    }
)
print(response.json())
Or use the OpenAI Python client:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # Unless you've set an API key
)

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-14B-Instruct",
    messages=[
        {"role": "user", "content": "Explain how to implement a binary search algorithm."}
    ]
)
print(completion.choices[0].message.content)
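For interactive applications you will usually want streaming: pass "stream": True in the request body (or stream=True with the OpenAI client) and the server emits Server-Sent Events, one `data: {...}` line per token chunk. Below is a minimal parser sketch for the raw completions stream; the helper name is our own, and the field layout assumed is the standard OpenAI streaming format:

```python
import json

def extract_stream_text(sse_line: str) -> str:
    """Pull the generated text out of one 'data: {...}' SSE line.

    Returns "" for keep-alives, the [DONE] sentinel, or non-data lines.
    """
    line = sse_line.strip()
    if not line.startswith("data:"):
        return ""
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return ""
    chunk = json.loads(payload)
    return chunk["choices"][0].get("text", "")
```

With requests, iterate over response.iter_lines() on a stream=True request, decode each line to a string, and feed it through this helper to print tokens as they arrive.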
Optimizing Performance
Choosing the Right Data Type
vLLM supports various precision formats. Lower precision reduces memory usage but may slightly impact quality:
# Using direct installation:
# Use bfloat16 precision (good balance between accuracy and memory)
vllm serve Qwen/Qwen2.5-14B-Instruct --dtype bfloat16
# For maximum memory efficiency (may reduce quality)
vllm serve Qwen/Qwen2.5-14B-Instruct --dtype float16
# Using Docker:
docker run --runtime nvidia --gpus all -p 8000:8000 --ipc=host \
vllm/vllm-openai:latest --model Qwen/Qwen2.5-14B-Instruct --dtype bfloat16
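To see why precision matters, note that weight memory is roughly parameter count times bytes per parameter (the KV cache and activations come on top of this). A back-of-the-envelope sketch, ignoring vLLM's own overhead:

```python
# Approximate bytes per parameter for common precisions
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate GPU memory (GB) needed just for the model weights."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

# A 14B-parameter model in bfloat16 needs about 28 GB for weights alone
print(round(weight_memory_gb(14e9, "bfloat16")))  # → 28
```

The same arithmetic explains the appeal of quantization: the same 14B model in int8 drops to roughly 14 GB of weights.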
Scaling Across Multiple GPUs
For larger models or higher throughput, vLLM can leverage multiple GPUs using tensor parallelism, which splits the model's weights across the GPUs so you can load larger models or increase throughput:
# Using direct installation:
vllm serve Qwen/Qwen2.5-72B-Instruct --tensor-parallel-size 4 --dtype bfloat16
# Using Docker:
docker run --runtime nvidia --gpus all -p 8000:8000 --ipc=host \
vllm/vllm-openai:latest --model Qwen/Qwen2.5-72B-Instruct \
--tensor-parallel-size 4 --dtype bfloat16
Enabling Quantization
Quantization can significantly reduce memory usage with minimal quality impact:
# Using direct installation:
# Using GPTQ Int8 quantization
vllm serve Qwen/Qwen2.5-14B-Instruct-GPTQ-Int8 --quantization gptq
# Using AWQ quantization
vllm serve Qwen/Qwen2.5-14B-Instruct-AWQ --quantization awq
# Using Docker:
docker run --runtime nvidia --gpus all -p 8000:8000 --ipc=host \
vllm/vllm-openai:latest --model Qwen/Qwen2.5-14B-Instruct-GPTQ-Int8 \
--quantization gptq
Adjusting Memory Usage
Control memory usage with these parameters:
# Using direct installation:
# Set maximum model context length
vllm serve Qwen/Qwen2.5-14B-Instruct --max-model-len 4096
# Control GPU memory utilization (range 0-1)
vllm serve Qwen/Qwen2.5-14B-Instruct --gpu-memory-utilization 0.85
# Enable CPU offloading for larger models
vllm serve Qwen/Qwen2.5-72B-Instruct --tensor-parallel-size 2 --cpu-offload-gb 20
# Using Docker (example with combined parameters):
docker run --runtime nvidia --gpus all -p 8000:8000 --ipc=host \
vllm/vllm-openai:latest --model Qwen/Qwen2.5-72B-Instruct \
--tensor-parallel-size 2 --cpu-offload-gb 20 --max-model-len 4096
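These flags interact: the KV cache grows linearly with context length and batch size, which is why lowering --max-model-len frees GPU memory. The sketch below gives a rough cache-size estimate; the layer and head counts are illustrative rather than Qwen's actual configuration, and vLLM's PagedAttention allocates this memory in blocks rather than contiguously:

```python
def kv_cache_gb(num_layers, num_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Rough KV cache size: 2 (K and V) * layers * kv_heads * head_dim per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch / 1e9

# Illustrative config: 48 layers, 8 KV heads (GQA), head_dim 128, fp16 cache,
# 16 concurrent sequences at 4096 tokens each — roughly 12.9 GB of cache
print(round(kv_cache_gb(48, 8, 128, seq_len=4096, batch=16), 1))
```

Halving --max-model-len or the concurrent batch size halves this figure, which is often the difference between fitting a model on one GPU or not.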
Monitoring and Metrics
vLLM provides built-in metrics for monitoring performance:
# Access the metrics endpoint while the server is running
curl http://localhost:8000/metrics
Key metrics to watch:
- vllm:prompt_tokens_total and vllm:generation_tokens_total: Token throughput
- vllm:time_per_output_token_seconds: Decoding speed
- vllm:gpu_cache_usage_perc: GPU KV cache utilization
- vllm:e2e_request_latency_seconds: End-to-end request latency
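The /metrics endpoint speaks the Prometheus text exposition format. For quick scripting, a minimal parser like the sketch below can pull out the gauges above; it handles only plain `name value` lines and skips labeled series, so use an actual Prometheus client library in production:

```python
def parse_prometheus(text: str) -> dict:
    """Parse simple 'metric_name value' lines from Prometheus text output.

    Skips comments and labeled series (lines containing '{').
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "{" in line:
            continue
        parts = line.split()
        if len(parts) == 2:
            try:
                metrics[parts[0]] = float(parts[1])
            except ValueError:
                pass
    return metrics
```

Feeding it the body of a GET on http://localhost:8000/metrics yields a dict you can log or alert on.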
Best Practices for Production Deployment
When deploying vLLM in production, consider these best practices:
- Pre-download Models: Download models before serving to avoid startup delays
python -c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen2.5-14B-Instruct')"
- Mount HuggingFace Cache in Docker: Avoid re-downloading models by mounting your cache
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 --ipc=host \
vllm/vllm-openai:latest --model Qwen/Qwen2.5-14B-Instruct
- Implement Health Checks: Configure a health check endpoint that your load balancer can use
curl http://localhost:8000/health
- Set Proper Request Timeouts: Configure your API gateway with appropriate timeouts for longer generations
- Enable API Keys: Add basic authentication
# Using direct installation:
vllm serve Qwen/Qwen2.5-14B-Instruct --api-key your_secret_key
# Using Docker:
docker run --runtime nvidia --gpus all -p 8000:8000 --ipc=host \
vllm/vllm-openai:latest --model Qwen/Qwen2.5-14B-Instruct \
--api-key your_secret_key
Conclusion
vLLM provides a powerful, efficient solution for LLM inference and serving. By following this guide, you can deploy high-performance LLM services on GPU cloud infrastructure with optimized resource utilization and throughput.
Whether you’re serving a smaller 14B parameter model or a massive 72B parameter model across multiple GPUs, vLLM’s architecture enables you to get the most out of your hardware while maintaining compatibility with the OpenAI API ecosystem.
For more advanced use cases and detailed documentation, visit the official vLLM documentation at docs.vllm.ai.