Accelerating LLM Inference with vLLM: A Hands-on Guide

Large Language Models (LLMs) have revolutionized AI applications, but deploying them efficiently for inference remains challenging. This guide demonstrates how to use vLLM, an open-source library for high-throughput LLM inference, on cloud GPU servers to dramatically improve inference performance and resource utilization.

What is vLLM?

vLLM is a high-performance library for LLM inference and serving that offers:

  • Efficient Memory Management: Uses PagedAttention for optimized KV cache management
  • Distributed Inference: Supports tensor parallelism and pipeline parallelism across multiple GPUs and nodes
  • Continuous Batching: Processes requests dynamically for higher throughput
  • OpenAI-Compatible API: Offers drop-in replacement for OpenAI API users
  • Multiple Model Formats: Supports a wide range of model architectures and quantization methods


In this guide, we’ll explore how to deploy vLLM on GPU servers for optimal performance.

Getting Started

Step 1: Provision a GPU Server

Before diving into vLLM, we need to provision a suitable GPU server. For this demonstration, any cloud server with a recent NVIDIA GPU and enough VRAM for your chosen model will work:

  1. Provision a GPU server with NVIDIA drivers pre-installed.
  2. Connect to your server:

$ ssh -i ~/.ssh/your_key.pem user@your_server_ip

  3. Verify that your GPU is correctly recognized:

$ nvidia-smi

This command should display your GPU model, memory capacity, and driver version.

Step 2: Choose Your Deployment Method

You can deploy vLLM using either Docker (recommended for simplicity) or a direct installation. Let’s explore both options:

Option A: Using Docker (Recommended)

Docker provides a simple way to deploy vLLM without worrying about dependencies or environment setup. First, you’ll need to install and configure the NVIDIA Container Toolkit:

# Install NVIDIA Container Toolkit
sudo apt-get install -y nvidia-container-toolkit

# Configure Docker to use NVIDIA runtime
sudo nvidia-ctk runtime configure --runtime=docker

# Restart Docker service
sudo systemctl restart docker

Now you can run vLLM using the official Docker image that comes with everything pre-installed:

# Run vLLM with a Qwen model using Docker
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model Qwen/Qwen2.5-14B-Instruct

This command:

  • Mounts your local HuggingFace cache to avoid redundant downloads

  • Exposes port 8000 for the API server

  • Uses --ipc=host for sharing memory between processes (required for tensor parallelism)

  • Loads the Qwen2.5-14B-Instruct model

Option B: Direct Installation

Alternatively, you can install vLLM directly:

# Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create a new virtual environment with Python 3.12
uv venv myenv --python 3.12 --seed
source myenv/bin/activate

# Install vLLM
uv pip install vllm

Running Inference with vLLM

vLLM offers two primary modes of operation:

  • Offline Inference: For batch processing of prompts
  • Online Serving: For real-time API serving

Let's explore both approaches.

Offline Batch Inference

For batch processing tasks, we can use vLLM’s Python API:

from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(model="Qwen/Qwen2.5-14B-Instruct")

# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512
)

# List of prompts to process
prompts = [
    "Write a short poem about artificial intelligence.",
    "Explain quantum computing in simple terms.",
    "What are three ways to improve cloud infrastructure?",
    "Summarize the history of machine learning in two paragraphs."
]

# Generate outputs
outputs = llm.generate(prompts, sampling_params)

# Process results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt}")
    print(f"Generated text: {generated_text}")
    print("-" * 50)

Online Serving with OpenAI-Compatible API

vLLM includes an OpenAI-compatible API server, making it a drop-in replacement for applications using OpenAI’s API:

# If using direct installation:
vllm serve Qwen/Qwen2.5-14B-Instruct --dtype auto

# If using Docker:
docker run --runtime nvidia --gpus all -p 8000:8000 --ipc=host \
    vllm/vllm-openai:latest --model Qwen/Qwen2.5-14B-Instruct --dtype auto

Once the server is running, you can interact with it using any HTTP client:

import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "Qwen/Qwen2.5-14B-Instruct",
        "prompt": "Write a function in Python to check if a string is a palindrome.",
        "max_tokens": 250,
        "temperature": 0.7
    }
)

print(response.json())
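The JSON body returned by the completions endpoint follows the OpenAI schema, so the generated text lives under `choices[0].text`. A small helper makes that explicit; note the sample dict below is a hand-written, trimmed illustration of the response shape, not a captured server response:

```python
# Pull the generated text out of an OpenAI-style completions response body.
def extract_completion_text(body: dict) -> str:
    return body["choices"][0]["text"]

# Trimmed, illustrative example of what response.json() returns
sample_body = {
    "id": "cmpl-abc123",
    "object": "text_completion",
    "model": "Qwen/Qwen2.5-14B-Instruct",
    "choices": [
        {"index": 0, "text": "def is_palindrome(s): ...", "finish_reason": "length"}
    ],
}

print(extract_completion_text(sample_body))  # → def is_palindrome(s): ...
```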

Or use the OpenAI Python client:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # Unless you've set an API key
)

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-14B-Instruct",
    messages=[
        {"role": "user", "content": "Explain how to implement a binary search algorithm."}
    ]
)

print(completion.choices[0].message.content)

Optimizing Performance

Choosing the Right Data Type

vLLM supports various precision formats. Lower precision reduces memory usage but may slightly impact quality:

# Using direct installation:
# Use bfloat16 precision (good balance between accuracy and memory)
vllm serve Qwen/Qwen2.5-14B-Instruct --dtype bfloat16

# For maximum memory efficiency (may reduce quality)
vllm serve Qwen/Qwen2.5-14B-Instruct --dtype float16

# Using Docker:
docker run --runtime nvidia --gpus all -p 8000:8000 --ipc=host \
    vllm/vllm-openai:latest --model Qwen/Qwen2.5-14B-Instruct --dtype bfloat16
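To see why dtype matters, you can estimate the memory the weights alone will occupy: parameter count times bytes per parameter. This sketch counts only weights and deliberately ignores KV cache, activations, and CUDA overhead, so treat it as a lower bound:

```python
# Bytes per parameter for common precisions
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}

def weight_memory_gb(num_params_billions: float, dtype: str) -> float:
    """Approximate weight memory in GB (lower bound: weights only)."""
    return num_params_billions * 1e9 * BYTES_PER_PARAM[dtype] / 1e9

# A 14B model in bfloat16 needs roughly 28 GB for weights alone,
# versus 56 GB in full float32:
print(round(weight_memory_gb(14, "bfloat16")))  # → 28
print(round(weight_memory_gb(14, "float32")))   # → 56
```

This is why a 14B model in bfloat16 won't fit on a 24 GB card without quantization, even before vLLM reserves space for the KV cache.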

Enabling Quantization

Quantization can significantly reduce memory usage with minimal quality impact:

# Using direct installation:
# Using GPTQ Int8 quantization
vllm serve Qwen/Qwen2.5-14B-Instruct-GPTQ-Int8 --quantization gptq

# Using AWQ quantization
vllm serve Qwen/Qwen2.5-14B-Instruct-AWQ --quantization awq

# Using Docker:
docker run --runtime nvidia --gpus all -p 8000:8000 --ipc=host \
    vllm/vllm-openai:latest --model Qwen/Qwen2.5-14B-Instruct-GPTQ-Int8 \
    --quantization gptq

Adjusting Memory Usage

Control memory usage with these parameters:

# Using direct installation:
# Set maximum model context length
vllm serve Qwen/Qwen2.5-14B-Instruct --max-model-len 4096

# Control GPU memory utilization (range 0-1)
vllm serve Qwen/Qwen2.5-14B-Instruct --gpu-memory-utilization 0.85

# Enable CPU offloading for larger models
vllm serve Qwen/Qwen2.5-72B-Instruct --tensor-parallel-size 2 --cpu-offload-gb 20

# Using Docker (example with combined parameters):
docker run --runtime nvidia --gpus all -p 8000:8000 --ipc=host \
    vllm/vllm-openai:latest --model Qwen/Qwen2.5-72B-Instruct \
    --tensor-parallel-size 2 --cpu-offload-gb 20 --max-model-len 4096
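The reason `--max-model-len` affects memory is that the KV cache grows linearly with context length: each layer stores a key and a value tensor per token. The standard estimate is 2 × layers × KV heads × head dim × sequence length × bytes per value. The architecture numbers below are illustrative placeholders, not the real Qwen2.5-14B configuration:

```python
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GB for one sequence: 2 tensors (K and V) per layer."""
    total = 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value
    return total / 1e9

# Example: a hypothetical 40-layer model with 8 KV heads of dim 128,
# holding a single 4096-token context in bfloat16 (2 bytes per value):
print(round(kv_cache_gb(40, 8, 128, 4096), 2))  # → 0.67
```

Under continuous batching, vLLM holds many such sequences at once, so capping `--max-model-len` directly caps the worst-case cache footprint per request.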

Multi-GPU Configuration

For larger models or higher throughput requirements, vLLM supports multi-GPU deployment using tensor parallelism.

You can run vLLM across multiple GPUs on a single machine:

# Using direct installation:
vllm serve Qwen/Qwen2.5-72B-Instruct \
    --tensor-parallel-size 4 \
    --dtype bfloat16

# Using Docker:
docker run --runtime nvidia --gpus all -p 8000:8000 --ipc=host \
    vllm/vllm-openai:latest --model Qwen/Qwen2.5-72B-Instruct \
    --tensor-parallel-size 4 --dtype bfloat16

This configuration splits the model across 4 GPUs using tensor parallelism, allowing you to load larger models or increase throughput.
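A quick way to sanity-check whether a tensor-parallel setup will fit: each GPU holds roughly 1/tp_size of the weights. This back-of-the-envelope sketch assumes bfloat16 (2 bytes per parameter) and, as before, ignores the KV cache and runtime overhead:

```python
def per_gpu_weight_gb(num_params_billions: float, tp_size: int,
                      bytes_per_param: int = 2) -> float:
    """Approximate weight memory per GPU under tensor parallelism."""
    return num_params_billions * bytes_per_param / tp_size

# A 72B model in bfloat16 split across 4 GPUs: ~36 GB of weights each,
# which is why this configuration targets GPUs with 40 GB+ of memory.
print(per_gpu_weight_gb(72, 4))  # → 36.0
```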

Monitoring and Metrics

vLLM provides built-in metrics for monitoring performance:

# Access metrics endpoint when server is running

curl http://localhost:8000/metrics

Key metrics to monitor:

  • vllm:prompt_tokens_total and vllm:generation_tokens_total: Token throughput
  • vllm:time_per_output_token_seconds: Decoding speed
  • vllm:gpu_cache_usage_perc: KV cache utilization
  • vllm:e2e_request_latency_seconds: End-to-end latency
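The endpoint returns Prometheus text format. In production you would point Prometheus or Grafana at it, but for ad-hoc checks a minimal parser is enough. This sketch keeps only un-labeled or simply-labeled lines and is fed a hand-written sample, not real server output:

```python
def parse_vllm_metrics(text: str) -> dict[str, float]:
    """Extract vllm-prefixed metrics from Prometheus text exposition format."""
    metrics = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        name, _, value = line.rpartition(" ")
        if name.startswith("vllm:"):
            # Drop any {label="..."} suffix to keep the bare metric name
            metrics[name.split("{")[0]] = float(value)
    return metrics

# Hand-written sample in the exposition format
sample = """# HELP vllm:prompt_tokens_total Number of prefill tokens processed.
vllm:prompt_tokens_total 12345.0
vllm:gpu_cache_usage_perc 0.42
"""

print(parse_vllm_metrics(sample))
```

Note this naive split on the last space would misparse labels containing spaces; a real deployment should rely on a proper Prometheus client instead.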

Best Practices for Production Deployment

When deploying vLLM in production, consider these best practices:

  1. Pre-download Models: Download models before serving to avoid startup delays
python -c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen2.5-14B-Instruct')"
  2. Mount HuggingFace Cache in Docker: Avoid re-downloading models by mounting your cache
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 --ipc=host \
    vllm/vllm-openai:latest --model Qwen/Qwen2.5-14B-Instruct
  3. Implement Health Checks: Configure a health check endpoint that your load balancer can use
curl http://localhost:8000/health
  4. Set Proper Request Timeouts: Configure your API gateway with appropriate timeouts for longer generations
  5. Enable API Keys: Add basic authentication
# Using direct installation:
vllm serve Qwen/Qwen2.5-14B-Instruct --api-key your_secret_key

# Using Docker:
docker run --runtime nvidia --gpus all -p 8000:8000 --ipc=host \
    vllm/vllm-openai:latest --model Qwen/Qwen2.5-14B-Instruct \
    --api-key your_secret_key
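Because large models can take minutes to load, a readiness gate that polls the health check before routing traffic is worth wiring into your deployment scripts. In this sketch the `check` callable is a stand-in for an HTTP GET against `/health` (e.g. via `requests`), so the example stays self-contained:

```python
import time

def wait_until_ready(check, timeout_s: float = 300, interval_s: float = 5) -> bool:
    """Poll `check()` until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if check():
                return True
        except Exception:
            pass  # server not accepting connections yet; keep polling
        time.sleep(interval_s)
    return False

# With a check that succeeds immediately, the gate returns at once:
print(wait_until_ready(lambda: True, timeout_s=1, interval_s=0.1))  # → True
```

In a real script, `check` would be something like `lambda: requests.get("http://localhost:8000/health", timeout=2).ok`.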

Conclusion

vLLM provides a powerful, efficient solution for LLM inference and serving. By following this guide, you can deploy high-performance LLM services on GPU cloud infrastructure with optimized resource utilization and throughput.

Whether you’re serving a smaller 14B parameter model or a massive 72B parameter model across multiple GPUs, vLLM’s architecture enables you to get the most out of your hardware while maintaining compatibility with the OpenAI API ecosystem.

For more advanced use cases and detailed documentation, visit the official vLLM documentation at docs.vllm.ai.
