What is MIG?
MIG creates hardware-isolated “virtual GPUs” within one physical card:
- Dedicated memory slices with guaranteed bandwidth
- Independent compute cores (SMs) per slice
- Zero interference between slices
- Quality of Service guarantees
The Problem: A 4GB model on an 80GB GPU wastes 95% of memory
The Solution: Split into multiple slices, run multiple models in parallel
GPU Compatibility
✅ Supported: NVIDIA data center GPUs with Ampere+ architecture (A100, A30, H100, etc.)
❌ Not Supported: Consumer GPUs (GeForce RTX series) or older architectures
Quick Setup Guide
Step 1: Enable MIG Mode
Check if your GPU supports MIG:
nvidia-smi # Look for "MIG M." column in output
Enable persistence mode (keeps GPU driver loaded between jobs):
sudo nvidia-smi -pm 1
Enable MIG mode (performs GPU reset, kills any running GPU processes):
sudo nvidia-smi -i 0 -mig 1 # -i 0 targets GPU 0, -mig 1 enables MIG
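Optionally, confirm that MIG mode is now reported as enabled (these query fields are available on MIG-capable drivers):
nvidia-smi --query-gpu=mig.mode.current,mig.mode.pending --format=csv
# Both columns should read "Enabled"; if only the pending value is Enabled, a GPU reset is still required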
Step 2: View Available Memory Slices
List all available GPU instance profiles (shows slice sizes and placement options):
sudo nvidia-smi mig -lgipp # -lgipp = List GPU Instance Possible Placements
Actual output from A100 80GB:
GPU 0 Profile ID 19 Placements: {0,1,2,3,4,5,6}:1
GPU 0 Profile ID 20 Placements: {0,1,2,3,4,5,6}:1
GPU 0 Profile ID 15 Placements: {0,2,4,6}:2
GPU 0 Profile ID 14 Placements: {0,2,4}:2
GPU 0 Profile ID 9 Placements: {0,4}:4
GPU 0 Profile ID 5 Placement : {0}:4
GPU 0 Profile ID 0 Placement : {0}:8
Profile meanings:
- Profile 19/20: 1g.10gb (~10GB slice; profile 20 adds media engines) – Most flexible placement
- Profile 15: 1g.20gb (~20GB slice, 1 compute slice)
- Profile 14: 2g.20gb (~20GB slice, 2 compute slices)
- Profile 9: 3g.40gb (~40GB slice)
- Profile 5: 4g.40gb (~40GB slice)
- Profile 0: 7g.80gb (~80GB) – Full GPU as single slice
Placement rules: {0,1,2,3,4,5,6}:1 means the instance can start at positions 0-6 and occupies 1 memory slice
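If you prefer to see the profile names and sizes directly instead of decoding the IDs, you can also list the profiles themselves:
sudo nvidia-smi mig -lgip # -lgip = List GPU Instance Profiles (name, memory, SM count, free/total instances)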
Step 3: Create Memory Slices
Create two 20GB memory slices (uses profile 14 for 2g.20gb instances):
sudo nvidia-smi mig -i 0 -cgi 14,14 -C
# -cgi = Create GPU Instances, 14,14 = two instances with profile 14, -C = also create a default compute instance inside each
Verify slices were created (lists all GPU devices including MIG instances):
nvidia-smi -L # -L lists all GPU devices with UUIDs
Output:
GPU 0: NVIDIA A100-SXM4-80GB
MIG 0: 2g.20gb (UUID: MIG-aaaa-1111-bbbb-2222)
MIG 1: 2g.20gb (UUID: MIG-cccc-3333-dddd-4444)
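You can also inspect the created instances through the MIG subcommand itself:
sudo nvidia-smi mig -lgi # -lgi = List GPU Instances (profile, instance ID, placement)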
Running Multiple Models on Different Slices
Step 4: Deploy Models to Each Slice
Extract MIG UUIDs (get the unique identifiers for each slice):
# Get MIG slice UUIDs from nvidia-smi -L output
SLICE0="MIG-aaaa-1111-bbbb-2222" # First 20GB slice UUID
SLICE1="MIG-cccc-3333-dddd-4444" # Second 20GB slice UUID
Run first model on slice 0 (binds container exclusively to first MIG instance):
docker run -d \ --gpus "device=$SLICE0" \ # Bind to specific MIG slice by UUID --name model-slice-0 \ # Container name for management -p 8000:8000 \ # Expose API on port 8000 vllm/vllm-openai:latest \ # High-performance inference engine --model Qwen/Qwen3-4B # Load the 4B parameter model
Run second model on slice 1 (independent instance on second MIG slice):
docker run -d \ --gpus "device=$SLICE1" \ # Bind to second MIG slice --name model-slice-1 \ # Different container name -p 8001:8000 \ # Different host port to avoid conflicts vllm/vllm-openai:latest \ --model Qwen/Qwen3-4B
Step 5: Test Both Models
Test first model (sends request to model running on slice 0):
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "Qwen/Qwen3-4B", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 50}'
Test second model simultaneously (sends request to model on slice 1):
curl -X POST http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "Qwen/Qwen3-4B", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 50}'
Monitor Memory Usage
View current GPU utilization (shows memory usage per MIG slice):
nvidia-smi # Each MIG slice is reported independently in the MIG devices table
Example output:
| 0 0 N/A 5120MiB / 20480MiB | 15% | # Slice 0: ~5GB used
| 0 1 N/A 5120MiB / 20480MiB | 12% | # Slice 1: ~5GB used
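nvidia-smi prints a single snapshot; for a continuously refreshing view, wrap it in watch:
watch -n 1 nvidia-smi # Refreshes every second; each MIG device keeps its own memory and utilization row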
Memory Slice Configurations
Small Models (Multiple 10GB Slices)
Create four 10GB slices (uses profile 19 for smaller memory footprints):
sudo nvidia-smi mig -i 0 -cgi 19,19,19,19 -C # Creates 4x 1g.10gb instances
Deploy different models to each slice (runs 4 independent models):
for i in {0..3}; do
docker run -d --gpus "device=MIG-uuid-$i" --name "model-$i" \
-p "$((8000+i)):8000" vllm/vllm-openai:latest --model Qwen/Qwen3-4B
done
# Each model gets its own port (8000, 8001, 8002, 8003)
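MIG-uuid-$i above is a placeholder; a variant of the same loop that reads the real UUIDs from nvidia-smi -L might look like this (assuming the four 1g.10gb instances just created):
i=0
for uuid in $(nvidia-smi -L | grep -oE 'MIG-[0-9a-fA-F-]+'); do
  docker run -d --gpus "device=$uuid" --name "model-$i" \
    -p "$((8000+i)):8000" vllm/vllm-openai:latest --model Qwen/Qwen3-4B
  i=$((i+1))
done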
Mixed Workloads
Create asymmetric slices (1 large + 2 small for different workload types):
sudo nvidia-smi mig -i 0 -cgi 9,19,19 -C # 1x 40GB + 2x 10GB slices
Deploy larger model on big slice, smaller models on remaining slices:
# Large model needs more memory for complex tasks
docker run -d --gpus "device=$LARGE_SLICE" --name model-large -p 8000:8000 \
  vllm/vllm-openai:latest --model Qwen/Qwen3-14B
# Smaller models for quick inference tasks
docker run -d --gpus "device=$SMALL_SLICE1" --name model-small-1 -p 8001:8000 \
  vllm/vllm-openai:latest --model Qwen/Qwen3-4B
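LARGE_SLICE and SMALL_SLICE1 are placeholder variables; nvidia-smi -L shows which UUID belongs to which slice size, so you can set them by hand:
nvidia-smi -L # The 3g.40gb entry is the large slice, the 1g.10gb entries are the small ones
LARGE_SLICE="MIG-..." # Copy the UUID from the 3g.40gb line (placeholder)
SMALL_SLICE1="MIG-..." # Copy the UUID from one of the 1g.10gb lines (placeholder)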
Key Benefits Achieved
Resource Utilization:
- Before: 1 model, 6% GPU utilization (5GB/80GB)
- After: 2 models, 12.5% utilization (10GB/80GB)
- Potential: Up to 7 small models (1g.10gb slices) on the same hardware
Performance:
- Parallel Processing: Both models respond simultaneously
- Guaranteed Memory: Each slice gets dedicated 20GB
- No Interference: Heavy load on one slice doesn’t affect the other
Cost Efficiency:
- Single Setup: ~€1,380 per model per month when running a dedicated full A100 GPU. This configuration maximizes raw performance but comes at the highest cost per model.
- MIG Setup (2-way split): ~€690 per model per month when partitioning one A100 into 2 MIG instances. Each model gets half of the GPU, balancing efficiency and cost.
- Max Efficiency (multi-instance): ~€172 per model per month if you partition the A100 into 8 logical instances and allocate one model per slice. In practice, A100 officially supports up to 7 concurrent MIG instances, so this figure represents a theoretical upper bound for cost-sharing.
Quick Management Commands
List all current MIG slices (shows active instances with UUIDs):
nvidia-smi -L # Lists all GPU devices including MIG instances
Remove all MIG slices (destroys all instances, returns to single GPU):
sudo nvidia-smi mig -i 0 -dci # -dci = destroy compute instances (must be done first)
sudo nvidia-smi mig -i 0 -dgi # -dgi = destroy all GPU instances
Disable MIG completely (returns to full 80GB GPU mode):
sudo nvidia-smi -i 0 -mig 0 # -mig 0 disables MIG, requires GPU reset
Clean up Docker containers (stops and removes model containers):
docker stop model-slice-0 model-slice-1 # Stop running containers
docker rm model-slice-0 model-slice-1 # Remove stopped containers
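Put together, a full teardown in the right order (containers first, then compute instances, then GPU instances, then MIG mode) looks like this:
docker stop model-slice-0 model-slice-1 && docker rm model-slice-0 model-slice-1
sudo nvidia-smi mig -i 0 -dci # Destroy compute instances
sudo nvidia-smi mig -i 0 -dgi # Destroy GPU instances
sudo nvidia-smi -i 0 -mig 0 # Disable MIG mode (GPU reset)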
Core Concept Summary
MIG transforms expensive GPU hardware into efficient multi-tenant infrastructure, enabling multiple AI models to share resources safely while maintaining performance guarantees.
The main points to remember:
- MIG partitions GPU memory into isolated slices
- Each slice acts as independent GPU with dedicated resources
- Docker containers bind to specific slices using MIG UUIDs
- Multiple models run in parallel without interference
- Memory utilization increases dramatically, from a single model using a few percent of the card to up to 7 models sharing it
Do you want to experience the benefits of MIG and make the most of your infrastructure investment? Try our Cloud Server GPU with NVIDIA A100 graphics.