GPU Memory Slicing: Running Multiple AI Models with NVIDIA MIG

NVIDIA Multi-Instance GPU (MIG) partitions a single GPU into isolated memory slices, each acting as an independent GPU. Seeweb offers MIG partitions on its Cloud Server GPU NVIDIA cluster. This guide demonstrates running two Qwen3-4B models simultaneously on different slices of an NVIDIA A100 80GB on Seeweb’s Cloud Server GPU.

What is MIG?

MIG creates hardware-isolated “virtual GPUs” within one physical card:

  • Dedicated memory slices with guaranteed bandwidth
  • Independent compute cores (SMs) per slice
  • Zero interference between slices
  • Quality of Service guarantees

The Problem: A 4GB model on an 80GB GPU wastes 95% of memory
The Solution: Split into multiple slices, run multiple models in parallel
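The waste figure above is simple arithmetic; a quick sanity check, assuming the 4GB model and 80GB card from the example:

```shell
# Share of GPU memory left idle by a single small model
model_gb=4
gpu_gb=80
wasted=$(( (gpu_gb - model_gb) * 100 / gpu_gb ))
echo "${wasted}% of memory wasted"   # 95% of memory wasted
```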

GPU Compatibility

Supported: NVIDIA data center GPUs with Ampere+ architecture (A100, A30, H100, etc.)
Not Supported: Consumer GPUs (GeForce RTX series) or older architectures

Quick Setup Guide

Step 1: Enable MIG Mode

Check if your GPU supports MIG:

nvidia-smi  # Look for "MIG M." column in output

Enable persistence mode (keeps GPU driver loaded between jobs):

sudo nvidia-smi -pm 1

Enable MIG mode (performs GPU reset, kills any running GPU processes):

sudo nvidia-smi -i 0 -mig 1   # -i 0 targets GPU 0, -mig 1 enables MIG

Step 2: View Available Memory Slices

List all available GPU instance profiles (shows slice sizes and placement options):

sudo nvidia-smi mig -lgipp  # -lgipp = list GPU instance profile placements

Actual output from A100 80GB:

GPU  0 Profile ID 19 Placements: {0,1,2,3,4,5,6}:1
GPU  0 Profile ID 20 Placements: {0,1,2,3,4,5,6}:1
GPU  0 Profile ID 15 Placements: {0,2,4,6}:2
GPU  0 Profile ID 14 Placements: {0,2,4}:2
GPU  0 Profile ID  9 Placements: {0,4}:4
GPU  0 Profile ID  5 Placement : {0}:4
GPU  0 Profile ID  0 Placement : {0}:8

Profile meanings (A100 80GB):

  • Profile 19: 1g.10gb (~10GB) – Most flexible placement; Profile 20 is the 1g.10gb+me variant (adds media engines)
  • Profile 15: 1g.20gb / Profile 14: 2g.20gb (~20GB) – Medium slices
  • Profile 9: 3g.40gb / Profile 5: 4g.40gb (~40GB) – Large slices
  • Profile 0: 7g.80gb (~80GB) – Full GPU as a single slice

Placement rules: {0,1,2,3,4,5,6}:1 means instances of this profile occupy 1 memory slice and can start at any of positions 0-6
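A combination of profiles only fits if its total memory-slice footprint fits on the card (8 memory slices on the A100 80GB, per the placement sizes above). The helper below is a hypothetical, simplified check: real MIG placement also constrains start positions and compute slices, so treat it as a first filter only.

```shell
# Rough feasibility check: do the requested memory-slice sizes
# (the number after the colon in the placement list) fit in the
# A100's 8 memory slices? Simplified -- real placement rules also
# constrain start positions and compute slices.
fits() {
  total=0
  for size in "$@"; do
    total=$(( total + size ))
  done
  if [ "$total" -le 8 ]; then echo "fits"; else echo "does not fit"; fi
}

fits 2 2      # two 20GB slices -> fits
fits 4 2 2    # one 40GB + two 20GB -> fits
fits 4 4 2    # -> does not fit (10 > 8 memory slices)
```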

Step 3: Create Memory Slices

Create two 20GB memory slices (profile 14 gives 2g.20gb instances):

sudo nvidia-smi mig -i 0 -cgi 14,14 -C
# -cgi = create GPU instances; 14,14 = two instances with profile 14
# -C = also create the default compute instance inside each GPU instance

Verify slices were created (lists all GPU devices including MIG instances):

nvidia-smi -L  # -L lists all GPU devices with UUIDs

Output:

GPU 0: NVIDIA A100-SXM4-80GB
  MIG 0: 2g.20gb (UUID: MIG-aaaa-1111-bbbb-2222)
  MIG 1: 2g.20gb (UUID: MIG-cccc-3333-dddd-4444)

Running Multiple Models on Different Slices

Step 4: Deploy Models to Each Slice

Extract MIG UUIDs (get the unique identifiers for each slice):

# Get MIG slice UUIDs from nvidia-smi -L output
SLICE0="MIG-aaaa-1111-bbbb-2222"  # First 20GB slice UUID
SLICE1="MIG-cccc-3333-dddd-4444"  # Second 20GB slice UUID
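Rather than copying the UUIDs by hand, they can be parsed out of `nvidia-smi -L`. A sketch, with the sample output from above inlined so the parsing is reproducible (on a real host, pipe `nvidia-smi -L` directly):

```shell
# Parse MIG UUIDs out of `nvidia-smi -L` output.
# Sample output inlined for illustration; on a real host use:
#   nvidia-smi -L | grep -o 'MIG-[a-z0-9-]*'
list_output='GPU 0: NVIDIA A100-SXM4-80GB
  MIG 0: 2g.20gb (UUID: MIG-aaaa-1111-bbbb-2222)
  MIG 1: 2g.20gb (UUID: MIG-cccc-3333-dddd-4444)'

SLICE0=$(printf '%s\n' "$list_output" | grep -o 'MIG-[a-z0-9-]*' | sed -n 1p)
SLICE1=$(printf '%s\n' "$list_output" | grep -o 'MIG-[a-z0-9-]*' | sed -n 2p)
echo "$SLICE0"   # MIG-aaaa-1111-bbbb-2222
echo "$SLICE1"   # MIG-cccc-3333-dddd-4444
```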

Run the first model on slice 0 (binds the container exclusively to the first MIG instance):

# --gpus "device=$SLICE0"  binds the container to a specific MIG slice by UUID
# -p 8000:8000             exposes the OpenAI-compatible API on host port 8000
docker run -d \
  --name model-slice-0 \
  --gpus "device=$SLICE0" \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen3-4B

Run the second model on slice 1 (an independent instance on the second MIG slice):

# Same image; different slice, container name, and host port
docker run -d \
  --name model-slice-1 \
  --gpus "device=$SLICE1" \
  -p 8001:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen3-4B

Step 5: Test Both Models

Test the first model (sends a request to the model running on slice 0):

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-4B", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 50}'

Test the second model simultaneously (note the host port 8001, mapped to slice 1's container):

curl -X POST http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-4B", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 50}'
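The two requests differ only in the host port, which follows the slice index. A small hypothetical helper (`chat_url` is not part of vLLM) makes that pattern explicit and scales to more slices:

```shell
# Endpoint for the model on slice N: host port 8000+N, per the
# port mappings above. chat_url is a hypothetical helper.
chat_url() {
  echo "http://localhost:$((8000 + $1))/v1/chat/completions"
}

# Query both slices in parallel (run on a host with the containers up):
# for i in 0 1; do
#   curl -s -X POST "$(chat_url $i)" \
#     -H "Content-Type: application/json" \
#     -d '{"model": "Qwen/Qwen3-4B", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 50}' &
# done
# wait
chat_url 1   # http://localhost:8001/v1/chat/completions
```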

Monitor Memory Usage

View real-time GPU utilization (shows memory usage per MIG slice):

nvidia-smi  # Shows each MIG slice's memory and utilization independently; wrap in `watch -n 1` for live updates

Example output:

|   0   0      N/A    5120MiB / 20480MiB |  15%  |  # Slice 0: ~5GB used
|   0   1      N/A    5120MiB / 20480MiB |  12%  |  # Slice 1: ~5GB used
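Per-slice usage can also be totalled from that output. A sketch over the sample lines above (field positions vary by driver version, so on a real host `nvidia-smi --query-gpu` style queries are more robust):

```shell
# Sum memory used across MIG slices from nvidia-smi output.
# Sample lines inlined; on a real host pipe `nvidia-smi` directly.
sample='|   0   0      N/A    5120MiB / 20480MiB |  15%  |
|   0   1      N/A    5120MiB / 20480MiB |  12%  |'

printf '%s\n' "$sample" \
  | grep -o '[0-9]*MiB / ' \
  | tr -d 'MiB/ ' \
  | awk '{s += $1} END {print s " MiB used"}'   # 10240 MiB used
```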

Memory Slice Configurations

Small Models (Multiple 10GB Slices)

Create four 10GB slices (uses profile 19 for smaller memory footprints):

sudo nvidia-smi mig -i 0 -cgi 19,19,19,19 -C  # Creates 4x 1g.10gb instances

Deploy different models to each slice (runs 4 independent models):

# Substitute each slice's real UUID from `nvidia-smi -L` (UUIDs are not sequential)
MIG_UUIDS=($(nvidia-smi -L | grep -o 'MIG-[a-z0-9-]*'))
for i in {0..3}; do
  docker run -d --gpus "device=${MIG_UUIDS[$i]}" --name "model-$i" \
    -p "$((8000+i)):8000" vllm/vllm-openai:latest --model Qwen/Qwen3-4B
done
# Each model gets its own port (8000, 8001, 8002, 8003)

Mixed Workloads

Create asymmetric slices (1 large + 2 small for different workload types):

sudo nvidia-smi mig -i 0 -cgi 9,19,19 -C  # 1x 3g.40gb + 2x 1g.10gb slices

Deploy larger model on big slice, smaller models on remaining slices:

# Large model on the 40GB slice (LARGE_SLICE and SMALL_SLICE1 hold MIG UUIDs from nvidia-smi -L)
docker run -d --gpus "device=$LARGE_SLICE" -p 8000:8000 \
  vllm/vllm-openai:latest --model Qwen/Qwen3-14B

# Smaller model for quick inference tasks
docker run -d --gpus "device=$SMALL_SLICE1" -p 8001:8000 \
  vllm/vllm-openai:latest --model Qwen/Qwen3-4B

Key Benefits Achieved

Resource Utilization:

  • Before: 1 model, 6% GPU utilization (5GB/80GB)
  • After: 2 models, 12.5% utilization (10GB/80GB)
  • Potential: Up to 7 small models (1g.10gb slices) on the same hardware

Performance:

  • Parallel Processing: Both models respond simultaneously
  • Guaranteed Memory: Each slice gets dedicated 20GB
  • No Interference: Heavy load on one slice doesn’t affect the other

Cost Efficiency:

  • Single Setup: ~€1,380 per model per month when running a dedicated full A100 GPU. This configuration maximizes raw performance but comes at the highest cost per model.
  • MIG Setup (2-way split): ~€690 per model per month when partitioning one A100 into 2 MIG instances. Each model gets half of the GPU, balancing efficiency and cost.
  • Max Efficiency (multi-instance): ~€172 per model per month if you partition the A100 into 8 logical instances and allocate one model per slice. In practice, A100 officially supports up to 7 concurrent MIG instances, so this figure represents a theoretical upper bound for cost-sharing.
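The per-model figures above follow from straight division of the illustrative full-A100 monthly price:

```shell
# Per-model monthly cost = full-GPU price / number of slices
# (~EUR 1,380/month is the example figure from the list above)
monthly=1380
echo "2-way split: EUR $(( monthly / 2 )) per model"   # EUR 690
echo "8-way split: EUR $(( monthly / 8 )) per model"   # EUR 172 (theoretical; A100 tops out at 7 MIG instances)
```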

Quick Management Commands

List all current MIG slices (shows active instances with UUIDs):

nvidia-smi -L  # Lists all GPU devices including MIG instances

Remove all MIG slices (destroys all instances, returns to single GPU):

sudo nvidia-smi mig -i 0 -dci  # -dci destroys all compute instances (required first)
sudo nvidia-smi mig -i 0 -dgi  # -dgi then destroys all GPU instances

Disable MIG completely (returns to full 80GB GPU mode):

sudo nvidia-smi -i 0 -mig 0  # -mig 0 disables MIG, requires GPU reset

Clean up Docker containers (stops and removes model containers):

docker stop model-slice-0 model-slice-1    # Stop running containers
docker rm model-slice-0 model-slice-1      # Remove stopped containers

Core Concept Summary

MIG transforms expensive GPU hardware into efficient multi-tenant infrastructure, enabling multiple AI models to share resources safely while maintaining performance guarantees.

Here is the list of the main features and advantages:

  1. MIG partitions GPU memory into isolated slices
  2. Each slice acts as independent GPU with dedicated resources
  3. Docker containers bind to specific slices using MIG UUIDs
  4. Multiple models run in parallel without interference
  5. Memory utilization improves dramatically, from single-digit percentages to as many as 7 models sharing one GPU

Do you want to experience the benefits of MIG and make the most of your infrastructure investment? Try our Cloud Server GPU with NVIDIA A100 GPUs.
