Prerequisites
Before you begin, ensure your Ubuntu machine meets the following requirements:
- Ubuntu Linux Machine: A server or desktop running a recent version of Ubuntu (e.g., 20.04 LTS or 22.04 LTS).
- SSH Access: You can SSH into the machine.
- Sudo Privileges: You have sudo access.
- NVIDIA GPU: An NVIDIA GPU suitable for running the chosen Qwen model (Qwen2.5-7B-Instruct needs roughly 16GB+ VRAM in FP16; quantized variants require less).
- NVIDIA Drivers: NVIDIA drivers correctly installed and nvidia-smi command working and showing your GPU.
- Internet Access: For downloading packages, Docker images, and models.
- Sufficient Resources: Adequate CPU cores, RAM (32GB-48GB+, depending on model precision and system overhead), and disk space.
Section 1: Prepare the Ubuntu Machine
These steps prepare your Ubuntu system with necessary dependencies.
Update System Packages
sudo apt-get update
sudo apt-get upgrade -y
Verify NVIDIA Drivers
Ensure your NVIDIA drivers are installed and the GPU is recognized:
nvidia-smi
If this command fails or doesn’t show your GPU, you must install or troubleshoot your NVIDIA drivers before proceeding.
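If you automate provisioning, it can help to wrap this check in a reusable function that fails fast (a small sketch; the function name is our own):

```shell
# Returns 0 only when the nvidia-smi binary exists and runs without error
check_nvidia_driver() {
  command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1
}

if check_nvidia_driver; then
  echo "NVIDIA driver OK"
else
  echo "NVIDIA driver missing or broken; fix it before continuing" >&2
fi
```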
Install Docker
Kubernetes needs a container runtime. We install Docker here, which we also use to verify GPU access from containers (k3s itself ships with containerd).
# Install prerequisites
sudo apt-get install -y apt-transport-https ca-certificates curl software-properties-common
# Add Docker's official GPG key
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
# Add Docker's apt repository
sudo add-apt-repository "deb [arch=$(dpkg --print-architecture)] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
# Install Docker CE
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io
Troubleshooting Docker: If sudo systemctl status docker shows issues, check logs with sudo journalctl -xeu docker.service. Common issues include port conflicts or resource limitations.
Install NVIDIA Container Toolkit
This allows Docker (and subsequently Kubernetes) to use your NVIDIA GPUs.
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
Test if Docker can see the GPU:
sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
This command should output the same nvidia-smi details as before, but from within a Docker container.
Install Helm
Helm is a package manager for Kubernetes, used to install KubeAI.
curl https://baltocdn.com/helm/signing.asc | gpg --dearmor | sudo tee /usr/share/keyrings/helm.gpg > /dev/null
sudo apt-get install apt-transport-https --yes
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
sudo apt-get update
sudo apt-get install -y helm
helm version
Section 2: Install and Configure k3s Kubernetes Cluster
k3s is a lightweight, certified Kubernetes distribution.
Install k3s
# Install k3s without Traefik (we don't need it)
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--disable=traefik" sh -
# Configure kubectl
mkdir -p $HOME/.kube
sudo cp /etc/rancher/k3s/k3s.yaml $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
export KUBECONFIG=$HOME/.kube/config
echo "export KUBECONFIG=$HOME/.kube/config" >> $HOME/.bashrc
# Verify K3s is running
kubectl get nodes
You should see your node in a Ready state.
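If you script this step, a small helper can poll until the node reports Ready. The helper below only parses `kubectl get nodes --no-headers` output, so the poll loop itself is left commented out for use on the server (the function name is our own):

```shell
# Reads "kubectl get nodes --no-headers" lines on stdin and succeeds
# only when every node's STATUS column is exactly "Ready"
all_nodes_ready() {
  awk '$2 != "Ready" { bad = 1 } END { exit bad }'
}

# On the server, poll until ready (uncomment to use):
# until kubectl get nodes --no-headers | all_nodes_ready; do sleep 5; done
```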
Troubleshooting k3s:
If kubectl get nodes fails (e.g., connection refused):
- Check k3s service status: sudo systemctl status k3s
- View logs: sudo journalctl -u k3s -f
- Restart if necessary: sudo systemctl restart k3s
Install NVIDIA GPU Operator
For Kubernetes to effectively manage and schedule workloads on NVIDIA GPUs, the NVIDIA GPU Operator is highly recommended.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=false \
  --wait
Note: We set driver.enabled=false because the driver is already installed on the host. The operator will handle the rest (container runtime, device plugin, etc.).
Verify the operator pods are running:
kubectl get pods -n gpu-operator
Wait until all pods are in a Running or Completed state. This can take several minutes. Once the operator is running, your nodes with GPUs should be labeled and have allocatable GPU resources.
Check with:
kubectl describe node $(kubectl get nodes -o jsonpath='{.items[0].metadata.name}') | grep nvidia.com/gpu
You should see nvidia.com/gpu listed under Allocatable and Capacity.
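Equivalently, you can read the allocatable GPU count straight from the node object with jsonpath (assumes your kubeconfig is set up; the dot in the resource name must be escaped):

```shell
# Print the number of allocatable NVIDIA GPUs on the first node
kubectl get nodes \
  -o jsonpath='{.items[0].status.allocatable.nvidia\.com/gpu}'
```

A non-empty number (e.g., 1) means the GPU Operator has advertised the GPU to the scheduler.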
Section 3: Install KubeAI with NVIDIA GPU Support
Now we’ll install KubeAI with specific NVIDIA GPU configurations.
# Add the KubeAI helm repository
helm repo add kubeai https://www.kubeai.org
helm repo update
# Create namespace for KubeAI
kubectl create namespace kubeai
# Download the values file for NVIDIA GPU
curl -L -O https://raw.githubusercontent.com/substratusai/kubeai/refs/heads/main/charts/kubeai/values-nvidia-k8s-device-plugin.yaml
# Install KubeAI
helm upgrade --install kubeai kubeai/kubeai \
  -f values-nvidia-k8s-device-plugin.yaml \
  --namespace kubeai \
  --wait
If you need to use Hugging Face models that require authentication, add
--set secrets.huggingface.token=$HF_TOKEN
to the helm command above (after exporting your token with export HF_TOKEN=your-hugging-face-token).
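Putting that together, the full install command with the token set would look like this (the token value is a placeholder):

```shell
# Export your Hugging Face token first (placeholder value)
export HF_TOKEN=your-hugging-face-token

helm upgrade --install kubeai kubeai/kubeai \
  -f values-nvidia-k8s-device-plugin.yaml \
  --namespace kubeai \
  --set secrets.huggingface.token=$HF_TOKEN \
  --wait
```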
Section 4: Deploy Qwen2.5-7B-Instruct Model
Let’s deploy the Qwen2.5-7B-Instruct model using KubeAI’s Model CRD (Custom Resource Definition).
cat <<EOF | kubectl apply -f -
apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: qwen2.5-7b
  namespace: kubeai
spec:
  features: [TextGeneration]
  owner: Qwen
  url: hf://Qwen/Qwen2.5-7B-Instruct
  engine: VLLM
  resourceProfile: nvidia-gpu-l40:1
  minReplicas: 1
EOF
Note: resourceProfile must match one of the resource profiles defined in the KubeAI values (here nvidia-gpu-l40:1, i.e., one L40 GPU); adjust it to match your hardware.
Check if the model deployment is in progress:
kubectl get model -n kubeai
kubectl get pods -n kubeai
Wait for the model pod to reach Running state. This may take several minutes as the container downloads the model from Hugging Face.
Section 5: Access the Model Service
To access the model service from your local machine, you’ll need to set up both port forwarding on the server and an SSH tunnel from your local machine:
Step 1: Create SSH Tunnel
On your local machine, create an SSH tunnel to the server:
ssh -i ~/.ssh/your_key -L 8080:localhost:8080 root@your_server_ip
Replace your_key with your SSH key file name and your_server_ip with your server’s IP address.
Step 2: Set Up Port Forwarding on the Server
Once the SSH tunnel is active, on the remote server run:
kubectl -n kubeai port-forward svc/open-webui 8080:80
This forwards the Open WebUI service to port 8080 on the server, which the SSH tunnel then exposes on port 8080 of your local machine.
Section 6: Testing the Model
With the port forwarding active, you can now:
- Open your web browser and navigate to http://localhost:8080
- Use the Open WebUI interface (bundled with KubeAI) to interact with your Qwen2.5-7B-Instruct model
Alternatively, you can make API calls directly:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5-7b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"}
],
"temperature": 0.7,
"max_tokens": 200
}'
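If you script these calls, keeping the request body in a variable makes it easy to vary the prompt. The sketch below builds the same JSON payload as above; the curl invocation is commented out since it needs the tunnel active:

```shell
# Same request body as above, stored in a variable for reuse
read -r -d '' PAYLOAD <<'EOF' || true
{
  "model": "qwen2.5-7b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
  ],
  "temperature": 0.7,
  "max_tokens": 200
}
EOF

# With the port-forward active, send it (uncomment to use):
# curl -s http://localhost:8080/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d "$PAYLOAD"
```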
Conclusion
You have now successfully deployed a Qwen2.5-7B-Instruct model on your Ubuntu machine using k3s and KubeAI. This setup provides a lightweight yet powerful infrastructure for running AI models locally with GPU acceleration.
For more advanced configurations, including multiple models, custom resource profiles, or integration with other services, refer to the official KubeAI documentation.