Guide: Deploying LLMs with KubeAI on k3s (Ubuntu)

Seeweb allows you to deploy LLMs with KubeAI on k3s. This guide provides step-by-step instructions to deploy Hugging Face Large Language Models using KubeAI on a k3s Kubernetes cluster running on an Ubuntu Linux machine within Seeweb’s infrastructure. We'll use Qwen2.5-7B-Instruct as an example, but the same approach can be applied to other models. You will also learn how to access the model services via an SSH tunnel for local interaction.

Prerequisites

Before you begin, ensure your Ubuntu machine meets the following requirements:

  1. Ubuntu Linux Machine: A server or desktop running a recent version of Ubuntu (e.g., 20.04 LTS or 22.04 LTS).
  2. SSH Access: You can SSH into the machine.
  3. Sudo Privileges: You have sudo access.
  4. NVIDIA GPU: An NVIDIA GPU suitable for running the chosen Qwen model (Qwen2.5-7B-Instruct might need ~16GB+ VRAM for FP16. Quantized models may require less).
  5. NVIDIA Drivers: NVIDIA drivers correctly installed and nvidia-smi command working and showing your GPU.
  6. Internet Access: For downloading packages, Docker images, and models.
  7. Sufficient Resources: Adequate CPU cores, RAM (32GB-48GB+, depending on model precision and system overhead), and disk space.

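As a rough sanity check on the VRAM requirement, FP16 weights take about 2 bytes per parameter, and the KV cache and runtime add overhead on top. A quick back-of-the-envelope estimate (the 25% overhead factor is an illustrative assumption, not a vLLM figure):

```shell
# Rough FP16 VRAM estimate: params (in billions) x 2 bytes/param,
# plus ~25% assumed overhead for KV cache and runtime.
awk -v params_b=7 'BEGIN { printf "~%.1f GB VRAM\n", params_b * 2 * 1.25 }'
# prints: ~17.5 GB VRAM
```

This lines up with the ~16GB+ figure above; quantized variants (e.g. 4-bit) scale the per-parameter cost down accordingly.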

Section 1: Prepare the Ubuntu Machine

These steps prepare your Ubuntu system with necessary dependencies.

Update System Packages

sudo apt-get update
sudo apt-get upgrade -y

Verify NVIDIA Drivers

Ensure your NVIDIA drivers are installed and the GPU is recognized:

nvidia-smi

If this command fails or doesn’t show your GPU, you must install or troubleshoot your NVIDIA drivers before proceeding.

Install Docker

KubeAI and Kubernetes rely on a container runtime, typically Docker.

# Install prerequisites
sudo apt-get install -y apt-transport-https ca-certificates curl software-properties-common

# Add Docker's official GPG key
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

# Add the Docker apt repository
sudo add-apt-repository "deb [arch=$(dpkg --print-architecture)] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"

# Install Docker CE
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io

Troubleshooting Docker: If sudo systemctl status docker shows issues, check logs with sudo journalctl -xeu docker.service. Common issues include port conflicts or resource limitations.

Install NVIDIA Container Toolkit

This allows Docker (and subsequently Kubernetes) to use your NVIDIA GPUs.

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
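If the repository URL in the snippet above returns a 404, check what the `$distribution` variable actually expanded to on your system before retrying:

```shell
# Print the distro identifier used to build the NVIDIA repository URL
# (e.g. "ubuntu22.04"); it must match a directory NVIDIA publishes.
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
echo "$distribution"
```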

Test if Docker can see the GPU:

sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

This command should output the same nvidia-smi details as before, but from within a Docker container.

Install Helm

Helm is a package manager for Kubernetes, used to install KubeAI.

curl https://baltocdn.com/helm/signing.asc | gpg --dearmor | sudo tee /usr/share/keyrings/helm.gpg > /dev/null

sudo apt-get install apt-transport-https --yes

echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list

sudo apt-get update
sudo apt-get install -y helm
helm version

Section 2: Install and Configure k3s Kubernetes Cluster

k3s is a lightweight, certified Kubernetes distribution.

Install k3s

# Install K3s without Traefik (we don't need it)
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--disable=traefik" sh -

# Configure kubectl
mkdir -p $HOME/.kube
sudo cp /etc/rancher/k3s/k3s.yaml $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
export KUBECONFIG=$HOME/.kube/config
echo "export KUBECONFIG=$HOME/.kube/config" >> $HOME/.bashrc

# Verify K3s is running
kubectl get nodes

You should see your node in a Ready state.

Troubleshooting k3s:

If kubectl get nodes fails (e.g., connection refused):

    1. Check k3s service status: sudo systemctl status k3s
    2. View logs: sudo journalctl -u k3s -f
    3. Restart if necessary: sudo systemctl restart k3s

Install NVIDIA GPU Operator

For Kubernetes to effectively manage and schedule workloads on NVIDIA GPUs, the NVIDIA GPU Operator is highly recommended.

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=false \
  --wait

Note: We set driver.enabled=false because the driver is already installed on the host. The operator will handle the rest (container runtime, device plugin, etc.).

Verify the operator pods are running:

kubectl get pods -n gpu-operator

Wait until all pods are in a Running or Completed state. This can take several minutes. Once the operator is running, your nodes with GPUs should be labeled and have allocatable GPU resources.

Check with:

kubectl describe node $(kubectl get nodes -o jsonpath='{.items[0].metadata.name}') | grep nvidia.com/gpu

You should see nvidia.com/gpu listed under Allocatable and Capacity.

Section 3: Install KubeAI with NVIDIA GPU Support

Now we’ll install KubeAI with specific NVIDIA GPU configurations.

# Add the KubeAI helm repository
helm repo add kubeai https://www.kubeai.org
helm repo update

# Create namespace for KubeAI
kubectl create namespace kubeai

# Download the values file for NVIDIA GPU
curl -L -O https://raw.githubusercontent.com/substratusai/kubeai/refs/heads/main/charts/kubeai/values-nvidia-k8s-device-plugin.yaml

# Install KubeAI
helm upgrade --install kubeai kubeai/kubeai \
  -f values-nvidia-k8s-device-plugin.yaml \
  --namespace kubeai \
  --wait

If you need to use Hugging Face models that require authentication, add

--set secrets.huggingface.token=$HF_TOKEN

to the helm command above (after exporting your token with export HF_TOKEN=your-hugging-face-token).

Section 4: Deploy Qwen2.5-7B-Instruct Model

Let’s deploy the Qwen2.5-7B-Instruct model using KubeAI’s Model CRD (Custom Resource Definition).

cat <<EOF | kubectl apply -f -
apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: qwen2.5-7b
  namespace: kubeai
spec:
  features: [TextGeneration]
  owner: Qwen
  url: hf://Qwen/Qwen2.5-7B-Instruct
  engine: VLLM
  resourceProfile: nvidia-gpu-l40:1
  minReplicas: 1
EOF

Check if the model deployment is in progress:

kubectl get model -n kubeai
kubectl get pods -n kubeai

Wait for the model pod to reach Running state. This may take several minutes as the container downloads the model from Hugging Face.
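The Model resource also drives autoscaling. As a sketch of a variant of the manifest above, setting minReplicas to 0 lets KubeAI scale the model down to zero replicas when idle and spin it back up on the first request (minReplicas and maxReplicas are fields of the KubeAI Model CRD; verify the exact behavior against the KubeAI documentation for your installed version, and adjust resourceProfile to your GPU):

```yaml
# Hypothetical on-demand variant of the manifest above: scales to zero when idle.
apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: qwen2.5-7b-ondemand
  namespace: kubeai
spec:
  features: [TextGeneration]
  owner: Qwen
  url: hf://Qwen/Qwen2.5-7B-Instruct
  engine: VLLM
  resourceProfile: nvidia-gpu-l40:1
  minReplicas: 0   # no pod until a request arrives
  maxReplicas: 1
```

Note that the first request after a scale-from-zero will be slow, since the pod must start and load the model weights.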

Section 5: Access the Model Service

To access the model service from your local machine, you’ll need to set up both port forwarding on the server and an SSH tunnel from your local machine:

Step 1: Create SSH Tunnel

On your local machine, create an SSH tunnel to the server:

ssh -i ~/.ssh/your_key -L 8080:localhost:8080 root@your_server_ip

Replace your_key with your SSH key file name and your_server_ip with your server’s IP address.

Step 2: Set Up Port Forwarding on the Server

Once the SSH tunnel is active, on the remote server run:

kubectl -n kubeai port-forward svc/open-webui 8080:80

This forwards the Open WebUI service (the chat front end installed alongside KubeAI) to port 8080 on the server, which the SSH tunnel then exposes on port 8080 of your local machine.

Section 6: Testing the Model

With the port forwarding active, you can now:

    1. Open your web browser and navigate to http://localhost:8080
    2. Use the Open WebUI interface to interact with your Qwen2.5-7B-Instruct model

Alternatively, you can make API calls directly:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-7b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.7,
    "max_tokens": 200
  }'
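The response follows the OpenAI chat-completions shape. To pull out just the assistant's reply, you can pipe it through a short Python one-liner (shown here on a canned example response so the command is self-contained; in practice you would pipe the curl output instead of the echo):

```shell
# Extract choices[0].message.content from an OpenAI-style response.
# The echo'd JSON below is a stand-in for the real curl output above.
echo '{"choices":[{"message":{"role":"assistant","content":"The capital of France is Paris."}}]}' \
  | python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
# prints: The capital of France is Paris.
```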

Conclusion

You have now successfully deployed a Qwen2.5-7B-Instruct model on your Ubuntu machine using k3s and KubeAI. This setup provides a lightweight yet powerful infrastructure for running AI models locally with GPU acceleration.

For more advanced configurations, including multiple models, custom resource profiles, or integration with other services, refer to the official KubeAI documentation.
