Advanced AI models like Llama 3.1-70B, Qwen 2-72B and FLUX.1 require vast computing resources, multiple GPUs and high-throughput networking to run efficiently. Kubernetes provides an ideal platform for managing these complex workloads, and with Hyperstack’s integration of Kubernetes, you get an excellent solution for orchestrating the infrastructure they need. Want to get started with LLMs and Kubernetes on Hyperstack? Check out our quick guide below.
Please note: Hyperstack's on-demand Kubernetes is currently in beta testing.
Before deploying your Large Language Model (LLM) on Hyperstack Kubernetes, you will need to have the following:
New to Hyperstack? Follow these steps to get started:
Register for a Hyperstack account: Sign up at https://console.hyperstack.cloud.
Sign in: Log in to your Hyperstack account using your registered email and password to access the platform.
Activate your account: Follow the instructions in the activation email sent to you to complete the setup and verify your account.
Add credit: Top up your account balance to enable resource usage. Check out our documentation to learn more about billing.
Create an environment: Start by setting up your first environment on Hyperstack, selecting the configurations suited to your project needs.
Create an SSH key: Generate an SSH key pair for secure server access.
Create an API key: Generate your API key in the dashboard to authenticate your API requests.
To deploy an LLM on Hyperstack Kubernetes, you first need to create a Hyperstack Kubernetes cluster. Check out our comprehensive guide on setting up and managing your Kubernetes environment for LLMs.
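Once you have an API key, clusters can also be managed programmatically. As a minimal sketch only (the api_key header and the /core/clusters path below are assumptions based on common Hyperstack API conventions; check the API reference for the authoritative endpoint), listing your clusters might look like this:
curl -s https://infrahub-api.nexgencloud.com/v1/core/clusters \
  -H "api_key: YOUR_API_KEY"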
On Hyperstack, deploying LLMs with Kubernetes is a straightforward process. Follow the steps below to get started:
Configure vLLM
First, let’s set up and configure vLLM on the Kubernetes cluster. Run the commands below from a machine with kubectl access to your cluster, such as the master node virtual machine. Please note: the instructions below assume you have created a Kubernetes cluster with at least 4 worker nodes. Start by creating a dedicated namespace for the vLLM deployment:
kubectl create ns vllm-ns
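Optionally, you can confirm that the worker nodes expose their GPUs to Kubernetes before deploying (this assumes the NVIDIA device plugin is running on the cluster so that nvidia.com/gpu resources are advertised):
kubectl get nodes
kubectl describe nodes | grep nvidia.com/gpu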
Next, define a Kubernetes Deployment that uses the vllm/vllm-openai:latest Docker image, running the vLLM API server with the NousResearch/Meta-Llama-3-8B-Instruct model. Each replica requests one GPU, so the 4 replicas below can be scheduled one per GPU worker node.
cat <<EOF > vllm_deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: vllm-app
  name: vllm
  namespace: vllm-ns
spec:
  replicas: 4
  selector:
    matchLabels:
      app: vllm-app
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: vllm-app
    spec:
      containers:
      - command:
        - python3
        - -m
        - vllm.entrypoints.openai.api_server
        - --model
        - NousResearch/Meta-Llama-3-8B-Instruct
        image: vllm/vllm-openai:latest
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 8000
            scheme: HTTP
          initialDelaySeconds: 240
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        name: vllm-openai
        ports:
        - containerPort: 8000
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 8000
            scheme: HTTP
          initialDelaySeconds: 240
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /root/.cache/huggingface
          name: cache-volume
      volumes:
      - emptyDir: {}
        name: cache-volume
EOF
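Next, create a ClusterIP Service so that the vLLM pods can be reached at a stable address inside the cluster: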
cat <<EOF > vllm_service.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: vllm-app
  name: vllm-openai-svc
  namespace: vllm-ns
spec:
  ports:
  - port: 8000
    protocol: TCP
    targetPort: 8000
  selector:
    app: vllm-app
  type: ClusterIP
EOF
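Apply both manifests to the cluster: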
kubectl apply -f vllm_deployment.yaml
kubectl apply -f vllm_service.yaml
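Confirm that the deployment has been created and check its rollout status: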
kubectl describe deployments -n vllm-ns
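The first start can take several minutes while each pod downloads the model weights into its cache volume (the probes above allow up to 240 seconds before the first health check). You can watch the pods until they report Ready:
kubectl get pods -n vllm-ns -w
Once the pods are ready, forward the service port to your local machine so the API is reachable at localhost:8000: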
kubectl port-forward svc/vllm-openai-svc 8000:8000 -n vllm-ns
Finally, test the deployed model by sending a request to the vLLM API through the forwarded port.
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NousResearch/Meta-Llama-3-8B-Instruct",
    "prompt": "Explain the benefits of Kubernetes.",
    "max_tokens": 50
  }'
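A successful request returns a JSON response similar to the following (the exact values will vary):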
{
  "id": "example-id",
  "object": "text_completion",
  "created": 1612802563,
  "model": "NousResearch/Meta-Llama-3-8B-Instruct",
  "choices": [
    {
      "text": "Kubernetes provides automated deployment, scaling, and management of containerized applications. It improves resource utilization and enables easy scaling.",
      "index": 0,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "total_tokens": 27
  }
}
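Because vLLM’s API server is OpenAI-compatible, you can also query the chat completions endpoint, which applies the model’s chat template to your messages:
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NousResearch/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Explain the benefits of Kubernetes."}],
    "max_tokens": 50
  }'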
For a comprehensive step-by-step deployment guide, check out our documentation on deploying LLMs with Kubernetes on Hyperstack.
Kubernetes offers scalable and cost-efficient solutions for Generative AI workloads like LLM training and inference. Its ability to auto-scale, optimise resource usage and integrate with a wide range of tools makes it ideal for AI.
Hyperstack streamlines Kubernetes for AI by offering optimised VM images, shared storage via a CSI driver and single API requests to deploy or delete clusters. An upcoming auto-scaling feature ensures your resources match AI workload demands efficiently.
Read more about Kubernetes on Hyperstack here.
Kubernetes offers a flexible, scalable and highly efficient platform for deploying and managing LLM workloads. Hyperstack makes this process easier by providing optimised GPU resources, high-speed networking with SR-IOV, and pre-built environments tailored for AI and ML workloads.
Hyperstack supports a range of powerful GPUs optimised for demanding AI workloads like the NVIDIA A100 and NVIDIA H100 PCIe.
You can create a Hyperstack Kubernetes cluster by following the detailed instructions in the Hyperstack documentation.
An API key is required to authenticate your API requests when interacting with Hyperstack resources programmatically.
You can test your deployed model by sending a request to the vLLM API and checking the JSON response for the generated text output.