<img alt="" src="https://secure.insightful-enterprise-intelligence.com/783141.png" style="display:none;">

NVIDIA H100 SXMs On-Demand at $3.00/hour - Reserve from just $2.10/hour. Reserve here

Deploy 8 to 16,384 NVIDIA H100 SXM GPUs on the AI Supercloud. Learn More

|

Published on 19 Nov 2024

How to Get Started with LLMs and Kubernetes on Hyperstack: A Quick Start Guide

TABLE OF CONTENTS

updated

Updated: 19 Nov 2024

NVIDIA H100 SXM On-Demand

Sign up/Login

Advanced LLMs like Llama 3.1-70B, Qwen 2-72B, FLUX.1 and more require vast computing resources, multiple GPUs and high-throughput networking to function efficiently. Kubernetes provides an ideal platform to manage these complex workflows. With Hyperstack’s integration of Kubernetes, you get an excellent solution for orchestrating the complex infrastructure needed for these LLMs. Want to get started with LLMs and Kubernetes on Hyperstack, check out our quick guide below. 

Please note: Hyperstack's on-demand Kubernetes is currently in beta testing.

nexgencloud__a1e71827-dbb8-41c2-b7bb-23c97b77be76 copy

Prerequisites

Before deploying your Large Language Model (LLM) on Hyperstack Kubernetes, you will need to have the following:

Hyperstack Account

New to Hyperstack? Follow these steps to get started:

  1. Register for a Hyperstack account: Sign up athttps://console.hyperstack.cloud

  2. Sign in: Log in to your Hyperstack account using your registered email and password to access the platform.

  3. Activate your account: Follow the instructions in the activation email sent to you to complete the setup and verify your account.

  4. Add credit: Top up your account balance to enable resource usage. Check out our documentation to learn more about billing.

  5. Create an environment: Start by setting up your first environment on Hyperstack, selecting the configurations suited to your project needs

  6. Create an SSH key: Generate an SSH key pair for secure server access.

  7. Create an API key: Generate your API key in the dashboard to authenticate your API requests.

nexgencloud_3d_32-bit_google_icon_isolated_in_a_white_backgroun_732de290-5ea9-4f46-9d3a-9a4d60951207 copy2Did you know you can now sign up at Hyperstack using your Google account? 

Kubernetes Cluster

To deploy an LLM on Hyperstack Kubernetes, you need to create a Hyperstack Kubernetes cluster. Check out our comprehensive guide on setting up and managing your Kubernetes environment for LLM. 

Deploy an LLM with vLLM on Hyperstack Kubernetes

On Hyperstack, deploying LLMs using Kubernetes is a straightforward process. Follow the below steps to get started:

Configure vLLM

First, let’s set up and configure vLLM on the Kubernetes cluster. Run the commands below inside your Virtual Machine. Please note: the instructions below assume you have created a Kubernetes cluster with at least 4 worker nodes.

  1. Create a vLLM Namespace:
    kubectl create ns vllm-ns
    
  2. Set deployment configuration:
    The following Kubernetes deployment configuration sets up a vllm application with four pod replicas using the vllm/vllm-openai:latest Docker image, running the vLLM API server with the NousResearch/Meta-Llama-3-8B-Instruct model.
    cat < vllm_deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: vllm-app
      name: vllm
      namespace: vllm-ns
    spec:
      replicas: 4
      selector:
        matchLabels:
          app: vllm-app
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          labels:
            app: vllm-app
        spec:
          containers:
          - command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            - --model
            - NousResearch/Meta-Llama-3-8B-Instruct
            image: vllm/vllm-openai:latest
            imagePullPolicy: Always
            livenessProbe:
              failureThreshold: 3
              httpGet:
                path: /health
                port: 8000
                scheme: HTTP
              initialDelaySeconds: 240
              periodSeconds: 5
              successThreshold: 1
              timeoutSeconds: 1
            name: vllm-openai
            ports:
            - containerPort: 8000
              protocol: TCP
            readinessProbe:
              failureThreshold: 3
              httpGet:
                path: /health
                port: 8000
                scheme: HTTP
              initialDelaySeconds: 240
              periodSeconds: 5
              successThreshold: 1
              timeoutSeconds: 1
            resources:
              limits:
                nvidia.com/gpu: "1"
              requests:
                nvidia.com/gpu: "1"
            volumeMounts:
            - mountPath: /root/.cache/huggingface
              name: cache-volume
          volumes:
          - emptyDir: {}
            name: cache-volume
    EOF
    
  3. Once the deployment YAML is ready, create a service to expose vLLM:
    cat < vllm_service.yaml
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app: vllm-app
      name: vllm-openai-svc
      namespace: vllm-ns
    spec:
      ports:
      - port: 8000
        protocol: TCP
        targetPort: 8000
      selector:
        app: vllm-app
      type: ClusterIP
    EOF
    
  4. Apply the deployment and service configurations:
    kubectl apply -f vllm_deployment.yaml
    kubectl apply -f vllm_service.yaml
    
  5. Verify the deployment status:
    kubectl describe deployments -n vllm-ns
  6. Forward the vLLM service to access it locally:
    kubectl port-forward svc/vllm-openai-svc 8000:8000 -n vllm-ns

Test the Model

Finally, the deployed LLM model can be tested by sending a request to the vLLM API.

  1. Send a Test Request:
    curl -X POST http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "NousResearch/Meta-Llama-3-8B-Instruct",
      "prompt": "Explain the benefits of Kubernetes.",
      "max_tokens": 50
    }'
    
  2. Check the Response: 
    You should receive a JSON response with generated text. For example:
    {
      "id": "example-id",
      "object": "text_completion",
      "created": 1612802563,
      "model": "NousResearch/Meta-Llama-3-8B-Instruct",
      "choices": [
        {
          "text": "Kubernetes provides automated deployment, scaling, and management of containerized applications. It improves resource utilization and enables easy scaling.",
          "index": 0,
          "finish_reason": "length"
        }
      ],
      "usage": {
        "total_tokens": 27
      }
    }
    

For a comprehensive step-by-step deployment guide, check out our documentation on deploying LLMs with Kubernetes on Hyperstack.

Why Use Kubernetes for Generative AI?

Kubernetes offers scalable and cost-efficient solutions for Generative AI workloads like LLM training and inference. Its ability to auto-scale, optimise resource usage and integrate with a wide range of tools makes it ideal for AI. 

How Hyperstack Supports Kubernetes Integration?

Hyperstack streamlines Kubernetes for AI by offering optimised VM images, shared storage via a CSI driver and single API requests to deploy or delete clusters. An upcoming auto-scaling feature ensures your resources match AI workload demands efficiently. 

Read more about Kubernetes on Hyperstack here.

Conclusion

Kubernetes offers a flexible, scalable and highly efficient platform for deploying and managing LLM workloads. Hyperstack makes this process easier by providing optimised GPU resources, high-speed networking with SR-IOV, and pre-built environments tailored for AI and ML workloads.

Want to try our Beta API? Check out the API documentation below!

FAQs

What GPUs are supported for LLM deployments

Hyperstack supports a range of powerful GPUs optimised for demanding AI workloads like the NVIDIA A100 and NVIDIA H100 PCIe.

How do I create a Kubernetes cluster on Hyperstack?

You can create a Hyperstack Kubernetes cluster by following the detailed instructions in the Hyperstack documentation.

Is it necessary to create an API key for using Hyperstack?

Yes, an API key is required to authenticate your API requests when interacting with Hyperstack resources programmatically.

How can I test my deployed LLM model?

You can test your deployed model by sending a request to the vLLM API and checking the JSON response for the generated text output.

Subscribe to Hyperstack!

Enter your email to get updates to your inbox every week

Get Started

Ready to build the next big thing in AI?

Sign up now
Talk to an expert

Share On Social Media

10 Jan 2025

In 2024, Meta released Llama 3.1 405B as a groundbreaking open-source AI model leading ...

18 Dec 2024

Meta has surprisingly released Llama 3.3, marking a major leap in open-source AI. Llama ...

29 Nov 2024

The Hyperstack LLM Inference Toolkit is an open-source tool designed to simplify the ...