Updated: 19 Nov 2024
Advanced generative AI models such as Llama 3.1-70B, Qwen 2-72B and FLUX.1 require vast computing resources, multiple GPUs and high-throughput networking to run efficiently. Kubernetes is well suited to managing these complex workflows, and Hyperstack’s Kubernetes integration lets you orchestrate the infrastructure these models need. To get started with LLMs and Kubernetes on Hyperstack, check out our quick guide below.
Please note: Hyperstack's on-demand Kubernetes is currently in beta testing.
Prerequisites
Before deploying your Large Language Model (LLM) on Hyperstack Kubernetes, you will need to have the following:
Hyperstack Account
New to Hyperstack? Follow these steps to get started:
- Register for a Hyperstack account: Sign up at https://console.hyperstack.cloud.
- Sign in: Log in to your Hyperstack account using your registered email and password to access the platform.
- Activate your account: Follow the instructions in the activation email sent to you to complete the setup and verify your account.
- Add credit: Top up your account balance to enable resource usage. Check out our documentation to learn more about billing.
- Create an environment: Start by setting up your first environment on Hyperstack, selecting the configurations suited to your project needs.
- Create an SSH key: Generate an SSH key pair for secure server access.
- Create an API key: Generate your API key in the dashboard to authenticate your API requests.
Did you know you can now sign up at Hyperstack using your Google account?
Kubernetes Cluster
To deploy an LLM on Hyperstack Kubernetes, you need to create a Hyperstack Kubernetes cluster. Check out our comprehensive guide on setting up and managing your Kubernetes environment for LLMs.
Deploy an LLM with vLLM on Hyperstack Kubernetes
On Hyperstack, deploying LLMs using Kubernetes is a straightforward process. Follow the steps below to get started:
Configure vLLM
First, let’s set up and configure vLLM on the Kubernetes cluster. Run the commands below inside your Virtual Machine. Please note: the instructions below assume you have created a Kubernetes cluster with at least 4 worker nodes.
- Create a vLLM Namespace:
kubectl create ns vllm-ns
- Set deployment configuration:
The following Kubernetes deployment configuration sets up a vLLM application with four pod replicas using the vllm/vllm-openai:latest Docker image, running the vLLM API server with the NousResearch/Meta-Llama-3-8B-Instruct model.
cat <<EOF > vllm_deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: vllm-app
  name: vllm
  namespace: vllm-ns
spec:
  replicas: 4
  selector:
    matchLabels:
      app: vllm-app
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: vllm-app
    spec:
      containers:
      - command:
        - python3
        - -m
        - vllm.entrypoints.openai.api_server
        - --model
        - NousResearch/Meta-Llama-3-8B-Instruct
        image: vllm/vllm-openai:latest
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 8000
            scheme: HTTP
          initialDelaySeconds: 240
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        name: vllm-openai
        ports:
        - containerPort: 8000
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 8000
            scheme: HTTP
          initialDelaySeconds: 240
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /root/.cache/huggingface
          name: cache-volume
      volumes:
      - emptyDir: {}
        name: cache-volume
EOF
- Once the deployment YAML is ready, create a service to expose vLLM:
cat <<EOF > vllm_service.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: vllm-app
  name: vllm-openai-svc
  namespace: vllm-ns
spec:
  ports:
  - port: 8000
    protocol: TCP
    targetPort: 8000
  selector:
    app: vllm-app
  type: ClusterIP
EOF
- Apply the deployment and service configurations:
kubectl apply -f vllm_deployment.yaml
kubectl apply -f vllm_service.yaml
- Verify the deployment status:
kubectl describe deployments -n vllm-ns
- Forward the vLLM service to access it locally:
kubectl port-forward svc/vllm-openai-svc 8000:8000 -n vllm-ns
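The numeric settings in the deployment above have a few practical consequences that are easy to work out by hand. The sketch below does that arithmetic in Python. It relies on the documented Kubernetes behaviour that a percentage maxSurge is rounded up and a percentage maxUnavailable is rounded down. The function names are hypothetical, the probe figure is an approximation that ignores kubelet jitter and the 1-second probe timeout, and the one-GPU-per-pod figure comes from the resource limits in the YAML.

```python
import math
from typing import Tuple


def rollout_bounds(replicas: int, max_surge_pct: int,
                   max_unavailable_pct: int) -> Tuple[int, int]:
    """Return (max pods during a rolling update, min available pods).

    Percentage maxSurge rounds up; percentage maxUnavailable rounds down.
    """
    surge = math.ceil(replicas * max_surge_pct / 100)
    unavailable = math.floor(replicas * max_unavailable_pct / 100)
    return replicas + surge, replicas - unavailable


def probe_failure_budget_s(initial_delay_s: int, period_s: int,
                           failure_threshold: int) -> int:
    """Approximate seconds after container start at which a probe that
    never succeeds has failed `failure_threshold` times in a row."""
    return initial_delay_s + (failure_threshold - 1) * period_s


# Values taken from vllm_deployment.yaml above.
replicas, gpus_per_pod = 4, 1
max_pods, min_available = rollout_bounds(replicas, 25, 25)

print("GPUs needed at steady state:", replicas * gpus_per_pod)     # 4
print("GPUs needed to schedule a surge pod:", max_pods * gpus_per_pod)  # 5
print("Pods guaranteed available during rollout:", min_available)  # 3
print("Approx. startup budget before liveness restart (s):",
      probe_failure_budget_s(240, 5, 3))                           # 250
```

Under these assumptions, a cluster with exactly four GPUs has no spare capacity for the surge pod during a rolling update, so the rollout would likely proceed by taking one old pod down first. The roughly 250-second budget also has to cover the model download into the emptyDir cache on first start; if that takes longer on your network, raising initialDelaySeconds is one option to consider.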
Test the Model
Finally, test the deployed LLM by sending a request to the vLLM API. Keep the port-forward command from the previous step running in a separate terminal while you do so.
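Because the pods only become ready once /health responds, a request sent too early may simply fail to connect. One way to avoid guessing is to poll the forwarded health endpoint first. The following is a minimal sketch using only the Python standard library; the function name, default timeout and polling interval are illustrative assumptions, and the URL assumes the port-forward to localhost:8000 shown earlier.

```python
import time
import urllib.request


def wait_until_healthy(url: str = "http://localhost:8000/health",
                       timeout_s: float = 300.0,
                       interval_s: float = 5.0,
                       opener=urllib.request.urlopen,
                       sleep=time.sleep) -> bool:
    """Poll `url` until it answers HTTP 200 or `timeout_s` elapses.

    `opener` and `sleep` are injectable so the loop can be exercised
    without a running server (hypothetical convenience, not a vLLM API).
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with opener(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Covers connection refused, resets and urllib's URLError /
            # HTTPError, all of which subclass OSError.
            pass
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling `wait_until_healthy()` returns True as soon as the server answers and False if it never does within the timeout, at which point `kubectl describe deployments -n vllm-ns` is a reasonable next place to look.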
- Send a Test Request:
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NousResearch/Meta-Llama-3-8B-Instruct",
    "prompt": "Explain the benefits of Kubernetes.",
    "max_tokens": 50
  }'
- Check the Response:
You should receive a JSON response with generated text. For example:
{
  "id": "example-id",
  "object": "text_completion",
  "created": 1612802563,
  "model": "NousResearch/Meta-Llama-3-8B-Instruct",
  "choices": [
    {
      "text": "Kubernetes provides automated deployment, scaling, and management of containerized applications. It improves resource utilization and enables easy scaling.",
      "index": 0,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "total_tokens": 27
  }
}
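The same request can also be sent from a script, which makes it easier to pull the generated text out of the JSON response. The sketch below mirrors the curl call using only the Python standard library. The helper names (`build_completion_request`, `extract_text`, `complete`) are hypothetical, and the response handling assumes the shape of the example response shown above.

```python
import json
import urllib.request


def build_completion_request(
        prompt: str,
        model: str = "NousResearch/Meta-Llama-3-8B-Instruct",
        max_tokens: int = 50,
        base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the same POST request as the curl example."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response_body: str) -> str:
    """Return the text of the first choice in a completions response."""
    data = json.loads(response_body)
    choices = data.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["text"]


def complete(prompt: str, **kwargs) -> str:
    """Send the request and return the generated text.

    Requires the port-forward to be active; not called at import time.
    """
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as resp:
        return extract_text(resp.read().decode("utf-8"))
```

With the port-forward running, `print(complete("Explain the benefits of Kubernetes."))` should print only the generated text rather than the full JSON document.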
For a comprehensive step-by-step deployment guide, check out our documentation on deploying LLMs with Kubernetes on Hyperstack.
Why Use Kubernetes for Generative AI?
Kubernetes offers scalable and cost-efficient solutions for Generative AI workloads like LLM training and inference. Its ability to auto-scale, optimise resource usage and integrate with a wide range of tools makes it ideal for AI.
How Does Hyperstack Support Kubernetes Integration?
Hyperstack streamlines Kubernetes for AI by offering optimised VM images, shared storage via a CSI driver and the ability to deploy or delete a cluster with a single API request. An upcoming auto-scaling feature is intended to match your resources to AI workload demands efficiently.
Read more about Kubernetes on Hyperstack here.
Conclusion
Kubernetes offers a flexible, scalable and highly efficient platform for deploying and managing LLM workloads. Hyperstack makes this process easier by providing optimised GPU resources, high-speed networking with SR-IOV, and pre-built environments tailored for AI and ML workloads.
Want to try our Beta API? Check out the API documentation below!
FAQs
What GPUs are supported for LLM deployments?
Hyperstack supports a range of powerful GPUs, such as the NVIDIA A100 and NVIDIA H100 PCIe, that are optimised for demanding AI workloads.
How do I create a Kubernetes cluster on Hyperstack?
You can create a Hyperstack Kubernetes cluster by following the detailed instructions in the Hyperstack documentation.
Is it necessary to create an API key for using Hyperstack?
Yes, an API key is required to authenticate your API requests when interacting with Hyperstack resources programmatically.
How can I test my deployed LLM?
You can test your deployed model by sending a request to the vLLM API and checking the JSON response for the generated text output.