Advanced AI models like Llama 3.1-70B, Qwen 2-72B and FLUX.1 require vast computing resources, multiple GPUs and high-throughput networking to run efficiently. Kubernetes provides an ideal platform for managing these complex workloads, and with Hyperstack’s integration of Kubernetes, you get an excellent solution for orchestrating the infrastructure they need. Want to get started with LLMs and Kubernetes on Hyperstack? Check out our quick guide below.
Please note: Hyperstack's on-demand Kubernetes is currently in beta testing.
Before deploying your Large Language Model (LLM) on Hyperstack Kubernetes, you will need to have the following:
New to Hyperstack? Follow these steps to get started:
Register for a Hyperstack account: Sign up at https://console.hyperstack.cloud.
Sign in: Log in to your Hyperstack account using your registered email and password to access the platform.
Activate your account: Follow the instructions in the activation email sent to you to complete the setup and verify your account.
Add credit: Top up your account balance to enable resource usage. Check out our documentation to learn more about billing.
Create an environment: Start by setting up your first environment on Hyperstack, selecting the configurations suited to your project needs.
Create an SSH key: Generate an SSH key pair for secure server access.
Create an API key: Generate your API key in the dashboard to authenticate your API requests.
To deploy an LLM on Hyperstack Kubernetes, you first need to create a Hyperstack Kubernetes cluster. Check out our comprehensive guide on setting up and managing your Kubernetes environment for LLMs.
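Once you have an API key, clusters can also be managed programmatically. As a minimal sketch only (the api_key header and the /core/clusters path below are assumptions based on common Hyperstack API conventions; check the API reference for the authoritative endpoint), listing your clusters might look like this:
curl -s https://infrahub-api.nexgencloud.com/v1/core/clusters \
  -H "api_key: YOUR_API_KEY"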
On Hyperstack, deploying LLMs with Kubernetes is a straightforward process. Follow the steps below to get started:
Configure vLLM
First, let’s set up and configure vLLM on the Kubernetes cluster. Run the commands below from a machine with kubectl access to your cluster, such as the master node virtual machine. Please note: the instructions below assume you have created a Kubernetes cluster with at least 4 worker nodes. Start by creating a dedicated namespace for the vLLM deployment:
kubectl create ns vllm-ns
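Optionally, you can confirm that the worker nodes expose their GPUs to Kubernetes before deploying (this assumes the NVIDIA device plugin is running on the cluster so that nvidia.com/gpu resources are advertised):
kubectl get nodes
kubectl describe nodes | grep nvidia.com/gpu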
Next, define a Kubernetes Deployment that uses the vllm/vllm-openai:latest Docker image, running the vLLM API server with the NousResearch/Meta-Llama-3-8B-Instruct model. Each replica requests one GPU, so the 4 replicas below can be scheduled one per GPU worker node.
cat <<EOF > vllm_deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: vllm-app
  name: vllm
  namespace: vllm-ns
spec:
  replicas: 4
  selector:
    matchLabels:
      app: vllm-app
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: vllm-app
    spec:
      containers:
      - command:
        - python3
        - -m
        - vllm.entrypoints.openai.api_server
        - --model
        - NousResearch/Meta-Llama-3-8B-Instruct
        image: vllm/vllm-openai:latest
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 8000
            scheme: HTTP
          initialDelaySeconds: 240
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        name: vllm-openai
        ports:
        - containerPort: 8000
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 8000
            scheme: HTTP
          initialDelaySeconds: 240
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /root/.cache/huggingface
          name: cache-volume
      volumes:
      - emptyDir: {}
        name: cache-volume
EOF
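Next, create a ClusterIP Service so that the vLLM pods can be reached at a stable address inside the cluster: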
cat <<EOF > vllm_service.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: vllm-app
  name: vllm-openai-svc
  namespace: vllm-ns
spec:
  ports:
  - port: 8000
    protocol: TCP
    targetPort: 8000
  selector:
    app: vllm-app
  type: ClusterIP
EOF
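Apply both manifests to the cluster: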
kubectl apply -f vllm_deployment.yaml
kubectl apply -f vllm_service.yaml
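Confirm that the deployment has been created and check its rollout status: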
kubectl describe deployments -n vllm-ns
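The first start can take several minutes while each pod downloads the model weights into its cache volume (the probes above allow up to 240 seconds before the first health check). You can watch the pods until they report Ready:
kubectl get pods -n vllm-ns -w
Once the pods are ready, forward the service port to your local machine so the API is reachable at localhost:8000: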
kubectl port-forward svc/vllm-openai-svc 8000:8000 -n vllm-ns
Finally, test the deployed model by sending a request to the vLLM API through the forwarded port.
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NousResearch/Meta-Llama-3-8B-Instruct",
    "prompt": "Explain the benefits of Kubernetes.",
    "max_tokens": 50
  }'
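A successful request returns a JSON response similar to the following (the exact values will vary):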
{
  "id": "example-id",
  "object": "text_completion",
  "created": 1612802563,
  "model": "NousResearch/Meta-Llama-3-8B-Instruct",
  "choices": [
    {
      "text": "Kubernetes provides automated deployment, scaling, and management of containerized applications. It improves resource utilization and enables easy scaling.",
      "index": 0,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "total_tokens": 27
  }
}
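Because vLLM’s API server is OpenAI-compatible, you can also query the chat completions endpoint, which applies the model’s chat template to your messages:
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NousResearch/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Explain the benefits of Kubernetes."}],
    "max_tokens": 50
  }'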
For a comprehensive step-by-step deployment guide, check out our documentation on deploying LLMs with Kubernetes on Hyperstack.
Kubernetes offers scalable and cost-efficient solutions for Generative AI workloads like LLM training and inference. Its ability to auto-scale, optimise resource usage and integrate with a wide range of tools makes it ideal for AI.
Hyperstack streamlines Kubernetes for AI by offering optimised VM images, shared storage via a CSI driver and single API requests to deploy or delete clusters. An upcoming auto-scaling feature ensures your resources match AI workload demands efficiently.
Read more about Kubernetes on Hyperstack here.
Kubernetes offers a flexible, scalable and highly efficient platform for deploying and managing LLM workloads. Hyperstack makes this process easier by providing optimised GPU resources, high-speed networking with SR-IOV, and pre-built environments tailored for AI and ML workloads.
Hyperstack supports a range of powerful GPUs optimised for demanding AI workloads like the NVIDIA A100 and NVIDIA H100 PCIe.
You can create a Hyperstack Kubernetes cluster by following the detailed instructions in the Hyperstack documentation.
An API key is required to authenticate your API requests when interacting with Hyperstack resources programmatically.
You can test your deployed model by sending a request to the vLLM API and checking the JSON response for the generated text output.