<img alt="" src="https://secure.insightful-enterprise-intelligence.com/783141.png" style="display:none;">

NVIDIA H100 SXMs On-Demand at $3.00/hour - Reserve from just $2.10/hour. Reserve here

Deploy 8 to 16,384 NVIDIA H100 SXM GPUs on the AI Supercloud. Learn More

|

Published on 29 Nov 2024

Get Started with the Hyperstack LLM Inference Toolkit

TABLE OF CONTENTS

updated

Updated: 29 Nov 2024

NVIDIA H100 SXM On-Demand

Sign up/Login

The Hyperstack LLM Inference Toolkit is an open-source tool designed to simplify the deployment, management and testing of Large Language Models (LLMs) using Hyperstack. Our toolkit is ideal for developers and researchers who need fast prototyping, intuitive API access and robust performance tracking. With seamless deployment options, streamlined proxy APIs and comprehensive model management, this toolkit accelerates LLM workflows for local and cloud-based setups.

Haven't heard about our LLM Inference Toolkit before? Check out our demo video below.

Want to get started with this toolkit? See our tutorial below. This tutorial provides a step-by-step guide to using the Hyperstack LLM Inference Toolkit, from setup to deploying your first LLM. You will need the files inside our Hyperstack LLM Inference Toolkit Repository/

LLM Inference Toolkit Tutorial Overview

Here's a quick overview of the steps involved in this tutorial:

  • Prepare Your Environment: Install required tools like Docker.
  • Choose Deployment Type: Opt for local or cloud deployment based on your needs.
  • Deploy the Toolkit: Execute deployment commands to set up the toolkit.
  • Interact with the Toolkit: Use the UI to deploy models, manage APIs, and monitor performance.

Prerequisites

Before proceeding, ensure you do the following:

  1. Download or clone our Hyperstack LLM Inference Toolkit repository here.
  2. Install Docker
    1. It is recommended to install Docker desktop (see instructions here).
  3. Install Docker Compose: See instructions here.
    1. Not needed if you have installed Docker desktop
  4. Install the Make utility:
    1. Windows: recommended to use Chocolately and then run command: choco install make
    2. Linux: run command: sudo apt-get install build-essential
    3. MacOs: Recommended to use Homebrew and run command: brew install make 

How to Deploy the LLM Toolkit?

After installing the prerequisites, choose between local or cloud deployment based on your needs.

a. Local Deployment

To deploy the toolkit in a local environment (mostly to quickly get started and/or local development), follow these steps:

  1. Rename the .env.example file to .env, which contains app configuration for the local environment. You can find this file in the repository shared above.
  2. Change at least the following environment variables in the .env file:
    • HYPERSTACK_API_KEY: Your Hyperstack API key (this will be needed if you decide to deploy any models on the 'Models' page through Hyperstack).
    • APP_PASSWORD: The password for the User Interface (UI)
  3. Build and start app services containers in attached mode:
    make dev-up

b. Cloud Deployment 

To deploy the toolkit in a cloud environment, follow the steps below. Please note that this requires a Hyperstack account with an environment and keypair configured. See our Getting Started tutorial here.

  1. Rename the deployment/build.manifest.yaml.example file to deployment/build.manifest.yaml, which contains app configuration for the cloud environment. You can find this file in the repository shared above.
  2. Change at least the following environment variables in the .env file:
    • env.HYPERSTACK_API_KEY: Your Hyperstack API key
    • env.APP_PASSWORD: The password for the User Interface (UI)
    • proxy_instance.key_name: Hyperstack keypair name for the Proxy VM
    • inference_engine_vms.[idx].key_name: Hyperstack keypair name for the inference VMs
  3. Build and start the deployment:
    make deploy-app
  4. Wait for the Proxy VM and the Inference VMs to boot up. Note: first the Proxy VM will be created and configured. Afterwards, the Inference VMs will be created and configured.
  5. By default, the toolkit deploys the VMs described below.  You can change the default deployment by following the instructions here.
    1. A proxy VM for routing requests and hosting the UI.
    2. Two inference VMs with preloaded models:
      • NousResearch/Meta-Llama-3.1-8B-instruct
      • microsoft/Phi-3-medium-128k-instruct

For more instructions on different build commands, check out our user documentation on GitHub.

How to use the LLM Inference Toolkit

To interact with the toolkit using the User Interface, you can open the LLM Inference Toolkit via:

After opening the URL above, you need to enter the App Password you have set in the previous step.

Please note: the current implementation does not utilize SSL for secure data transmission. Until SSL support is added, exercise caution and avoid transmitting sensitive or confidential data through the toolkit over the internet.

a. Playground

After opening the LLM Inference Toolkit, you can select "Playground" to get started with your LLMs. This page allows you to interact with one of your deployed LLMs model.

Instructions

  1. Select a Model: Choose a model from the sidebar to chat with.
  2. Enter Your Message: Type your message in the chat input box at the bottom of the page.
  3. Configure Assistant: Click the "Configure Assistant" button in the sidebar to change advanced settings such as temperature, max tokens, and presence penalty.
  4. Chat with Assistant: Once you've selected a model and entered your message, click the ">" button to start chatting with the assistant.

Additional Tips

  • You can stream the response from the API by checking the "Stream results" checkbox in the configure assistant dialog.
  • You can reset all previous conversations by clicking on the "Reset messages" button in the sidebar.
  • By default, the User_ID = 0 (default user) is used for interacting with the API.

b. Models page

This page allows you to view all deployed LLMs, add new models, and manage replicas for each model.

Instructions

  1. View Models: See all your deployed models listed on this page.
  2. Add New Model: Use the "Add New Model" section to deploy a new model.
    1. For more info on deploying a new LLM, check out our user documentation on GitHub.
  3. Manage Replicas: For each model, you can add, edit, or delete replicas.
  4. Add Replica: Click the "Add" button under a model to create a new replica. You can either provide an existing endpoint or deploy a new replica on Hyperstack.
  5. Delete Replica: Click the "Delete" button next to a replica to remove it. See warning note below.
  6. Delete Model: Click the "Delete" button next to a model to remove it. See warning note below.

Additional Tips

  • You can refresh the status of the replica by clicking on the 'Refresh' icon on the right side of the model name.
  • Make sure your model name matches the model name used by the inference engine such as vLLM. For example: NousResearch/Meta-Llama-3.1-8B-Instruct.
  • When deploying a new replica on Hyperstack, make sure to view the help (?) icon for more information.

Warning

  • Deleting a model or replica will only delete it from the database. Any cloud resources may still exist and incur billing costs.

c. API key management

This page allows you to create an API key for your user ID and view existing API keys. This API key can be used to interact with the chat/completion API endpoint.

Instructions

  1. Enter User ID: Input the user ID for which you want to generate an API key.
  2. Generate API Key: Click the "Generate API key" button to create a new API key.
  3. Delete API Key: Click the "Delete API key" button to remove an existing API key.
  4. View API Keys: Existing API keys will be displayed in a table below.

Additional Tips

  • API keys can only be deleted if they have not been used yet.
  • API keys that have been used can only be disabled.

d. Monitoring

This page allows you to view and interact with the data stored in your databases.

User Instructions

  1. Select a Table: Choose a table from the dropdown list to view its data.
  2. View Data: The data from the selected table will be displayed in a table format.
  3. Download CSV: You can download the data as a CSV file by clicking the download button.

Additional Tips

  • You can refresh the data by re-selecting the table from the dropdown list.
  • Metrics data may not be stored correctly for locally or externally deployed models.

Conclusion

We hope this tutorial helps you make the most of your LLM model with the Hyperstack LLM Inference Toolkit. Stay tuned for more developer-friendly resources to simplify your AI workflows. We’re here to make AI easy for you!

Want to Get Started with Latest Hyperstack Features? Check out our tutorials below:

FAQs

What is the Hyperstack LLM Inference Toolkit?

The LLM Inference Toolkit is an open-source tool that simplifies the deployment, management, and testing of Large Language Models (LLMs) using Hyperstack.

What models does the Hyperstack LLM Inference Toolkit support?

It supports open-source LLMs with flexible deployment options for local or cloud setups.

Does the toolkit include pre-built integrations for vLLM?

Yes, it provides proxy API integrations optimized for vLLM, ensuring high throughput and low latency.

Can I track real-time model performance?

Yes, our toolkit offers real-time tracking, including visualised data and usage insights.

Is the LLM Inference Toolkit secure?

Yes, it includes built-in password protection and API key enforcement for security.

Subscribe to Hyperstack!

Enter your email to get updates to your inbox every week

Get Started

Ready to build the next big thing in AI?

Sign up now
Talk to an expert

Share On Social Media

18 Dec 2024

Meta has surprisingly released Llama 3.3, marking a major leap in open-source AI. Llama ...

25 Nov 2024

Adopting open source across every industry is no longer an “if”, but a “what, when and ...