“AI has more potential than any other modern technology to increase human productivity, creativity and quality of life,” Mark Zuckerberg wrote in an open letter last week, and he meant it. Staying true to this open-science approach, Meta has released the code and model weights of its latest model, SAM 2, under a permissive Apache 2.0 license, along with the SA-V dataset of over 51,000 videos and 600,000 masklets. Unlike its predecessor, SAM 2 excels at real-time promptable segmentation, accurately segmenting objects in both images and videos, even ones it hasn't been specifically trained on. This zero-shot capability makes SAM 2 exceptionally versatile: it can enhance video effects and creative projects by integrating with generative video models, and it can streamline the annotation of visual data, which could significantly accelerate the development of advanced computer vision systems.
Continue reading as we explore the capabilities of SAM 2 and guide you on getting started with this model on Hyperstack.
SAM 2, short for Segment Anything Model 2, is Meta's latest advancement in computer vision technology. While the previous SAM model was known for segmenting objects in images, SAM 2 extends this capability to videos, creating a unified model for real-time promptable object segmentation across static and moving visual content. What sets SAM 2 apart is its ability to perform this task with remarkable accuracy and speed, even on objects and visual domains it has not seen previously.
With SAM 2, you can build better computer vision systems. SAM 2 outperforms its predecessor SAM with new features including:
One of SAM 2's most significant advancements is its ability to handle images and videos within a single and unified architecture. The model treats images as short videos with a single frame, allowing seamless application across different visual media. This unified approach enables consistent performance and user experience whether working with static images or complex video sequences.
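To give a sense of this unified design, here is a minimal sketch of loading one checkpoint and driving both the image and the video predictor from it. The builder functions and the sam2_hiera_l.yaml config / sam2_hiera_large.pt checkpoint names follow the repository's example notebooks at the time of writing and may differ in newer releases.
# Minimal sketch: one checkpoint backs both the image and the video predictor
from sam2.build_sam import build_sam2, build_sam2_video_predictor
from sam2.sam2_image_predictor import SAM2ImagePredictor

CFG, CKPT = "sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt"

image_predictor = SAM2ImagePredictor(build_sam2(CFG, CKPT, device="cuda"))
video_predictor = build_sam2_video_predictor(CFG, CKPT, device="cuda")
# Internally, an image is simply treated as a one-frame video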
SAM 2 is designed for real-time operation, processing video frames at approximately 44 frames per second. This speed makes it suitable for live video applications, interactive editing, and other time-sensitive tasks where immediate feedback is crucial.
Building on SAM's foundation, SAM 2 allows users to specify objects of interest through various prompt types, including clicks, bounding boxes, or masks. These prompts can be applied to any frame in a video, with the model then propagating the segmentation across all frames. This interactive approach allows for precise control and refinement of segmentations.
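As an illustration, the sketch below follows the repository's video example notebook: a single foreground click on frame 0 is propagated through the rest of the clip. The frame directory, object ID and click coordinates are placeholders, and API names may change between releases.
import numpy as np
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt")

# init_state expects a directory of JPEG frames (00000.jpg, 00001.jpg, ...)
state = predictor.init_state(video_path="videos/my_clip")

# Prompt the model with one foreground click on frame 0
predictor.add_new_points(
    inference_state=state,
    frame_idx=0,
    obj_id=1,                                         # your own ID for this object
    points=np.array([[210, 350]], dtype=np.float32),  # (x, y) click location
    labels=np.array([1], dtype=np.int32),             # 1 = foreground, 0 = background
)

# Propagate the segmentation across all remaining frames
video_masks = {}
for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
    video_masks[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()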
To handle the temporal aspects of video segmentation, SAM 2 introduces a sophisticated memory mechanism. This consists of a memory encoder, a memory bank, and a memory attention module. These components allow the model to store and recall information about objects and user interactions across video frames, enabling consistent tracking and segmentation of objects throughout a video sequence.
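These components are internal to the model, so there is no public API to call them directly. Purely as a conceptual illustration (this is not SAM 2's actual code), a memory bank can be thought of as a bounded store of per-frame features that later frames attend to:
from collections import OrderedDict

class ToyMemoryBank:
    """Conceptual illustration only: keep encoded features for recent frames."""
    def __init__(self, max_frames=7):           # the limit here is arbitrary
        self.max_frames = max_frames
        self.frames = OrderedDict()              # frame_idx -> encoded features

    def store(self, frame_idx, features):
        self.frames[frame_idx] = features
        if len(self.frames) > self.max_frames:
            self.frames.popitem(last=False)      # drop the oldest memory

    def context(self):
        # The real model runs cross-attention over its memories; here we just
        # hand them back as conditioning context for the current frame.
        return list(self.frames.values())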
SAM 2 includes an "occlusion head" that predicts whether an object of interest is present in each frame. This feature allows the model to handle scenarios where objects become temporarily hidden or move out of view, a common challenge in video segmentation tasks.
The model can output multiple valid masks when faced with ambiguous prompts, such as a click that could refer to a part of an object or the entire object. SAM 2 handles this by creating multiple masks and selecting the most confident one for propagation if the ambiguity isn't resolved through additional prompts.
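For example, the image predictor can return several candidate masks for a single click and score each one. The sketch below, adapted from the repo's image example notebook (with a placeholder image path and click location), keeps the highest-scoring candidate.
import numpy as np
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor(
    build_sam2("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt", device="cuda")
)
predictor.set_image(np.array(Image.open("my_image.jpg").convert("RGB")))

# A single click is ambiguous: it could mean a part of the object or the whole object
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),  # (x, y) pixel coordinates of the click
    point_labels=np.array([1]),           # 1 = foreground click
    multimask_output=True,                # return several candidate masks
)
best_mask = masks[np.argmax(scores)]      # keep the most confident candidate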
SAM 2 has shown impressive performance across different benchmarks and real-world scenarios. Check out the performance results of SAM 2 below:
While SAM 2 is a major advancement in object segmentation technology, it does have some limitations:
SAM 2 is being released as an open-source model, with Meta continuing to live up to its commitment to open science and collaborative AI development. The SAM 2 code and model weights are shared under a permissive Apache 2.0 license, while the SAM 2 evaluation code is released under a BSD-3 license. This allows researchers and developers to not only use the model but also thoroughly assess its performance and compare it with other solutions.
The open-source release of SAM 2 also includes:
“Open source will ensure that more people around the world have access to the benefits and opportunities of AI, that power isn’t concentrated in the hands of a small number of companies, and that the technology can be deployed more evenly and safely across society”
- Mark Zuckerberg, Founder and CEO, Meta
To get started with SAM 2, you can leverage the high-performance computing capabilities offered by Hyperstack. Here's how to install SAM 2:
SAM 2 benefits from high-end GPUs for optimal performance, especially when processing videos or large batches of images. Given the model's size, its GPU requirements can be demanding, so we recommend powerful GPUs like the NVIDIA A100 and the NVIDIA H100 PCIe. Hyperstack offers access to these GPUs with cost-effective GPU pricing, so you can run SAM 2 efficiently without investing in expensive hardware.
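Once you have a GPU VM, the setup below mirrors what the official example notebooks do before inference on CUDA devices: run in bfloat16 autocast and enable TF32 on Ampere-or-newer GPUs such as the A100 and H100. Treat it as a sketch of the recommended settings rather than a hard requirement.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
if device == "cuda":
    # Run inference in bfloat16 for speed and lower memory use
    torch.autocast(device_type="cuda", dtype=torch.bfloat16).__enter__()
    if torch.cuda.get_device_properties(0).major >= 8:
        # Ampere (A100) and Hopper (H100) GPUs support TF32 matmuls
        torch.backends.cuda.matmul.allow_tf32 = True
        torch.backends.cudnn.allow_tf32 = True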
To leverage Hyperstack's high-performance GPUs, you'll need to set up your environment on our platform. Check out our platform video demo to get started with Hyperstack.
The steps below will also download the SAM 2 model checkpoints.
# Install python3-pip, python3-venv
sudo apt-get install python3-pip python3-venv -y
# Configure virtual environment
python3 -m venv venv
source venv/bin/activate
# Clone the repository
git clone https://github.com/facebookresearch/segment-anything-2.git
cd segment-anything-2
# Install requirements (including demo requirements)
pip install -e .
# Install demo requirements (adjusted because the command in the repo 'pip install -e ".[demo]"' is broken)
pip install jupyter==1.0.0 matplotlib==3.9.0 opencv-python==4.10.0.84
# Download checkpoints
cd checkpoints
./download_ckpts.sh
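After the checkpoints have downloaded, you can optionally sanity-check the installation with a few lines of Python. The config and checkpoint names below assume the large model and the default layout of the cloned repo.
# Run from the segment-anything-2 directory with the virtual environment active
import torch
from sam2.build_sam import build_sam2

model = build_sam2("sam2_hiera_l.yaml",
                   "checkpoints/sam2_hiera_large.pt",
                   device="cuda" if torch.cuda.is_available() else "cpu")
print(type(model).__name__, "loaded on", next(model.parameters()).device)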
To run segmentation on images or videos, see this video example notebook and this image example notebook. Please note that if you want to run these notebooks, you need to set up a Jupyter Notebook server. Follow the steps below to learn how to run SAM 2:
1. Run a Jupyter Notebook server by running the commands below in your VM. Copy the text after '?token=' from the URL printed in your terminal (e.g. http://localhost:8888/lab?token=a919e699f41dcc0c34754464cbaf55e0faa59bde96361b85).
source /home/ubuntu/venv/bin/activate
jupyter lab
2. Open an SSH tunnel by running this command from your local terminal (NOT in your VM). Replace the path to your keypair and the VM IP address accordingly.
ssh -i [path-to-your-keypair] -L 8888:localhost:8888 ubuntu@[vm-ip-address]
3. Open http://localhost:8888 in your browser to view your notebooks
4. The Jupyter Notebook server will ask you for your token (see step 1). You only need the text after '?token=' (e.g. a919e699f41dcc0c34754464cbaf55e0faa59bde96361b85).
5. Go to the 'notebooks' directory and execute the example notebook you need. Please note that you may need to add a pip install cell at the top for the required dependencies.
The key difference is that SAM 2 extends object segmentation capabilities to videos, while the original SAM was designed for images only. SAM 2 also offers improved performance and speed for image segmentation tasks.
Yes, SAM 2 is an open-source model released by Meta. The SAM 2 code and model weights are shared under a permissive Apache 2.0 license.
SAM 2 is designed for real-time operation, processing approximately 44 frames per second, making it suitable for live video applications and interactive editing.
No, SAM 2 features zero-shot generalisation, allowing it to segment objects it hasn't seen during training without custom adaptation.
Yes, SAM 2 has potential applications in various scientific fields, including medical imaging. It has been used to segment cellular images and aid in tasks like skin cancer detection.