SAM 2: Meta's Next-Gen Model for Video and Image Segmentation

Nov 11, 2024

Introduction

The era has arrived where your phone or computer can understand the objects in an image, thanks to technologies like YOLO and SAM.

Meta’s Segment Anything Model (SAM) can instantly identify objects in images and separate them without needing to be trained on those specific images. It’s like a digital magician, able to understand each object in an image with just a wave of its virtual wand. After the successful release of Llama 3.1, Meta announced SAM 2 on July 29th, a unified model for real-time object segmentation in images and videos that has achieved state-of-the-art performance.

SAM 2 offers many real-world applications. For instance, its outputs can be integrated with generative video models to create innovative video effects and unlock new creative possibilities. Additionally, SAM 2 can enhance visual data annotation tools, speeding up the development of more advanced computer vision systems.

SAM 2 involves a task, a model, and data (Image Source)

What is Image Segmentation in SAM?

Segment Anything (SAM) introduces an image segmentation task where a segmentation mask is generated from an input prompt, such as a bounding box or point indicating the object of interest. Trained on the SA-1B dataset, SAM supports zero-shot segmentation with flexible prompting, making it suitable for a wide range of applications. Recent advancements have improved SAM’s quality and efficiency. HQ-SAM enhances output quality using a High-Quality output token and training on fine-grained masks. Efforts to increase efficiency for broader real-world use include EfficientSAM, MobileSAM, and FastSAM. SAM’s success has led to its application in fields like medical imaging, remote sensing, motion segmentation, and camouflaged object detection.
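To make the prompting idea concrete, here is a minimal sketch of box-prompted inference with the original SAM, assuming the segment-anything package is installed and a ViT-H checkpoint has been downloaded; the image path, checkpoint filename, and box coordinates are placeholders.

import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Load a pre-trained SAM model (checkpoint filename is an assumption).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# Embed the image once; different prompts can then be tried cheaply.
image = np.array(Image.open("example.jpg").convert("RGB"))
predictor.set_image(image)

# A bounding-box prompt in XYXY pixel coordinates around the object of interest.
masks, scores, _ = predictor.predict(
    box=np.array([100, 150, 400, 500]),
    multimask_output=False,
)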

Dataset Used

Many datasets have been developed to support the video object segmentation (VOS) task. Early datasets feature high-quality annotations but are too small for training deep learning models. YouTube-VOS, the first large-scale VOS dataset, covers 94 object categories across 4,000 videos. As algorithms improved and benchmark performance plateaued, researchers increased the VOS task difficulty by focusing on occlusions, long videos, extreme transformations, and both object and scene diversity. Current video segmentation datasets lack the breadth needed to "segment anything in videos," as their annotations typically cover whole objects within specific classes like people, vehicles, and animals. In contrast, the recently introduced SA-V dataset focuses not only on whole objects but also extensively on object parts, containing over an order of magnitude more masks. The SA-V dataset comprises 50.9K videos with 642.6K masklets.

Example videos from the SA-V dataset with masklets (Image Source)

Model Architecture

The model extends SAM to work with both videos and images. SAM 2 can use point, box, and mask prompts on individual frames to define the spatial extent of the object to be segmented throughout the video. When processing images, the model operates similarly to SAM. A lightweight, promptable mask decoder takes a frame’s embedding and any prompts to generate a segmentation mask. Prompts can be added iteratively to refine the masks.

Unlike SAM, the frame embedding used by the SAM 2 decoder isn’t taken directly from the image encoder. Instead, it’s conditioned on memories of past predictions and prompts from previous frames, including those from “future” frames relative to the current one. The memory encoder creates these memories based on the current prediction and stores them in a memory bank for later use. The memory attention operation takes the per-frame embedding from the image encoder and conditions it on the memory bank to produce an embedding that is passed to the mask decoder.

SAM 2 Architecture. For each frame, the segmentation prediction is based on the current prompt and any previously observed memories. Videos are processed in a streaming manner, with frames being analyzed one at a time by the image encoder, which cross-references memories of the target object from earlier frames. The mask decoder, which can also use input prompts, predicts the segmentation mask for the frame. Finally, a memory encoder transforms the prediction and image encoder embeddings (not shown in the figure) for use in future frames. (Image Source)

Here’s a simplified explanation of the different components and processes shown in the figure:

Image Encoder

  • Purpose: The image encoder processes each video frame to create feature embeddings, which are essentially condensed representations of the visual information in each frame.
  • How It Works: It runs only once per frame, no matter how many prompts are added later, which keeps it efficient. An MAE pre-trained Hiera encoder extracts features at different levels of detail to help with accurate segmentation.

Memory Attention

  • Purpose: Memory attention helps the model use information from previous frames and any new prompts to improve the current frame’s segmentation.
  • How It Works: It uses a series of transformer blocks to process the current frame’s features, compare them with memories of past frames, and update the segmentation based on both. This helps handle complex scenarios where objects might move or change over time. A simplified PyTorch sketch follows this list.
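The description above can be pictured as a transformer block that first self-attends over the current frame’s tokens and then cross-attends to memory tokens. The sketch below is purely illustrative; the dimensions, layer structure, and names are assumptions, not SAM 2’s actual implementation.

import torch
import torch.nn as nn

class MemoryAttentionBlock(nn.Module):
    """Illustrative block: self-attention over the current frame's tokens,
    then cross-attention to memory tokens from earlier frames."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, frame_tokens, memory_tokens):
        x = frame_tokens
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]                            # mix information within the frame
        h = self.norm2(x)
        x = x + self.cross_attn(h, memory_tokens, memory_tokens)[0]   # condition on stored memories
        return x + self.mlp(self.norm3(x))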

Prompt Encoder and Mask Decoder

  • Prompt Encoder: Similar to SAM’s, it takes input prompts (like clicks or boxes) to define what part of the frame to segment. It uses these prompts to refine the segmentation.
  • Mask Decoder: It works with the prompt encoder to generate accurate masks. If a prompt is unclear, it predicts multiple possible masks and selects the best one based on overlap with the object. A tiny selection sketch follows this list.
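When a single click is ambiguous (for example, it could mean a shirt or the whole person), the decoder returns several candidate masks together with confidence scores. A trivial selection helper might look like the following; the names and array shapes are assumptions for illustration only.

import numpy as np

def select_best_mask(masks, scores):
    """masks: (N, H, W) boolean candidates; scores: (N,) predicted quality.
    Keep the candidate the model is most confident about."""
    best = int(np.argmax(scores))
    return masks[best], float(scores[best])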

Memory Encoder and Memory Bank

  • Memory Encoder: This component creates memories of past frames by summarizing and combining information from previous masks and the current frame. This helps the model remember and use information from earlier in the video.
  • Memory Bank: It stores memories of past frames and prompts. This includes a queue of recent frames and prompts as well as high-level object information. It helps the model keep track of object changes and movements over time. A toy version is sketched after this list.
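A toy version of such a memory bank could be a FIFO queue of recent frame memories alongside memories for explicitly prompted frames, as sketched below. The real SAM 2 memory bank also stores object pointers and other details omitted here.

from collections import deque

class ToyMemoryBank:
    """Keeps memories for prompted frames plus a FIFO queue of recent frames."""
    def __init__(self, max_recent=6):
        self.prompted = {}                      # frame_idx -> memory features
        self.recent = deque(maxlen=max_recent)  # (frame_idx, memory features)

    def add(self, frame_idx, memory, prompted=False):
        if prompted:
            self.prompted[frame_idx] = memory   # prompted frames are never evicted
        else:
            self.recent.append((frame_idx, memory))

    def memories(self):
        # Everything returned here would be cross-attended by memory attention.
        return list(self.prompted.items()) + list(self.recent)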

Training

  • Purpose: The model is trained to handle interactive prompting and segmentation tasks using both images and videos.
  • How It Works: During training, the model learns to predict segmentation masks by interacting with sequences of frames. It receives prompts such as ground-truth masks, clicks, or bounding boxes to guide its predictions. This helps the model respond well to various types of input and improves its segmentation accuracy. A sketch of how such corrective clicks can be simulated follows this list.
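One common way to simulate interactive clicks during training (not necessarily SAM 2’s exact recipe) is to sample corrective points from the disagreement between the predicted and ground-truth masks:

import numpy as np

def sample_corrective_click(gt_mask, pred_mask, rng=np.random):
    """Pick a pixel where the prediction disagrees with the ground truth.
    Returns ((y, x), label) with label 1 for a missed region (positive click)
    and 0 for a wrongly included region (negative click), or None if perfect."""
    error = np.logical_xor(gt_mask.astype(bool), pred_mask.astype(bool))
    ys, xs = np.nonzero(error)
    if len(ys) == 0:
        return None
    i = rng.randint(len(ys))
    y, x = int(ys[i]), int(xs[i])
    return (y, x), int(gt_mask[y, x] > 0)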

Overall, the model is designed to efficiently handle long videos, remember information from past frames, and accurately segment objects based on interactive prompts.
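Putting the components together, the per-frame flow can be summarized with the following pseudocode-style sketch. All functions are hypothetical stand-ins for the modules described above (the memory bank could be the ToyMemoryBank from earlier), not the actual SAM 2 API.

def segment_video(frames, prompts, image_encoder, memory_attention,
                  mask_decoder, memory_encoder, memory_bank):
    """Streaming flow: encode each frame once, condition it on stored
    memories, decode a mask, then store a new memory for future frames."""
    masklet = {}
    for t, frame in enumerate(frames):
        features = image_encoder(frame)                          # run once per frame
        conditioned = memory_attention(features, memory_bank.memories())
        mask = mask_decoder(conditioned, prompts.get(t))         # prompts are optional
        memory_bank.add(t, memory_encoder(conditioned, mask), prompted=t in prompts)
        masklet[t] = mask
    return masklet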

SAM 2 Performance

SAM 2 Object segmentation

Comparison of SAM with SAM 2

SAM 2 significantly outperforms previous methods in interactive video segmentation, achieving superior results across 17 zero-shot video datasets while requiring about three times fewer human interactions. It surpasses SAM on its zero-shot benchmark suite while being six times faster, and excels on established video object segmentation benchmarks such as DAVIS, MOSE, LVOS, and YouTube-VOS. With real-time inference at roughly 44 frames per second, SAM 2 is 8.4 times faster than manual per-frame annotation with SAM.

How to install SAM 2?

!git clone https://github.com/facebookresearch/segment-anything-2.git
cd segment-anything-2
!pip install -e .

To use the SAM 2 predictor and run the example notebooks, jupyter and matplotlib are required and can be installed with:

pip install -e ".[demo]"

Download the checkpoints

cd checkpoints
./download_ckpts.sh

How to use SAM 2?

Image prediction

SAM 2 can be used on static images to segment objects. It offers image prediction APIs similar to SAM’s for these use cases. The SAM2ImagePredictor class provides a user-friendly interface for image prompting.

import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

checkpoint = "./checkpoints/sam2_hiera_large.pt"
model_cfg = "sam2_hiera_l.yaml"
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(<your_image>)
    masks, _, _ = predictor.predict(<input_prompts>)
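As a concrete (hypothetical) example, <your_image> could be a NumPy RGB array loaded with PIL, and <input_prompts> could be point coordinates with labels. This assumes SAM2ImagePredictor.predict accepts the same keyword arguments as the original SAM predictor (point_coords, point_labels, multimask_output), which is how the repository’s example notebooks use it:

import numpy as np
from PIL import Image

image = np.array(Image.open("example.jpg").convert("RGB"))  # placeholder image path
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),  # one foreground click (x, y)
        point_labels=np.array([1]),           # 1 = foreground, 0 = background
        multimask_output=True,                # return several candidate masks
    )
best_mask = masks[np.argmax(scores)]          # keep the highest-scoring candidate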

Video prediction

SAM 2 also provides a video predictor that supports multiple objects and uses an inference state to keep track of the interactions in each video.

import torch
from sam2.build_sam import build_sam2_video_predictor

checkpoint = "./checkpoints/sam2_hiera_large.pt"
model_cfg = "sam2_hiera_l.yaml"
predictor = build_sam2_video_predictor(model_cfg, checkpoint)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(<your_video>)
    frame_idx, object_ids, masks = predictor.add_new_points(state, <your prompts>)
    for frame_idx, object_ids, masks in predictor.propagate_in_video(state):
        ...
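One possible way to fill in the placeholders is sketched below: point init_state at a directory of JPEG frames, add a click on the first frame, and threshold the propagated mask logits at zero. The argument names follow the repository’s example notebook but should be treated as assumptions.

import numpy as np

video_segments = {}
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state("<your_video_dir>")      # directory of JPEG frames
    predictor.add_new_points(
        state, frame_idx=0, obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),  # hypothetical click (x, y)
        labels=np.array([1], dtype=np.int32),             # 1 = foreground
    )
    for frame_idx, object_ids, mask_logits in predictor.propagate_in_video(state):
        video_segments[frame_idx] = {
            obj_id: (mask_logits[i] > 0.0).cpu().numpy()  # logits -> binary mask
            for i, obj_id in enumerate(object_ids)
        }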


In the video above, we used SAM 2 to segment the coffee mug.

Summary

  • SAM 2 Overview: SAM 2 builds on SAM by extending its capabilities from images to videos. It uses prompts like clicks, bounding boxes, or masks to define object boundaries in each frame. A lightweight mask decoder processes these prompts and generates segmentation masks for each frame.
  • Video Processing: In videos, SAM 2 propagates the initial mask prediction across all frames to create a masklet. It allows for iterative refinement by adding prompts to subsequent frames.
  • Memory Mechanism: For video segmentation, SAM 2 uses a memory encoder, a memory bank, and a memory attention module. The memory encoder captures frame information and user interactions, enabling accurate predictions across frames. The memory bank holds information from previous and prompted frames, which the memory attention module uses to refine predictions.
  • Streaming Architecture: SAM 2 processes frames one at a time in a streaming fashion, making it efficient for long videos and real-time applications like robotics. It uses the memory attention module to incorporate past frame information into current predictions.
  • Handling Ambiguity: SAM 2 addresses ambiguity by generating multiple masks when prompts are unclear. If prompts do not resolve the ambiguity, the model selects the mask with the highest confidence for further use throughout the video.

SAM 2 Limitations

  • Performance and Improvement: While SAM 2 performs well at segmenting objects in images and short videos, its performance can still be improved, especially in challenging scenarios.
  • Challenges in Tracking: SAM 2 may struggle with drastic changes in camera viewpoint, long occlusions, crowded scenes, or lengthy videos. To address this, the model is designed to be interactive, allowing users to manually correct tracking with clicks on any frame to recover the target object.
  • Object Confusion: When the target object is specified in only one frame, SAM 2 might confuse it with similar objects. Additional refinement prompts in later frames can resolve these issues, ensuring the correct masklet is maintained throughout the video.
  • Multiple Object Segmentation: Although SAM 2 can segment multiple objects simultaneously, its efficiency decreases significantly because each object is processed separately, sharing only the per-frame embeddings. Incorporating shared object-level context could improve efficiency.
  • Fast-Moving Objects: For complex, fast-moving objects, SAM 2 might miss fine details, leading to unstable predictions across frames, as shown with the cyclist example. While adding prompts can partially alleviate this, predictions may still lack temporal smoothness since the model isn’t penalized for oscillating between frames.
  • Data Annotation and Automation: Despite advances in automatic masklet generation with SAM 2, human annotators are still needed to verify quality and identify frames that require correction. Future improvements could further automate the data annotation process to boost efficiency.

References

  • Segment Anything Github
  • SAM 2: Segment Anything Model 2
  • SAM 2: Segment Anything in Images and Videos (original research paper)
  • Introducing SAM 2: The next generation of Meta Segment Anything Model for videos and images