Building AI Agents with Multimodal Models (BAAMM) – Outline

Detailed Course Outline

1. Early and Late Fusion (1 hr)

  • Use camera and LiDAR data to predict object positions (a fusion sketch follows this list).
  • Convert various datatypes into a neural-network-ready form.
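
The sketch below contrasts early and late fusion in PyTorch on stand-in camera and LiDAR feature tensors. The layer sizes, feature dimensions, and the simple averaging used for late fusion are illustrative assumptions, not the course's lab code.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Concatenate modality features first, then run one shared network."""
    def __init__(self, cam_dim=256, lidar_dim=128, out_dim=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cam_dim + lidar_dim, 256),
            nn.ReLU(),
            nn.Linear(256, out_dim),  # e.g. a predicted (x, y, z) object position
        )

    def forward(self, cam, lidar):
        return self.net(torch.cat([cam, lidar], dim=-1))

class LateFusion(nn.Module):
    """Run one network per modality, then combine their predictions."""
    def __init__(self, cam_dim=256, lidar_dim=128, out_dim=3):
        super().__init__()
        self.cam_head = nn.Sequential(nn.Linear(cam_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))
        self.lidar_head = nn.Sequential(nn.Linear(lidar_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

    def forward(self, cam, lidar):
        # Average the per-modality predictions; a learned weighting is also common.
        return 0.5 * (self.cam_head(cam) + self.lidar_head(lidar))

cam = torch.randn(4, 256)    # batch of 4 camera feature vectors
lidar = torch.randn(4, 128)  # batch of 4 LiDAR feature vectors
print(EarlyFusion()(cam, lidar).shape)  # torch.Size([4, 3])
print(LateFusion()(cam, lidar).shape)   # torch.Size([4, 3])
```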

2. Intermediate Fusion (1 hr)

  • Explore the theory behind effective multimodal model architectures.
  • Train a contrastive pretraining model (a minimal training-step sketch follows this list).
  • Create a vector database.
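
As a reference for the contrastive pretraining lab, the minimal sketch below runs one CLIP-style training step: both modalities are projected into a shared embedding space, and a symmetric cross-entropy loss pulls matching pairs together. The linear stand-in encoders, feature dimensions, and fixed temperature are placeholder assumptions, not the course's actual setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins for real image and text backbones (assumed dimensions).
image_encoder = nn.Linear(512, 128)
text_encoder = nn.Linear(300, 128)

images = torch.randn(8, 512)  # batch of paired image features
texts = torch.randn(8, 300)   # batch of paired text features, same pairing order

# Project both modalities into a shared embedding space and L2-normalize.
img_emb = F.normalize(image_encoder(images), dim=-1)
txt_emb = F.normalize(text_encoder(texts), dim=-1)

# Pairwise cosine similarities, scaled by a fixed temperature.
logits = img_emb @ txt_emb.t() / 0.07

# Matching pairs sit on the diagonal; apply a symmetric cross-entropy loss.
targets = torch.arange(len(images))
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
loss.backward()

# The same embedding step populates the vector database: embed each item once,
# store the vectors, and retrieve neighbors by cosine similarity.
```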

3. Cross-modal Projection (2 hr)

  • Convert a language model into a Vision Language Model (VLM); a projection sketch follows this list.
  • Process PDFs with Optical Character Recognition (OCR) tools.
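
The sketch below shows the core of cross-modal projection: a small trainable MLP maps frozen vision-encoder patch features into the language model's embedding space so they can be prepended to the text sequence as visual tokens. The dimensions, number of patches, and two-layer projector are assumptions for illustration.

```python
import torch
import torch.nn as nn

vision_dim, llm_dim = 768, 4096  # assumed vision-encoder and LLM hidden sizes

# The projector is the only newly trained component; the encoders stay frozen.
projector = nn.Sequential(
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

patch_features = torch.randn(1, 196, vision_dim)  # frozen ViT patch features
visual_tokens = projector(patch_features)         # (1, 196, llm_dim)

text_embeddings = torch.randn(1, 32, llm_dim)     # embedded prompt tokens
# Prepend the projected visual tokens and feed the combined sequence to the LLM.
llm_input = torch.cat([visual_tokens, text_embeddings], dim=1)
print(llm_input.shape)  # torch.Size([1, 228, 4096])
```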

4. Model Orchestration (2 hr)

  • Analyze video using Cosmos Nemotron.
  • Use Video Search and Summarization (VSS) to answer user queries about video content.
  • Orchestrate with NVIDIA AI Blueprints (a schematic pipeline sketch follows this list).
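
The schematic sketch below shows the orchestration pattern such pipelines follow: split a video into chunks, caption each chunk with a VLM, keep the captions as an index, then answer a query by retrieving relevant captions and handing them to an LLM. Every function here is a local placeholder standing in for the model calls made through the Blueprint; none of it is the actual VSS or Cosmos Nemotron API.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    start_s: float
    end_s: float
    caption: str

def caption_chunk(video: str, start_s: float, end_s: float) -> str:
    # Placeholder for a VLM call that describes this segment of the video.
    return f"caption of {video} from {start_s}s to {end_s}s"

def answer_with_llm(question: str, context: list[str]) -> str:
    # Placeholder for an LLM call that reasons over the retrieved captions.
    return f"answer to {question!r} based on {len(context)} caption(s)"

def build_index(video: str, duration_s: float, chunk_s: float = 10.0) -> list[Chunk]:
    # Caption the video chunk by chunk and keep the results as a searchable index.
    chunks, t = [], 0.0
    while t < duration_s:
        end = min(t + chunk_s, duration_s)
        chunks.append(Chunk(t, end, caption_chunk(video, t, end)))
        t = end
    return chunks

def answer(question: str, index: list[Chunk]) -> str:
    # Naive keyword retrieval; a real pipeline would use vector search instead.
    words = set(question.lower().split())
    hits = [c.caption for c in index if words & set(c.caption.lower().split())]
    return answer_with_llm(question, hits or [c.caption for c in index])

index = build_index("warehouse.mp4", duration_s=30.0)
print(answer("What happened near the loading dock?", index))
```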

5. Assessment (1 hr)

  • Use projection to adapt a pre-trained model to a new input datatype.