Detailed Course Outline
1. Early and Late Fusion (1 hr)
- Use camera and LiDAR data to predict object positions.
- Convert various data types into neural-network-ready formats.
2. Intermediate Fusion (1 hr)
- Explore the theory behind effective multimodal model architecture.
- Train a model using contrastive pretraining.
- Create a vector database.
3. Cross-modal Projection (2 hr)
- Convert a language model into a Vision Language Model (VLM).
- Process PDFs with Optical Character Recognition (OCR) tools.
4. Model Orchestration (2 hr)
- Analyze video using Cosmos Nemotron.
- Use Video Search and Summarization (VSS) to answer user queries about video content.
- Orchestrate with NVIDIA AI Blueprints.
5. Assessment (1 hr)
- Adapt a pre-trained model to accept a different data type as input using projection.