Modern Data Warehousing / Data Engineering with Databricks on Azure Hackathon (MDWDE-HACK-AZURE) – Outline

Detailed Course Outline

This hackathon is structured as a progressive journey, designed to immerse participants in the practical application of cutting-edge data processing and transformation techniques on Databricks within the Azure ecosystem. The challenges are interconnected and build directly upon one another. Participants will step into the shoes of a newly established digital department within a rapidly growing retail corporation, tasked with transforming raw data into valuable assets.

Participants will gradually gain access to the necessary Databricks features and Azure resources to address each challenge, fostering a hands-on learning environment. They will handle real-world data scenarios, from initial ingestion to complex transformations, culminating in reliable, analytics-ready data. Each step will reinforce their understanding and practical skills, equipping them to fully utilize Databricks for effective data solutions.

  • Challenge 1: Data Ingestion and Bronze Layer Creation
    • In this initial challenge, participants will focus on establishing the foundation of their data lake. They will ingest raw data from various Azure sources into Databricks. The primary goal is to land the data in its original, untransformed format, creating a "Bronze" layer in Delta Lake. This challenge emphasizes robust data loading and schema inference.
  • Challenge 2: Data Cleaning and Standardization (Silver Layer)
    • Building upon the Bronze layer, this challenge focuses on cleaning and standardizing the ingested data. Participants will identify and address common data quality issues such as missing values, inconsistencies, and incorrect data types. The cleaned and standardized data will form the "Silver" layer in Delta Lake, ready for more complex transformations.
  • Challenge 3: Advanced Data Transformation and Harmonization (Gold Layer Refinement)
    • Building on the previous challenges, participants will now focus on transforming the standardized data (from the Silver layer) into a "Gold" layer that adheres to a common data warehouse schema. This layer will be specifically designed to support the generation of key business reports and analytical insights. The emphasis will be on complex transformations, aggregations, and ensuring data is well shaped for consumption, while automating the orchestration of this data flow using Databricks notebooks and jobs.
  • Challenge 4: Incremental Processing and Pipeline Robustness
    • As data volumes grow and new data arrives continuously, it becomes critical to process only the changes rather than reloading entire datasets. In this challenge, participants will enhance their existing pipelines to support differential (incremental) data loads into the Bronze, Silver, and Gold layers. They will also focus on making their data pipelines more robust by implementing comprehensive error handling and basic telemetry to monitor the health and performance of the data flow within Databricks.
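Challenge 1's emphasis on schema inference can be illustrated with a minimal, library-free Python sketch. On Databricks this work is typically done by `spark.read` with `inferSchema` enabled or by Auto Loader; the `infer_schema` helper and sample records below are hypothetical, meant only to show the idea of widening a column's type as conflicting values arrive:

```python
# Minimal sketch of schema inference over raw records, as a Bronze-layer
# loader might perform it. Hypothetical helper, not the Databricks API.

def infer_schema(records):
    """Infer a column -> type-name mapping from a list of dicts."""
    rank = {"int": 0, "float": 1, "str": 2}  # widen toward string
    schema = {}
    for row in records:
        for col, value in row.items():
            if isinstance(value, bool) or value is None:
                inferred = "str"  # treat booleans/unknowns conservatively
            elif isinstance(value, int):
                inferred = "int"
            elif isinstance(value, float):
                inferred = "float"
            else:
                inferred = "str"
            prev = schema.get(col, "int")
            # keep the wider of the previous and the newly inferred type
            schema[col] = max(prev, inferred, key=lambda t: rank[t])
    return schema

raw = [
    {"order_id": 1, "amount": 10},
    {"order_id": 2, "amount": 12.5},   # widens amount to float
    {"order_id": 3, "amount": "N/A"},  # widens amount to string
]
print(infer_schema(raw))  # {'order_id': 'int', 'amount': 'str'}
```

Landing data with a permissive, inferred schema like this is what keeps the Bronze layer faithful to the raw input.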
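The Silver-layer cleanup in Challenge 2 can be sketched in plain Python. On Databricks this logic would be expressed with PySpark DataFrame operations (`fillna`, `trim`/`upper`, `cast`); the record shape and cleaning rules below are illustrative assumptions, not part of the hackathon dataset:

```python
# Conceptual sketch of Silver-layer standardization: fill missing
# values, normalize inconsistent strings, and cast types. Column names
# ("country", "quantity") are hypothetical examples.

def standardize(record):
    cleaned = dict(record)
    # Normalize free-text country codes to a canonical upper-case form.
    cleaned["country"] = (cleaned.get("country") or "UNKNOWN").strip().upper()
    # Cast quantity to int, defaulting to 0 for bad or missing values.
    try:
        cleaned["quantity"] = int(cleaned.get("quantity"))
    except (TypeError, ValueError):
        cleaned["quantity"] = 0
    return cleaned

bronze_rows = [
    {"country": " de ", "quantity": "3"},
    {"country": None, "quantity": None},
]
silver_rows = [standardize(r) for r in bronze_rows]
print(silver_rows)
# [{'country': 'DE', 'quantity': 3}, {'country': 'UNKNOWN', 'quantity': 0}]
```

The point is that every Silver row conforms to one agreed schema, so downstream Gold transformations never have to re-handle nulls or type errors.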
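A Gold-layer aggregation of the kind Challenge 3 targets can be sketched as rolling Silver line items up to one business-ready figure per dimension. In PySpark this would be a `groupBy(...).agg(sum(...))`; the column names and sample rows here are assumptions for illustration:

```python
# Sketch of a Gold-layer aggregation: total revenue per product
# category, the shape a business report would consume directly.
from collections import defaultdict

def revenue_by_category(line_items):
    totals = defaultdict(float)
    for item in line_items:
        totals[item["category"]] += item["quantity"] * item["unit_price"]
    return dict(totals)

silver = [
    {"category": "toys", "quantity": 2, "unit_price": 9.99},
    {"category": "toys", "quantity": 1, "unit_price": 4.50},
    {"category": "books", "quantity": 3, "unit_price": 12.00},
]
print(revenue_by_category(silver))
```

In the hackathon, an aggregation like this would be materialized as a Delta table and refreshed by a scheduled Databricks job rather than computed ad hoc.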
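The incremental-load idea at the heart of Challenge 4 can be sketched with a simple high-watermark: remember the newest timestamp already processed and pick up only newer rows on each run. On Databricks this role is usually played by Auto Loader / Structured Streaming checkpoints or a `MERGE INTO` on a Delta table; the in-memory watermark and row shape below are hypothetical stand-ins:

```python
# Sketch of watermark-based incremental processing: each run reads only
# rows newer than the last processed timestamp, instead of reloading
# the full dataset.

def load_increment(source_rows, watermark):
    """Return rows newer than the watermark, plus the new watermark."""
    fresh = [r for r in source_rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in fresh), default=watermark)
    return fresh, new_watermark

rows = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 205},
    {"id": 3, "updated_at": 310},
]
# First run: everything after timestamp 0 is new.
batch, wm = load_increment(rows, watermark=0)
print(len(batch), wm)  # 3 310
# Second run: nothing changed, so nothing is reprocessed.
batch, wm = load_increment(rows, watermark=wm)
print(len(batch), wm)  # 0 310
```

Persisting the watermark (or letting a streaming checkpoint do it) is also what makes the pipeline restartable after a failure, which ties directly into the error-handling goals of this challenge.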