Course Overview
This hackathon challenges attendees to design, implement, and optimize robust data processing and transformation pipelines on the Databricks platform within the Microsoft Azure ecosystem. Participants will tackle real-world scenarios involving diverse data sources, focusing on efficient ingestion, sophisticated transformations, and preparing data for analytical consumption. The emphasis will be on leveraging Databricks' capabilities for scalable data manipulation and ensuring data quality.
By participating, attendees will gain hands-on expertise in building high-performance, maintainable data pipelines crucial for deriving actionable insights. This practical experience builds proficiency in modern data engineering practices, enabling participants to unlock the full potential of their data assets and drive data-driven decision-making within their organizations.
Who should attend
- Data Engineers: This is the core target audience. The challenges directly align with their daily tasks, such as building and optimizing ETL/ELT pipelines, managing data lakes, and ensuring data quality.
- Data Analysts: Those who are looking to move beyond simple analysis and gain a deeper understanding of the data pipeline "plumbing". This hackathon will help them understand how data is prepared, leading to more effective and robust analysis.
- Data Scientists: While their primary focus is on modeling, many data scientists are involved in data preparation. This event will help them build more efficient pipelines for feature engineering and data preprocessing, which is a significant part of their workflow.
Prerequisites
To be successful and get the most out of the event, participants should have:
- Relational database knowledge: Understanding of concepts like tables, joins, and SQL.
- Programming experience: Proficiency in a language such as Python or Scala.
- ETL pipeline concepts: Familiarity with data transformation in ETL/ELT pipelines is recommended.
- Cloud fundamentals: Familiarity with Azure is recommended.
Course Objectives
This hackathon embodies the modern approach to data processing: the Databricks platform, powered by innovation in the Apache Spark ecosystem and available on all major cloud hyperscalers, here running on Microsoft Azure. By the end of this hackathon, participants will gain practical skills and a deeper understanding of modern data engineering on Databricks within Azure, specifically to:
- Master data ingestion from diverse sources and establish robust multi-layered Delta Lake architectures (Bronze, Silver, Gold).
- Proficiently perform complex cleaning, standardization, and feature engineering using Spark to prepare high-quality data for analytical consumption.
- Design and implement automated, incremental data loading and transformation jobs to ensure efficiency and data freshness.
- Integrate robust error handling and basic telemetry to build reliable pipelines and effectively monitor their health and performance.
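The objectives above, in particular incremental Bronze-to-Silver promotion with cleaning, validation, and a watermark, can be sketched in plain Python. This is a simplified, hypothetical illustration: in-memory lists stand in for Delta tables, and all record fields and helper names are assumptions; a real pipeline would use PySpark DataFrames with Delta Lake MERGE or Structured Streaming.

```python
from datetime import datetime

# Hypothetical in-memory stand-ins for Bronze and Silver Delta tables.
bronze = [
    {"id": 1, "amount": " 42.50 ", "updated_at": "2024-01-01T10:00:00"},
    {"id": 2, "amount": "bad",     "updated_at": "2024-01-02T11:00:00"},
    {"id": 3, "amount": "7.25",    "updated_at": "2024-01-03T12:00:00"},
]
silver = []
# High-water mark persisted from the previous run of the job.
watermark = datetime(2024, 1, 1, 10, 0, 0)

def clean(row):
    """Standardize a raw Bronze record; return None if validation fails."""
    try:
        return {"id": row["id"], "amount": float(row["amount"].strip())}
    except ValueError:
        # In a real pipeline, route bad records to a quarantine table
        # and emit a metric/log entry (basic telemetry).
        return None

def incremental_load(bronze, silver, watermark):
    """Promote only records newer than the watermark (incremental load)."""
    new_watermark = watermark
    for row in bronze:
        ts = datetime.fromisoformat(row["updated_at"])
        if ts <= watermark:
            continue  # already processed in an earlier run
        cleaned = clean(row)
        if cleaned is not None:
            silver.append(cleaned)
        new_watermark = max(new_watermark, ts)
    return new_watermark

watermark = incremental_load(bronze, silver, watermark)
```

After the run, only the valid, newly arrived record reaches Silver, and the advanced watermark ensures the next run skips everything already seen, the same freshness-plus-efficiency trade-off the automated jobs objective describes.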