Click the picture to see the presentation video!
This project presents an end-to-end data pipeline and analytics workflow centered around Formula 1 racing data, with a strong emphasis on exploratory data analysis (EDA) and visualization. It is structured into two core components: cloud-native data engineering and analytical data visualization.
On the data engineering side, we leverage Azure Cloud services to build a scalable and automated data pipeline following the medallion architecture (bronze, silver, gold layers). Raw data is ingested and stored in Azure Data Lake Storage Gen2, processed and transformed using Azure Databricks with PySpark and SparkSQL, and managed through Delta Lake to ensure ACID transactions and schema enforcement. We incorporate Unity Catalog for data governance and access control, while Azure Data Factory orchestrates the workflow to achieve full automation. The architecture demonstrates cloud-native best practices such as decoupled storage and compute, batch-stream unification, and automated job triggering, effectively realizing a modern Lakehouse design.
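To make the medallion flow concrete, here is a minimal bronze-to-silver sketch. It assumes a Databricks notebook where `spark` is predefined; the storage-account name, container names, and CSV column names are placeholders rather than the project's exact schema.

```python
# Bronze -> Silver sketch for one entity (races). Runs in a Databricks notebook
# where `spark` is predefined; paths and column names are illustrative only.
from pyspark.sql.functions import current_timestamp

raw_path = "abfss://raw@<storage-account>.dfs.core.windows.net"               # bronze layer
processed_path = "abfss://processed@<storage-account>.dfs.core.windows.net"   # silver layer

# Ingest the raw CSV (schema inferred here for brevity; a real job would declare it)
races_df = (spark.read
            .option("header", True)
            .option("inferSchema", True)
            .csv(f"{raw_path}/races.csv"))

# Standardize column names and add an audit column
races_clean = (races_df
               .withColumnRenamed("raceId", "race_id")
               .withColumnRenamed("year", "race_year")
               .withColumnRenamed("circuitId", "circuit_id")
               .withColumn("ingestion_date", current_timestamp()))

# Persist to the silver layer as a Delta table, partitioned by season
(races_clean.write
    .mode("overwrite")
    .format("delta")
    .partitionBy("race_year")
    .save(f"{processed_path}/races"))
```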
On the data analysis and visualization front, we utilize both Tableau and Python for different analytical tasks. Tableau connects directly to the Databricks-backed gold layer, enabling real-time, interactive BI dashboards that cover historical driver and team rankings, national-level aggregations, and top driver trends over time. For deeper statistical insights, Python's Matplotlib and Seaborn are used to explore multidimensional relationships, such as starting grid vs. final position, fastest lap vs. points, and stability of driver performance across seasons.
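As an illustration of the Python EDA, the following minimal sketch plots starting grid against final position and points. The flat CSV export (`race_results.csv`) and its column names are assumptions for the example, not the project's actual file.

```python
# EDA sketch: starting grid vs. final position, and points by grid slot.
# File name and column names (grid, position, points) are placeholders.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

results = pd.read_csv("race_results.csv")

# Keep classified finishers only (position is NaN for DNFs in this layout)
classified = results.dropna(subset=["position"])

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Relationship between starting grid slot and final classification
sns.scatterplot(data=classified, x="grid", y="position", alpha=0.3, ax=axes[0])
axes[0].set_title("Starting grid vs. final position")

# Distribution of points scored from each grid slot
sns.boxplot(data=classified, x="grid", y="points", ax=axes[1])
axes[1].set_title("Points by starting grid position")

plt.tight_layout()
plt.show()
```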
By integrating enterprise-grade data infrastructure with intuitive visual storytelling, this project not only showcases advanced data engineering capabilities but also delivers valuable insights into the world of Formula 1 through rich, interactive visualizations.
Core Feature | Core Highlights | Deliverables |
---|---|---|
1. Azure Data Pipeline Setup | - Production-grade pipeline using Azure services<br>- Step-by-step deployment guide with open-sourced code<br>- One-click deployment via Azure DevOps `.json` templates | - Azure DevOps `.json` templates in the `/devops/` folder (pipeline configs)<br>- `services_architecture.png`: Azure services architecture overview |
2. Databricks ETL Workflow | - End-to-end ETL with PySpark & SparkSQL<br>- Medallion architecture modeling (Bronze → Silver → Gold)<br>- Open-source Databricks notebooks and scripts | - `/src/` and `/dbc/azure-cloud-datapipeline-EDA.dbc` files for direct import<br>- `medallion_diagram.png`: architecture overview<br>- Modular ETL: `data_ingestion`, `data_transformation`, `data_modeling`, `data_analysis`, `config`, `utils` |
3. Data Orchestration via ADF | - Workflow automation with Azure Data Factory<br>- Scheduled, triggered pipelines (see the parameterized notebook sketch after this table)<br>- Full `.json` export for reproducibility | - `/ADF/` folder with pipeline `.json` files<br>- Import-ready ADF workflow<br>- Setup & execution flow in `/docs/dev/ADF-development-steps.md` |
4. BI Dashboard & EDA | - Tableau Public dashboard for interactive exploration<br>- Deep EDA with Python (Matplotlib, Seaborn)<br>- Accompanying analysis report in PDF format | - Tableau Public link & screenshots<br>- `/visualization/via_python/` notebooks for visual & statistical analysis<br>- `/visualization/f1_analysis_report.pdf` with insights |
5. Documentation & Knowledge Sharing | - Azure-native data engineering tutorials<br>- Concepts explained: Lakehouse, Delta Lake, Unity Catalog, Medallion Layers<br>- Best practices & reusable code patterns | - `/docs/dev/*.md`: step-by-step deployment guides<br>- `/docs/wiki/*.md`: big data & Azure concept explanations |
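The sketch below shows how the ADF-triggered, parameterized ingestion referenced in rows 2 and 3 could look inside a Databricks notebook. The widget name `p_file_date`, the container paths, and the file layout are illustrative assumptions, not the project's exact code.

```python
# Sketch of a parameterized ingestion notebook that Azure Data Factory could
# trigger per cutover date (e.g. 2021-03-21, 2021-03-28, 2021-04-18).
# Assumes a Databricks notebook (`spark` and `dbutils` predefined); names are placeholders.
dbutils.widgets.text("p_file_date", "2021-03-21")
v_file_date = dbutils.widgets.get("p_file_date")

raw_path = "abfss://raw@<storage-account>.dfs.core.windows.net"
processed_path = "abfss://processed@<storage-account>.dfs.core.windows.net"

results_df = (spark.read
              .option("header", True)
              .option("inferSchema", True)
              .csv(f"{raw_path}/{v_file_date}/results.csv"))

# Append this batch to the silver Delta table; re-runs of the same date would
# be handled with a Delta MERGE in the full pipeline (see the merge sketch below).
(results_df.write
    .mode("append")
    .format("delta")
    .save(f"{processed_path}/results"))

# Signal completion back to ADF
dbutils.notebook.exit("success")
```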
```
/bigdata-datawarehouse-project
├── README.md                          # Project overview and documentation links
├── LICENSE                            # MIT License file
├── ADF/                               # Azure Data Factory import-ready workflow .json templates
├── devops/                            # Azure services deployment .json templates
│   ├── ADF_src/                       # Azure Data Factory import-ready templates
│   └── azure_deployment/              # Azure cloud services & resources import-ready templates
├── src/                               # Source code directory
│   ├── README.md                      # Source code instruction overview
│   ├── data_ingestion/                # Data ingestion layer
│   ├── data_transformation/           # Data transformation layer
│   │   ├── processed_layer/           # Processed data transformations
│   │   └── presentation_layer/        # Presentation layer transformations
│   ├── data_analysis/                 # Data analysis and BI visualization
│   ├── data_modeling/                 # Data modeling and schema design
│   │   ├── env_setup/                 # Environment setup scripts
│   │   ├── raw_layer/                 # Raw data layer schemas
│   │   ├── processed_layer/           # Processed data layer schemas
│   │   └── presentation_layer/        # Presentation layer schemas
│   ├── config/                        # Configuration files
│   │   └── configuration.py           # Main configuration settings (sketched below)
│   ├── utils/                         # Utility functions and helpers
│   │   ├── 2021-03-21/                # March 21, 2021 dataset
│   │   ├── 2021-03-28/                # March 28, 2021 dataset
│   │   └── 2021-04-18/                # April 18, 2021 dataset
│   └── demo_code/                     # Demo and learning materials
├── visualization/                     # Data visualization
│   ├── via_python/                    # Jupyter notebooks for visualization via Python
│   ├── via_tableau/                   # Tableau dashboard .twb file
│   ├── f1_presentation(2021-04-18)/   # Data source for BI
│   └── generated_images/              # Images generated via Python
├── dataset/                           # Sample datasets (for incremental load)
└── docs/                              # Documentation directory
    ├── README.md                      # Documentation overview
    ├── dev/                           # Step-by-step deployment guides
    └── wiki/                          # Technical documentation wiki
```
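For orientation, here is a minimal sketch of what `src/config/configuration.py` might centralize. The storage-account name and variable names are assumptions, not the module's actual contents.

```python
# configuration.py (sketch) -- one place for layer paths so that ingestion,
# transformation, and presentation notebooks avoid hard-coded locations.
# The storage account and variable names below are illustrative only.
storage_account = "<storage-account>"

raw_folder_path = f"abfss://raw@{storage_account}.dfs.core.windows.net"
processed_folder_path = f"abfss://processed@{storage_account}.dfs.core.windows.net"
presentation_folder_path = f"abfss://presentation@{storage_account}.dfs.core.windows.net"
```

Other notebooks can then pull these constants in with Databricks' `%run` magic (for example `%run ../config/configuration`), which keeps layer paths out of the individual ETL notebooks.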
This project sets up a high-availability big data platform, including the following components:
Components/Services | Role | Version |
---|---|---|
Azure | Cloud service provider | - |
Azure Data Lake Storage Gen2 | Persistent storage for datasets | - |
Azure Data Factory | ETL pipeline scheduler | - |
Python | Programming language for core Spark job logic | - |
Apache Spark | Distributed computing | 3.3.0 |
Azure Databricks | Cluster compute workspace | - |
Delta Lake | Lakehouse architecture (see the merge sketch after this table) | - |
Unity Catalog | Data governance and access control | - |
Power BI | Data visualization | - |
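Because Delta Lake provides ACID transactions, incremental loads can be made idempotent with a `MERGE`. The sketch below continues the ingestion example above; the table path and join key are assumptions.

```python
# Idempotent upsert into a silver Delta table, so re-processing a file date
# does not create duplicates. Path and join key are illustrative; `results_df`
# is the incoming batch DataFrame from the ingestion sketch above.
from delta.tables import DeltaTable

target_path = "abfss://processed@<storage-account>.dfs.core.windows.net/results"

if DeltaTable.isDeltaTable(spark, target_path):
    target = DeltaTable.forPath(spark, target_path)
    (target.alias("tgt")
        .merge(results_df.alias("src"), "tgt.result_id = src.result_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
else:
    # First load: create the Delta table
    results_df.write.format("delta").mode("overwrite").save(target_path)
```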
Azure Databricks Cluster
- Databricks Cluster
- Databricks Cluster Configuration
- Databricks Cluster Pool
- Azure Databricks Pricing and Cost Control
- Azure Databricks Utilities
Data Access Control
- Authentication Configuration
- Authentication Service Principal
- Azure Databricks Cluster Scoped Credentials
Mounting Data Lake Container to Databricks
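A minimal sketch of the mounting step listed above, using service-principal (OAuth) credentials pulled from a Databricks secret scope. The scope name, key names, storage account, and container are placeholders; the wiki pages cover the actual procedure.

```python
# Mount an ADLS Gen2 container into DBFS using service-principal (OAuth) credentials.
# Secret scope, key names, storage account, and container are placeholders.
client_id     = dbutils.secrets.get(scope="f1-scope", key="client-id")
tenant_id     = dbutils.secrets.get(scope="f1-scope", key="tenant-id")
client_secret = dbutils.secrets.get(scope="f1-scope", key="client-secret")

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": client_id,
    "fs.azure.account.oauth2.client.secret": client_secret,
    "fs.azure.account.oauth2.client.endpoint":
        f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
}

# Mount the `raw` container; repeat for `processed` and `presentation`
dbutils.fs.mount(
    source="abfss://raw@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/<storage-account>/raw",
    extra_configs=configs,
)
```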
Data Lakehouse Architecture
This project is licensed under the MIT License - see the LICENSE file for details.
Created and maintained by Smars-Bin-Hu.