πŸ‘† click the picture to see the presentation video!

Cloud Native Data Pipeline on Azure Databricks for Exploratory Data Analysis


This project presents an end-to-end data pipeline and analytics workflow centered around Formula 1 racing data, with a strong emphasis on exploratory data analysis (EDA) and visualization. It is structured into two core components: cloud-native data engineering and analytical data visualization.

On the data engineering side, we leverage Azure Cloud services to build a scalable and automated data pipeline following the medallion architecture (bronze, silver, gold layers). Raw data is ingested and stored in Azure Data Lake Storage Gen2, processed and transformed using Azure Databricks with PySpark and SparkSQL, and managed through Delta Lake to ensure ACID transactions and schema enforcement. We incorporate Unity Catalog for data governance and access control, while Azure Data Factory orchestrates the workflow to achieve full automation. The architecture demonstrates cloud-native best practices such as decoupled storage and compute, batch-stream unification, and automated job triggering, effectively realizing a modern Lakehouse design.
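In the project these transformations run as PySpark notebooks over Delta tables on Databricks; the same bronze → silver → gold flow can be sketched in plain Python over a toy sample (the column names below are illustrative, not the project's actual schema):

```python
# Illustrative medallion flow: bronze (raw) -> silver (cleaned) -> gold (aggregated).
# The real pipeline does this with PySpark over Delta tables; here, plain Python over dicts.

# Bronze: raw ingested records, possibly dirty (string-typed fields, missing values).
bronze = [
    {"race_id": "1", "driver": "hamilton", "points": "25"},
    {"race_id": "1", "driver": "verstappen", "points": "18"},
    {"race_id": "2", "driver": "hamilton", "points": "18"},
    {"race_id": "2", "driver": "verstappen", "points": None},  # missing result
]

# Silver: enforce the schema (typed columns), drop invalid rows, normalize names.
silver = [
    {"race_id": int(r["race_id"]), "driver": r["driver"].title(), "points": int(r["points"])}
    for r in bronze
    if r["points"] is not None
]

# Gold: business-level aggregate, e.g. total points per driver.
gold: dict[str, int] = {}
for row in silver:
    gold[row["driver"]] = gold.get(row["driver"], 0) + row["points"]

print(gold)  # {'Hamilton': 43, 'Verstappen': 18}
```

The same layering applies at scale: bronze keeps data as ingested, silver enforces schema and quality, and gold holds the aggregates that Tableau reads.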

On the data analysis and visualization front, we utilize both Tableau and Python for different analytical tasks. Tableau connects directly to the Databricks-backed gold layer, enabling real-time, interactive BI dashboards that cover historical driver and team rankings, national-level aggregations, and top driver trends over time. For deeper statistical insights, Python’s Matplotlib and Seaborn are used to explore multidimensional relationships, such as starting grid vs. final position, fastest lap vs. points, and stability of driver performance across seasons.
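As a flavor of the Python-side EDA, the starting-grid-vs-final-position relationship can be probed with a simple Pearson correlation. The numbers below are made up for illustration; the project computes this from the gold-layer data and visualizes it with Matplotlib and Seaborn:

```python
import statistics

# Hypothetical sample: starting grid slot vs. final race position for eight drivers.
grid   = [1, 2, 3, 4, 5, 8, 10, 15]
finish = [1, 3, 2, 5, 4, 9, 12, 14]

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(grid, finish)
print(f"grid vs. finish correlation: r = {r:.3f}")  # r close to +1: grid slot strongly predicts finish
```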

By integrating enterprise-grade data infrastructure with intuitive visual storytelling, this project not only showcases advanced data engineering capabilities but also delivers valuable insights into the world of Formula 1 through rich, interactive visualizations.

| βœ… Core Feature | πŸ”₯ Core Highlights | πŸ“¦ Deliverables |
| --- | --- | --- |
| 1. Azure Data Pipeline Setup | - Production-grade pipeline using Azure services<br>- Step-by-step deployment guide with open-sourced code<br>- One-click deployment via Azure DevOps `.json` templates | - `/devops/` folder: Azure DevOps `.json` pipeline configs<br>- `services_architecture.png`: Azure services architecture overview |
| 2. Databricks ETL Workflow | - End-to-end ETL with PySpark & SparkSQL<br>- Medallion architecture modeling (Bronze β†’ Silver β†’ Gold)<br>- Open-source Databricks notebooks and scripts | - `/src/` and `/dbc/azure-cloud-datapipeline-EDA.dbc` files for direct import<br>- `medallion_diagram.png`: architecture overview<br>- Modular ETL: `data_ingestion`, `data_transformation`, `data_modeling`, `data_analysis`, `config`, `utils` |
| 3. Data Orchestration via ADF | - Workflow automation with Azure Data Factory<br>- Scheduled, triggered pipelines<br>- Full `.json` export for reproducibility | - `/ADF/` folder with pipeline `.json` files<br>- Import-ready ADF workflow<br>- Setup & execution flow in `/docs/dev/ADF-development-steps.md` |
| 4. BI Dashboard & EDA | - Tableau Public dashboard for interactive exploration<br>- Deep EDA with Python (Matplotlib, Seaborn)<br>- Accompanying analysis report in PDF format | - Tableau Public link & screenshots<br>- `/visualization/via_python/` notebooks for visual/statistical analysis<br>- `/visualization/f1_analysis_report.pdf` with insights |
| 5. Documentation & Knowledge Sharing | - Azure-native data engineering tutorials<br>- Concepts explained: Lakehouse, Delta Lake, Unity Catalog, Medallion Layers<br>- Best practices & reusable code patterns | - `/docs/dev/*.md`: step-by-step deployment guides<br>- `/docs/wiki/*.md`: big data & Azure concept explanations |
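The exported ADF templates follow Data Factory's standard resource JSON. A heavily trimmed, illustrative example of the shape of a pipeline definition (the pipeline, activity, and linked-service names here are hypothetical, not the project's actual ones):

```json
{
  "name": "pl_ingest_f1_data",
  "properties": {
    "activities": [
      {
        "name": "run_ingestion_notebook",
        "type": "DatabricksNotebook",
        "typeProperties": {
          "notebookPath": "/Repos/.../data_ingestion/ingest_races"
        },
        "linkedServiceName": {
          "referenceName": "ls_databricks",
          "type": "LinkedServiceReference"
        }
      }
    ]
  }
}
```

The `.json` files in `/ADF/` are import-ready versions of definitions like this, together with their triggers and linked services.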

Project Structure

```
/azure-cloud-datapipeline-EDA
β”œβ”€β”€ πŸ“„ README.md                           # Project overview and documentation links
β”œβ”€β”€ πŸ“„ LICENSE                             # MIT License file
β”œβ”€β”€ πŸ“ ADF/                                # Azure Data Factory import-ready workflow .json templates
β”œβ”€β”€ πŸ“ devops/                             # Azure services deployment .json templates
β”‚   β”œβ”€β”€ πŸ“ ADF_src/                        # Azure Data Factory import-ready templates
β”‚   └── πŸ“ azure_deployment/               # Azure cloud services & resources import-ready templates
β”œβ”€β”€ πŸ“ src/                                # Source code directory
β”‚   β”œβ”€β”€ πŸ“„ README.md                       # Source code instruction overview
β”‚   β”œβ”€β”€ πŸ“ data_ingestion/                 # Data ingestion layer
β”‚   β”œβ”€β”€ πŸ“ data_transformation/            # Data transformation layer
β”‚   β”‚   β”œβ”€β”€ πŸ“ processed_layer/            # Processed data transformations
β”‚   β”‚   └── πŸ“ presentation_layer/         # Presentation layer transformations
β”‚   β”œβ”€β”€ πŸ“ data_analysis/                  # Data analysis and BI visualization
β”‚   β”œβ”€β”€ πŸ“ data_modeling/                  # Data modeling and schema design
β”‚   β”‚   β”œβ”€β”€ πŸ“ env_setup/                  # Environment setup scripts
β”‚   β”‚   β”œβ”€β”€ πŸ“ raw_layer/                  # Raw data layer schemas
β”‚   β”‚   β”œβ”€β”€ πŸ“ processed_layer/            # Processed data layer schemas
β”‚   β”‚   └── πŸ“ presentation_layer/         # Presentation layer schemas
β”‚   β”œβ”€β”€ πŸ“ config/                         # Configuration files
β”‚   β”‚   └── πŸ“„ configuration.py            # Main configuration settings
β”‚   β”œβ”€β”€ πŸ“ utils/                          # Utility functions and helpers
β”‚   β”‚   β”œβ”€β”€ πŸ“ 2021-03-21/                 # March 21, 2021 dataset
β”‚   β”‚   β”œβ”€β”€ πŸ“ 2021-03-28/                 # March 28, 2021 dataset
β”‚   β”‚   └── πŸ“ 2021-04-18/                 # April 18, 2021 dataset
β”‚   └── πŸ“ demo_code/                      # Demo and learning materials
β”œβ”€β”€ πŸ“ visualization/                      # Data visualization
β”‚   β”œβ”€β”€ πŸ“ via_python/                     # Jupyter notebooks for visualization via Python
β”‚   β”œβ”€β”€ πŸ“ via_tableau/                    # Tableau dashboard .twb file
β”‚   β”œβ”€β”€ πŸ“ f1_presentation(2021-04-18)/    # Data source for BI
β”‚   └── πŸ“ generated_images/               # Images generated via Python
β”œβ”€β”€ πŸ“ dataset/                            # Sample datasets (for incremental load)
└── πŸ“ docs/                               # Documentation directory
    β”œβ”€β”€ πŸ“„ README.md                       # Documentation overview
    └── πŸ“ doc/
        └── πŸ“ wiki/                       # Technical documentation wiki
```

Core Deliverables

1. Deliverables I

2. Deliverables II

3. Deliverables III

Tech Stack

This project builds a cloud-native data platform on Azure, including the following components:


| Components/Services | Features | Version |
| --- | --- | --- |
| Azure Cloud | Service provider | - |
| Azure Data Lake Storage Gen2 | Persistent storage for datasets | - |
| Azure Data Factory | ETL pipeline scheduler | - |
| Python | Programming for core Spark job logic | - |
| Apache Spark | Distributed computing | 3.3.0 |
| Azure Databricks | Cluster compute workspace | - |
| Delta Lake | Lakehouse architecture | - |
| Unity Catalog | Data governance & access control | - |
| Power BI | Data visualization | - |

Project Documents /docs

1. wiki

Azure Databricks Cluster

Data Access Control

Mounting a data lake container to Databricks

Data Lakehouse Architecture

License

This project is licensed under the MIT License - see the LICENSE file for details.
Created and maintained by Smars-Bin-Hu.