πŸ‘† click the picture to see the presentation video!

Cloud Native Data Pipeline on Azure Databricks for Exploratory Data Analysis


This project presents an end-to-end data pipeline and analytics workflow centered around Formula 1 racing data, with a strong emphasis on exploratory data analysis (EDA) and visualization. It is structured into two core components: cloud-native data engineering and analytical data visualization.

On the data engineering side, we leverage Azure Cloud services to build a scalable and automated data pipeline following the medallion architecture (bronze, silver, gold layers). Raw data is ingested and stored in Azure Data Lake Storage Gen2, processed and transformed using Azure Databricks with PySpark and SparkSQL, and managed through Delta Lake to ensure ACID transactions and schema enforcement. We incorporate Unity Catalog for data governance and access control, while Azure Data Factory orchestrates the workflow to achieve full automation. The architecture demonstrates cloud-native best practices such as decoupled storage and compute, batch-stream unification, and automated job triggering, effectively realizing a modern Lakehouse design.
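In the project these transformations run as PySpark notebooks over Delta tables on Databricks; the same bronze → silver → gold flow can be sketched in plain Python over a toy sample (the column names below are illustrative, not the project's actual schema):

```python
# Illustrative medallion flow: bronze (raw) -> silver (cleaned) -> gold (aggregated).
# The real pipeline does this with PySpark over Delta tables; here, plain Python over dicts.

# Bronze: raw ingested records, possibly dirty (string-typed fields, missing values).
bronze = [
    {"race_id": "1", "driver": "hamilton", "points": "25"},
    {"race_id": "1", "driver": "verstappen", "points": "18"},
    {"race_id": "2", "driver": "hamilton", "points": "18"},
    {"race_id": "2", "driver": "verstappen", "points": None},  # missing result
]

# Silver: enforce the schema (typed columns), drop invalid rows, normalize names.
silver = [
    {"race_id": int(r["race_id"]), "driver": r["driver"].title(), "points": int(r["points"])}
    for r in bronze
    if r["points"] is not None
]

# Gold: business-level aggregate, e.g. total points per driver.
gold: dict[str, int] = {}
for row in silver:
    gold[row["driver"]] = gold.get(row["driver"], 0) + row["points"]

print(gold)  # {'Hamilton': 43, 'Verstappen': 18}
```

The same layering applies at scale: bronze keeps data as ingested, silver enforces schema and quality, and gold holds the aggregates that Tableau reads.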

On the data analysis and visualization front, we utilize both Tableau and Python for different analytical tasks. Tableau connects directly to the Databricks-backed gold layer, enabling real-time, interactive BI dashboards that cover historical driver and team rankings, national-level aggregations, and top driver trends over time. For deeper statistical insights, Python’s Matplotlib and Seaborn are used to explore multidimensional relationships, such as starting grid vs. final position, fastest lap vs. points, and stability of driver performance across seasons.
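As a flavor of the Python-side EDA, the starting-grid-vs-final-position relationship can be probed with a simple Pearson correlation. The numbers below are made up for illustration; the project computes this from the gold-layer data and visualizes it with Matplotlib and Seaborn:

```python
import statistics

# Hypothetical sample: starting grid slot vs. final race position for eight drivers.
grid   = [1, 2, 3, 4, 5, 8, 10, 15]
finish = [1, 3, 2, 5, 4, 9, 12, 14]

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(grid, finish)
print(f"grid vs. finish correlation: r = {r:.3f}")  # r close to +1: grid slot strongly predicts finish
```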

By integrating enterprise-grade data infrastructure with intuitive visual storytelling, this project not only showcases advanced data engineering capabilities but also delivers valuable insights into the world of Formula 1 through rich, interactive visualizations.

| βœ… Core Feature | πŸ”₯ Core Highlights | πŸ“¦ Deliverables |
| --- | --- | --- |
| 1. Azure Data Pipeline Setup | - Production-grade pipeline using Azure services<br>- Step-by-step deployment guide with open-sourced code<br>- One-click deployment via Azure DevOps `.json` templates | - `/devops/` folder: Azure DevOps `.json` pipeline configs<br>- `services_architecture.png`: Azure services architecture overview |
| 2. Databricks ETL Workflow | - End-to-end ETL with PySpark & SparkSQL<br>- Medallion architecture modeling (Bronze β†’ Silver β†’ Gold)<br>- Open-source Databricks notebooks and scripts | - `/src/` and `/dbc/azure-cloud-datapipeline-EDA.dbc` files for direct import<br>- `medallion_diagram.png`: architecture overview<br>- Modular ETL: `data_ingestion`, `data_transformation`, `data_modeling`, `data_analysis`, `config`, `utils` |
| 3. Data Orchestration via ADF | - Workflow automation with Azure Data Factory<br>- Scheduled, triggered pipelines<br>- Full `.json` export for reproducibility | - `/ADF/` folder with pipeline `.json` files<br>- Import-ready ADF workflow<br>- Setup & execution flow in `/docs/dev/ADF-development-steps.md` |
| 4. BI Dashboard & EDA | - Tableau Public dashboard for interactive exploration<br>- Deep EDA with Python (Matplotlib, Seaborn)<br>- Accompanying analysis report in PDF format | - Tableau Public link & screenshots<br>- `/visualization/via_python/` notebooks for visual/statistical analysis<br>- `/visualization/f1_analysis_report.pdf` with insights |
| 5. Documentation & Knowledge Sharing | - Azure-native data engineering tutorials<br>- Concepts explained: Lakehouse, Delta Lake, Unity Catalog, Medallion Layers<br>- Best practices & reusable code patterns | - `/docs/dev/*.md`: step-by-step deployment guides<br>- `/docs/wiki/*.md`: big data & Azure concept explanations |
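The exported ADF templates follow Data Factory's standard resource JSON. A heavily trimmed, illustrative example of the shape of a pipeline definition (the pipeline, activity, and linked-service names here are hypothetical, not the project's actual ones):

```json
{
  "name": "pl_ingest_f1_data",
  "properties": {
    "activities": [
      {
        "name": "run_ingestion_notebook",
        "type": "DatabricksNotebook",
        "typeProperties": {
          "notebookPath": "/Repos/.../data_ingestion/ingest_races"
        },
        "linkedServiceName": {
          "referenceName": "ls_databricks",
          "type": "LinkedServiceReference"
        }
      }
    ]
  }
}
```

The `.json` files in `/ADF/` are import-ready versions of definitions like this, together with their triggers and linked services.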

Project Structure

```
/azure-cloud-datapipeline-EDA
β”œβ”€β”€ πŸ“„ README.md                           # Project overview and documentation links
β”œβ”€β”€ πŸ“„ LICENSE                             # MIT License file
β”œβ”€β”€ πŸ“ ADF/                                # Azure Data Factory import-ready workflow .json templates
β”œβ”€β”€ πŸ“ devops/                             # Azure services deployment .json templates
β”‚   β”œβ”€β”€ πŸ“ ADF_src/                        # Azure Data Factory import-ready templates
β”‚   └── πŸ“ azure_deployment/               # Azure cloud services & resources import-ready templates
β”œβ”€β”€ πŸ“ src/                                # Source code directory
β”‚   β”œβ”€β”€ πŸ“„ README.md                       # Source code instruction overview
β”‚   β”œβ”€β”€ πŸ“ data_ingestion/                 # Data ingestion layer
β”‚   β”œβ”€β”€ πŸ“ data_transformation/            # Data transformation layer
β”‚   β”‚   β”œβ”€β”€ πŸ“ processed_layer/            # Processed data transformations
β”‚   β”‚   └── πŸ“ presentation_layer/         # Presentation layer transformations
β”‚   β”œβ”€β”€ πŸ“ data_analysis/                  # Data analysis and BI visualization
β”‚   β”œβ”€β”€ πŸ“ data_modeling/                  # Data modeling and schema design
β”‚   β”‚   β”œβ”€β”€ πŸ“ env_setup/                  # Environment setup scripts
β”‚   β”‚   β”œβ”€β”€ πŸ“ raw_layer/                  # Raw data layer schemas
β”‚   β”‚   β”œβ”€β”€ πŸ“ processed_layer/            # Processed data layer schemas
β”‚   β”‚   └── πŸ“ presentation_layer/         # Presentation layer schemas
β”‚   β”œβ”€β”€ πŸ“ config/                         # Configuration files
β”‚   β”‚   └── πŸ“„ configuration.py            # Main configuration settings
β”‚   β”œβ”€β”€ πŸ“ utils/                          # Utility functions and helpers
β”‚   β”‚   β”œβ”€β”€ πŸ“ 2021-03-21/                 # March 21, 2021 dataset
β”‚   β”‚   β”œβ”€β”€ πŸ“ 2021-03-28/                 # March 28, 2021 dataset
β”‚   β”‚   └── πŸ“ 2021-04-18/                 # April 18, 2021 dataset
β”‚   └── πŸ“ demo_code/                      # Demo and learning materials
β”œβ”€β”€ πŸ“ visualization/                      # Data visualization
β”‚   β”œβ”€β”€ πŸ“ via_python/                     # Jupyter notebooks for visualization via Python
β”‚   β”œβ”€β”€ πŸ“ via_tableau/                    # Tableau dashboard .twb file
β”‚   β”œβ”€β”€ πŸ“ f1_presentation(2021-04-18)/    # Data source for BI
β”‚   └── πŸ“ generated_images/               # Images generated via Python
β”œβ”€β”€ πŸ“ dataset/                            # Sample datasets (for incremental load)
└── πŸ“ docs/                               # Documentation directory
    β”œβ”€β”€ πŸ“„ README.md                       # Documentation overview
    └── πŸ“ doc/
        └── πŸ“ wiki/                       # Technical documentation wiki
```

Core Deliverables

1. Deliverables I

2. Deliverables II

3. Deliverables III

Tech Stack

This project builds a cloud-native data platform on Azure, including the following components:


| Components/Services | Features | Version |
| --- | --- | --- |
| Azure Cloud | Service provider | - |
| Azure Data Lake Storage Gen2 | Persistent storage for datasets | - |
| Azure Data Factory | ETL pipeline scheduler | - |
| Python | Programming for core Spark job logic | - |
| Apache Spark | Distributed computing | 3.3.0 |
| Azure Databricks | Cluster compute workspace | - |
| Delta Lake | Lakehouse architecture | - |
| Unity Catalog | Data governance & access control | - |
| Power BI | Data visualization | - |

Project Documents /docs

1. wiki

Azure Databricks Cluster

Data Access Control

Mounting a data lake container to Databricks

Data Lakehouse Architecture

License

This project is licensed under the MIT License - see the LICENSE file for details.
Created and maintained by Smars-Bin-Hu.