This project analyzes how artificial intelligence could impact job security across various jobs and industries using an end-to-end ETL (Extract, Transform, Load) pipeline, capped with an interactive Tableau dashboard to showcase insights and results. The pipeline ingests synthetic job market data that simulates survey responses, then analyzes those responses to explore how common AI tools like ChatGPT, Claude, and others might affect different professions, industries, and experience levels. The project has two main goals: to gain experience building data pipelines with modern data engineering tools, and to produce a pipeline that can identify which job roles, industries, and experience levels are most at risk of AI replacement and generate real, useful insights.
EXTRACT
Step 1: Fake Data Generation and CSV Export
- Environment: Jupyter Notebook
- Tools: Python, Faker, Pandas
- What I Did:
- Used Python and the Faker library to generate 1,000 synthetic data entries with information such as name, job title, industry, salary, location, and each respondent's self-assessed risk of replacement by AI on a scale of 1-5 (see the sketch below)
- Exported the final dataframe to a CSV file, simulating a dataset of survey responses ready for ingestion into the pipeline
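A minimal sketch of the generation step, assuming illustrative field names and value ranges (e.g. ai_replacement_risk, the industry list); years_experience and remote_ratio are extra assumed fields that feed the calculated columns created later in the Transform step:

```python
import random

import pandas as pd
from faker import Faker

fake = Faker()
Faker.seed(42)   # reproducible names/titles/cities
random.seed(42)  # reproducible numeric fields

# Illustrative category list -- the real project may use different values
INDUSTRIES = ["Technology", "Healthcare", "Finance", "Education", "Retail"]

rows = [
    {
        "name": fake.name(),
        "job_title": fake.job(),
        "industry": random.choice(INDUSTRIES),
        "salary": random.randint(30_000, 200_000),
        "location": fake.city(),
        "years_experience": random.randint(0, 30),    # assumed field
        "remote_ratio": random.choice([0, 50, 100]),  # assumed field
        "ai_replacement_risk": random.randint(1, 5),  # self-reported, 1-5 scale
    }
    for _ in range(1_000)
]

# Export the synthetic "survey" for ingestion into the pipeline
pd.DataFrame(rows).to_csv("ai_job_survey.csv", index=False)
```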
Step 2: Configure Azure PostgreSQL Flexible Server & Connect to Azure Data Studio
- Environments: Azure Portal, Azure Data Studio
- What I Did:
- Provisioned an Azure PostgreSQL Flexible Server with appropriate region, compute, and storage for a free Azure account
- Set my administrative credentials, enabled public network access, and added a firewall rule allowing traffic from my IP address
- Installed the PostgreSQL extension in Azure Data Studio and connected to my Azure PostgreSQL Flexible Server using the database connection parameters, allowing me to manage the database directly in ADS (see the connectivity check below)
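After provisioning, a quick connectivity check like the one below can confirm the credentials and firewall rules work; all connection values are placeholders:

```python
import psycopg2

# Placeholder parameters -- substitute the real server name and credentials
conn = psycopg2.connect(
    host="<server-name>.postgres.database.azure.com",
    dbname="postgres",
    user="<admin-user>",
    password="<password>",
    sslmode="require",  # Azure PostgreSQL Flexible Server requires SSL by default
)

with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])  # e.g. "PostgreSQL 16.x ..."

conn.close()
```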
TRANSFORM
Step 3: Schema and Table Creation
- Environment: Azure Data Studio
- Tools: SQL
- What I Did:
- Created a schema and defined a target table based on the structure of the CSV containing the synthesized data (see the sketch below)
- Used appropriate column data types (VARCHAR, INT, etc.) and added constraints
- Ensured the table was ready to accept clean, transformed data
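A hedged sketch of the DDL, executed through SQLAlchemy so the whole pipeline stays in Python; the schema name (ai_survey), table name (responses), and column definitions are assumptions mirroring the synthesized CSV, not the project's exact schema:

```python
from sqlalchemy import create_engine, text

# Placeholder connection string -- substitute real server values
engine = create_engine(
    "postgresql+psycopg2://<admin-user>:<password>@"
    "<server-name>.postgres.database.azure.com:5432/postgres"
)

# Assumed schema, table, and column names mirroring the synthesized CSV;
# the last three columns hold the calculated fields added in the Transform step
ddl = """
CREATE SCHEMA IF NOT EXISTS ai_survey;

CREATE TABLE IF NOT EXISTS ai_survey.responses (
    id                  SERIAL PRIMARY KEY,
    name                VARCHAR(100) NOT NULL,
    job_title           VARCHAR(100) NOT NULL,
    industry            VARCHAR(50)  NOT NULL,
    salary              INT          CHECK (salary > 0),
    location            VARCHAR(100),
    years_experience    INT,
    remote_ratio        INT,
    ai_replacement_risk INT          CHECK (ai_replacement_risk BETWEEN 1 AND 5),
    salary_group        VARCHAR(20),
    experience_level    VARCHAR(20),
    work_type           VARCHAR(20)
);
"""

with engine.begin() as conn:  # begin() opens a transaction and commits on success
    conn.execute(text(ddl))
```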
Step 4: Data Cleaning and Transformation
- Environment: Azure Data Studio
- Tools: Python, Pandas
- What I Did:
- Used pandas to read the synthesized CSV into a dataframe within an Azure Data Studio notebook
- Cleaned the data:
- Removed underscores from column names
- Renamed certain columns to match the schema
- Converted columns to the data types defined in the schema
- Removed null and duplicate values
- Enriched the dataset by creating calculated fields for further insights in the visualization step: salary_group, experience_level, work_type (see the sketch below)
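A minimal sketch of the cleaning and enrichment logic; the bin edges, labels, and the mapping from the assumed remote_ratio field to work_type are all illustrative choices:

```python
import pandas as pd

df = pd.read_csv("ai_job_survey.csv")

# Basic cleaning: drop nulls and exact duplicates, enforce schema-matching dtypes
df = df.dropna().drop_duplicates()
df["salary"] = df["salary"].astype(int)
df["ai_replacement_risk"] = df["ai_replacement_risk"].astype(int)

# Calculated fields -- bin edges and labels are illustrative assumptions
df["salary_group"] = pd.cut(
    df["salary"],
    bins=[0, 50_000, 100_000, 150_000, float("inf")],
    labels=["<50k", "50k-100k", "100k-150k", "150k+"],
)
df["experience_level"] = pd.cut(
    df["years_experience"],
    bins=[-1, 2, 7, 15, 100],
    labels=["Entry", "Mid", "Senior", "Executive"],
)
df["work_type"] = df["remote_ratio"].map({0: "On-site", 50: "Hybrid", 100: "Remote"})

# Persist the cleaned dataset for the Load step
df.to_csv("ai_job_survey_clean.csv", index=False)
```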
LOAD
Step 5: Load to Azure PostgreSQL Flexible Server in the Cloud
- Environment: Azure Data Studio
- Tools: Python, SQLAlchemy, Pandas, SQL
- What I Did:
- Used Python and the create_engine function from the SQLAlchemy library to establish a connection between my local Mac environment (in Azure Data Studio) and the Azure-hosted PostgreSQL database
- Used the pandas library to upload the dataframe containing the synthesized, now-transformed dataset (originally from the CSV) to the PostgreSQL database hosted on Azure
- Verified successful table population and data integrity by querying the cloud-hosted data with SQL in Azure Data Studio (see the sketch below)
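A hedged sketch of the load and verification steps, reusing the placeholder connection string and the assumed schema/table names from the earlier sketches:

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Placeholder connection string -- substitute real server values
engine = create_engine(
    "postgresql+psycopg2://<admin-user>:<password>@"
    "<server-name>.postgres.database.azure.com:5432/postgres"
)

# Cleaned output of the Transform step (filename assumed)
df = pd.read_csv("ai_job_survey_clean.csv")

# Append into the pre-created target table
df.to_sql("responses", engine, schema="ai_survey", if_exists="append", index=False)

# Spot-check that the rows arrived intact
with engine.connect() as conn:
    count = conn.execute(text("SELECT COUNT(*) FROM ai_survey.responses")).scalar()
    print(f"Rows loaded: {count}")
```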
Step 6: Visualizations and Insights
- Environment: Tableau Public
- What I Did:
- Connected Tableau to the Azure-hosted PostgreSQL server
- Published a fully interactive, cohesive dashboard on Tableau Public focusing on AI replacement exposure across jobs and industries
- Made up of 7 visualizations, including 2 KPI-tracking metrics
- Features bar charts, heat maps, and calculated (KPI) tiles
- Includes filters for work-related dimensions (experience level, industry, city, etc.)
Languages and Libraries
- SQL: Schema and table creation, querying, aggregating
- Python: End-to-end scripting (generation, transformation, loading)
- Pandas: Python library for data manipulation and cleaning
- Faker: Python library for synthetic data generation
- SQLAlchemy: Python library used to connect ADS to the Azure-hosted database
- psycopg2: Database driver for PostgreSQL
Database and Cloud
- Azure PostgreSQL Flexible Server: Cloud-hosted database for the transformed dataset
- Azure Portal: Provisioning, credentials, and firewall configuration
Development and Analysis
- Jupyter Notebook: Synthetic data generation and export
- Azure Data Studio: Database management, transformation notebooks, and SQL querying
- Tableau Public: Interactive dashboard and insights