AI Job Risk Pipeline Project

SQL | Python | Pandas | Faker | SQLAlchemy | psycopg2 | Azure PostgreSQL | Jupyter Notebook | Azure Data Studio | Tableau Public

Project Overview

This project analyzes how artificial intelligence could impact job security across different jobs and industries using an end-to-end ETL (Extract, Transform, Load) pipeline, capped off with an interactive Tableau dashboard that showcases the insights and results. The pipeline ingests synthetic job market data that simulates a group of survey respondents, then analyzes their responses to explore how common AI tools such as ChatGPT and Claude might affect different professions, industries, and experience levels. The project has two main goals: to gain experience building data pipelines with modern data engineering tools, and to build a pipeline that can identify which job roles, industries, and experience levels are most at risk of AI replacement and generate real, useful insights.

Pipeline Flow (ETL)

EXTRACT

Step 1: Fake Data Generation and CSV Export

  • Environment: Jupyter Notebook
  • Tools: Python, Faker, Pandas
  • What I Did:
    • Used Python and the Faker library to generate 1,000 synthetic survey entries with fields such as name, job title, industry, salary, location, and a self-reported rating of how at risk of replacement by AI the respondent felt, on a scale of 1-5
    • Exported the resulting DataFrame to a CSV file, simulating a dataset of survey responses ready for ingestion into the pipeline (a sketch of this step follows the list)
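
A minimal sketch of what this generation step can look like; the field names, category lists, and output file name (ai_job_risk_survey.csv) are illustrative assumptions rather than the project's exact code.

```python
# Sketch: generate 1,000 synthetic survey responses with Faker and export them to CSV.
import random

import pandas as pd
from faker import Faker

fake = Faker()

# Hypothetical category lists; the real project may use different values.
industries = ["Technology", "Healthcare", "Finance", "Education", "Retail"]
job_titles = ["Data Analyst", "Software Engineer", "Nurse", "Teacher", "Accountant"]

rows = []
for _ in range(1000):
    rows.append({
        "name": fake.name(),
        "job_title": random.choice(job_titles),
        "industry": random.choice(industries),
        "salary": random.randint(30_000, 200_000),
        "location": fake.city(),
        "years_experience": random.randint(0, 40),
        "ai_risk": random.randint(1, 5),  # self-reported risk of replacement by AI, 1-5
    })

df = pd.DataFrame(rows)
df.to_csv("ai_job_risk_survey.csv", index=False)  # simulated survey dataset, ready for ingestion
```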

Step 2: Configure Azure PostgreSQL Flexible Server & Connect to Azure Data Studio

  • Environments: Azure Portal, Azure Data Studio
  • What I Did:
    • Provisioned an Azure PostgreSQL Flexible Server with a region, compute tier, and storage size appropriate for a free Azure account
    • Set the administrator credentials and enabled public access, adding a firewall rule to allow traffic from my IP address
    • Installed the PostgreSQL extension in Azure Data Studio and connected to the Azure PostgreSQL Flexible Server using the database connection parameters, allowing me to manage the database directly in ADS (a connectivity check is sketched below)
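
For reference, a quick connectivity check from Python with psycopg2 might look like the sketch below; the host, database, user, and password values are placeholders, and connections to Azure Flexible Server require SSL.

```python
# Sketch: verify connectivity to the Azure PostgreSQL Flexible Server with psycopg2.
# Host, database, user, and password values are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="<server-name>.postgres.database.azure.com",
    port=5432,
    dbname="postgres",
    user="<admin-user>",
    password="<admin-password>",
    sslmode="require",  # Azure Flexible Server requires SSL
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])
conn.close()
```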

TRANSFORM

Step 3: Schema and Table Creation

  • Environment: Azure Data Studio
  • Tools: SQL
  • What I Did:
    • Created a schema and defined a target table based on the structure of the CSV containing the synthesized data (a DDL sketch follows the list)
      • Used appropriate column data types (VARCHAR, INT, etc.) and added constraints
      • Ensured the table was ready to accept clean, transformed data
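
A minimal sketch of the kind of DDL this step involves; the schema, table, and column names are assumptions based on the fields described in Step 1, not the project's exact definitions.

```sql
-- Sketch: schema and target table for the survey data (names are placeholders).
CREATE SCHEMA IF NOT EXISTS ai_job_risk;

CREATE TABLE IF NOT EXISTS ai_job_risk.survey_responses (
    id               SERIAL PRIMARY KEY,
    name             VARCHAR(100),
    job_title        VARCHAR(100) NOT NULL,
    industry         VARCHAR(100) NOT NULL,
    salary           INT CHECK (salary >= 0),
    location         VARCHAR(100),
    years_experience INT,
    ai_risk_rating   INT CHECK (ai_risk_rating BETWEEN 1 AND 5),
    salary_group     VARCHAR(50),
    experience_level VARCHAR(50),
    work_type        VARCHAR(50)
);
```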

Step 4: Data Cleaning and Transformation

  • Environment: Azure Data Studio
  • Tools: Python, Pandas
  • What I Did:
    • Used pandas to read the CSV of synthesized data into a DataFrame (working inside Azure Data Studio)
    • Cleaned the data:
      • Removed underscores from column names
      • Renamed certain columns to match schema
      • Converted column data types to correct data types matching schema
      • Removed null or duplicate values
    • Enriched the dataset by creating calculated fields for further insights in the visualization step (a cleaning and enrichment sketch follows the list)
      • salary_group, experience_level, work_type
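
A condensed sketch of what the cleaning and enrichment could look like; the column names, renames, bin edges, and labels are illustrative assumptions, and the work_type derivation is omitted since it isn't described above.

```python
# Sketch: clean the synthetic CSV and add calculated fields with pandas.
import pandas as pd

df = pd.read_csv("ai_job_risk_survey.csv")  # hypothetical filename from Step 1

# Clean: tidy column names and align them with the database schema
# (the project's actual renames differ; this rename is a placeholder).
df.columns = df.columns.str.strip().str.lower()
df = df.rename(columns={"ai_risk": "ai_risk_rating"})
df["salary"] = df["salary"].astype(int)
df = df.dropna().drop_duplicates()

# Enrich: calculated fields used later in the Tableau dashboard
# (bin edges and labels are illustrative).
df["salary_group"] = pd.cut(
    df["salary"],
    bins=[0, 50_000, 100_000, 150_000, float("inf")],
    labels=["<50K", "50K-100K", "100K-150K", "150K+"],
)
df["experience_level"] = pd.cut(
    df["years_experience"],
    bins=[-1, 2, 7, 15, float("inf")],
    labels=["Entry", "Mid", "Senior", "Veteran"],
)

df.to_csv("ai_job_risk_survey_clean.csv", index=False)  # cleaned dataset for the load step
```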

LOAD

Step 5: Load to Azure PostgreSQL Flexible Server in the Cloud

  • Environment: Azure Data Studio
  • Tools: Python, SQLAlchemy, Pandas, SQL
  • What I Did:
    • Used Python and SQLAlchemy's create_engine function to establish a connection from my local Mac environment (working in Azure Data Studio) to the Azure-hosted PostgreSQL database
    • Used pandas to upload the DataFrame containing the synthesized, now-transformed dataset (originally from the CSV) to the PostgreSQL database hosted on Azure (a load sketch follows the list)
      • Verified successful table population and data integrity by querying the cloud-hosted data with SQL in Azure Data Studio
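
A minimal sketch of the load and verification step, assuming placeholder connection details and the hypothetical schema and table names from the Step 3 sketch.

```python
# Sketch: load the transformed DataFrame into the Azure-hosted PostgreSQL database
# and verify the load. Connection details and object names are placeholders.
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql+psycopg2://<admin-user>:<password>"
    "@<server-name>.postgres.database.azure.com:5432/postgres?sslmode=require"
)

df = pd.read_csv("ai_job_risk_survey_clean.csv")  # cleaned dataset from Step 4

df.to_sql(
    "survey_responses",
    engine,
    schema="ai_job_risk",
    if_exists="append",
    index=False,
)

# Verify table population by querying the cloud-hosted table.
with engine.connect() as conn:
    row_count = conn.execute(
        text("SELECT COUNT(*) FROM ai_job_risk.survey_responses")
    ).scalar()
    print(f"Rows loaded: {row_count}")
```

Keeping the load in pandas via to_sql means the cleaned DataFrame moves straight from the transformation step into the cloud database without an extra export/import cycle.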

Step 6: Visualizations and Insights

  • Environment: Tableau Public
  • What I Did:
    • Connected Tableau to the Azure PostgreSQL server
    • Published a fully interactive and cohesive dashboard on Tableau Public focusing on AI replacement exposure across jobs and industries
      • Made up of 7 visualizations including 2 KPI-tracking metrics
      • Features bar charts, heat maps, and calculated (KPI) tiles
      • Interactive filters for work-related attributes (experience level, industry, city, etc.)

Tech Stack

  • Languages and Libraries

    • SQL: Schema and table creation, querying, and aggregation
    • Python: End-to-end scripting (generation, transformation, loading)
    • Pandas: Python library for data manipulation and cleaning
    • Faker: Python library for synthetic data generation
    • SQLAlchemy: Python library used to connect ADS to Azure-hosted database
    • psycopg2: Database driver for PostgreSQL
  • Database and Cloud

    • Azure PostgreSQL Flexible Server: Cloud-based relational database service used to host all transformed job data
    • Azure Data Studio: Database management with SQL
  • Development and Analysis

    • Jupyter Notebook: Used for Python synthetic data generation
    • Azure Data Studio: For SQL development, schema creation, and Python-based transformation and loading
    • Tableau Public: Visualization and dashboarding
    • ChatGPT: Assisted with hands-on learning, project guidance, and code review
