Automated serverless ETL pipeline built on AWS. In this project, CSV data ingested into S3 is transformed by an AWS Lambda function that triggers an AWS Glue Workflow containing Crawlers and Spark ETL jobs.
This project implements a robust and automated Extract, Transform, Load (ETL) pipeline on AWS. It is designed to process new CSV files uploaded to an S3 bucket, transform them into a standardised Parquet format with selected columns, and then load the refined data into an Amazon Redshift data warehouse for analytical querying. The entire process is event-driven, ensuring efficient and timely data processing.
The pipeline leverages several AWS services orchestrated into a seamless workflow:
- S3 Event Trigger: An S3 bucket (`sales-data-etl-project`) is configured to emit an event via EventBridge whenever a new CSV file is uploaded to its `input/` prefix.
- AWS Lambda Activation: This S3 event triggers an AWS Lambda function (`start-workflow-function`); a minimal handler sketch follows this list.
- Glue Workflow Initiation: The Lambda function starts an AWS Glue Workflow (`sales-data-workflow`).
- Glue Workflow Steps:
  - Crawler 1 (Input CSV): A Glue Crawler (`csv-data-reader`) crawls the `input/` folder of the S3 bucket to discover the schema of the newly added CSV file.
  - Glue Job 1 (CSV to Parquet): A Glue ETL Job (`csv-to-parquet-job`) reads the CSV data, checks for null values, selects specific columns, and transforms the data into Parquet format. The processed data is then saved into the `output/` folder of the same S3 bucket (`sales-data-etl-project`).
  - Crawler 2 (Output Parquet): Another Glue Crawler (`parquet-data-reader`) crawls the `output/` folder to infer the schema of the newly generated Parquet files.
  - Glue Job 2 (Parquet to Redshift): A second Glue ETL Job (`redshift-loader-job`) reads the Parquet data from the `output/` folder and loads it into a specified table in your Amazon Redshift cluster (`sales-data-workgroup`).
- SNS Notifications: Amazon SNS is integrated to provide notifications for the Lambda function execution status, as well as the completion and failure states of both Glue Crawlers and Glue Jobs within the workflow.
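The following is a minimal sketch of what `start-workflow-function` could look like, assuming the workflow name is supplied through a `GLUE_WORKFLOW_NAME` environment variable (the variable name is an assumption, not something defined by this project):

```python
import os
import boto3

glue = boto3.client("glue")


def lambda_handler(event, context):
    """Start the Glue workflow when EventBridge forwards the S3 'Object Created' event."""
    # GLUE_WORKFLOW_NAME is a hypothetical environment variable; adjust to your setup.
    workflow_name = os.environ.get("GLUE_WORKFLOW_NAME", "sales-data-workflow")

    # Log which object triggered the run (EventBridge S3 event detail).
    detail = event.get("detail", {})
    print(f"Triggered by s3://{detail.get('bucket', {}).get('name')}/{detail.get('object', {}).get('key')}")

    # Kick off the Glue workflow; the crawlers and jobs then run according to the workflow's triggers.
    run = glue.start_workflow_run(Name=workflow_name)
    return {"workflowRunId": run["RunId"]}
```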
- Amazon S3: Input and output data storage.
- EventBridge: Rules to invoke the Lambda function and publish to SNS topics.
- AWS Lambda: Event-driven function to trigger the Glue workflow.
- AWS Glue:
- Crawlers: Schema discovery for CSV and Parquet.
- Jobs: Data transformation (CSV to Parquet) and loading (Parquet to Redshift).
- Workflow: Orchestration of crawlers and jobs.
- Amazon Redshift: Data warehouse for final analytical data.
- Amazon SNS: Notification service for operational alerts.
- AWS IAM: For managing permissions between services.
- Amazon CloudWatch: Logging and monitoring.
Before deploying this project, ensure you have the following:
- An AWS Account with administrative access or appropriate IAM permissions.
- Basic understanding of AWS S3, EventBridge, VPC endpoints, SNS, IAM, CloudWatch, Lambda, Glue, and Redshift.
- Create S3 Buckets:
  - Project bucket: `sales-data-etl-project`
  - Input prefix: `sales-data-etl-project/input/`
  - Output prefix: `sales-data-etl-project/output/`
  - Glue scripts bucket: `aws-glue-assets/scripts/`
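A boto3 sketch for this step, assuming the `us-east-1` region (bucket names are globally unique, so yours may need a suffix):

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")  # region is an assumption

# Create the project bucket.
s3.create_bucket(Bucket="sales-data-etl-project")

# S3 has no real folders; empty objects ending in "/" act as prefix placeholders in the console.
for prefix in ("input/", "output/"):
    s3.put_object(Bucket="sales-data-etl-project", Key=prefix)
```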
- Create IAM Roles:
  - Role for Lambda (with Glue and CloudWatch permissions).
  - Role for the Crawlers (with S3 read/write, Glue, and CloudWatch permissions).
  - Role for Job 1 (with S3 read/write, CloudWatch, and specific Glue permissions).
  - Role for Job 2 (with S3 read/write, CloudWatch, Secrets Manager, KMS, VPC, and specific Glue permissions).
  - Role for Redshift (with S3 read/write and KMS permissions).
  - Role for the EventBridge rules (with Lambda and SNS permissions).
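As one example, the Lambda execution role could be created along these lines. The role name and the attached managed policies are assumptions chosen for brevity; scope them down to least privilege for a real deployment:

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy letting the Lambda service assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="Role_for_lambda",  # hypothetical role name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Broad managed policies for illustration; prefer scoped inline policies in production.
for policy_arn in (
    "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole",
    "arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess",
):
    iam.attach_role_policy(RoleName="Role_for_lambda", PolicyArn=policy_arn)
```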
- Configure Lambda Function:
  - Create the Lambda function (`start-workflow-function`).
  - Link it to the S3 EventBridge rule `start-lambda-rule` (input bucket `input/` prefix).
  - Set the runtime and handler, and assign the Lambda IAM role.
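A sketch of creating the function with boto3, assuming the handler shown in the overview is packaged locally as `lambda.zip` in a file named `lambda_function.py`; the runtime, role ARN, and file names are assumptions:

```python
import boto3

lam = boto3.client("lambda")

with open("lambda.zip", "rb") as f:  # zip containing the handler sketched earlier
    code_bytes = f.read()

lam.create_function(
    FunctionName="start-workflow-function",
    Runtime="python3.12",                                     # assumed runtime
    Role="arn:aws:iam::123456789012:role/Role_for_lambda",    # placeholder ARN
    Handler="lambda_function.lambda_handler",
    Code={"ZipFile": code_bytes},
    Environment={"Variables": {"GLUE_WORKFLOW_NAME": "sales-data-workflow"}},
)
```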
- Create Glue Crawlers:
  - Input CSV Crawler:
    - Create the crawler (`csv-data-reader`).
    - Data source: `s3://sales-data-etl-project/input/`. Assign the Crawler IAM role (`Role_for_crawler`).
    - Output and scheduling: target database (`db-01`), on-demand scheduling.
  - Output Parquet Crawler:
    - Create the crawler (`parquet-data-reader`).
    - Data source: `s3://sales-data-etl-project/output/`. Assign the Crawler IAM role (`Role_for_crawler`).
    - Output and scheduling: target database (`db-01`).
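A boto3 sketch for the input crawler (the role ARN is a placeholder; the output crawler is created the same way with the `output/` path):

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="csv-data-reader",
    Role="arn:aws:iam::123456789012:role/Role_for_crawler",  # placeholder ARN
    DatabaseName="db-01",
    Targets={"S3Targets": [{"Path": "s3://sales-data-etl-project/input/"}]},
    # No Schedule argument: the crawler runs on demand / via the workflow trigger.
)
```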
- Create Glue Jobs:
  - CSV to Parquet Job:
    - Create the job (`csv-to-parquet-job`). Assign the Glue IAM role. Configure arguments for the input/output paths.
  - Parquet to Redshift Job:
    - Create the job (`redshift-loader-job`). Assign the Glue IAM role. Configure arguments for the Redshift connection and table name.
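A minimal PySpark sketch of what `csv-to-parquet-job` might contain. The column names passed to `SelectFields` and the argument names `INPUT_PATH`/`OUTPUT_PATH` are placeholders, and the null-value check is approximated with `DropNullFields` (which removes columns that are entirely null); the real job may handle these details differently:

```python
import sys
from awsglue.transforms import DropNullFields, SelectFields
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Job arguments; INPUT_PATH/OUTPUT_PATH would be configured on the job (placeholder names).
args = getResolvedOptions(sys.argv, ["JOB_NAME", "INPUT_PATH", "OUTPUT_PATH"])

glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw CSV data from the input prefix.
source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": [args["INPUT_PATH"]]},
    format="csv",
    format_options={"withHeader": True},
)

# Keep only the needed columns (placeholder names) and drop all-null fields.
selected = SelectFields.apply(frame=source, paths=["order_id", "product", "quantity", "price"])
cleaned = DropNullFields.apply(frame=selected)

# Write the result back to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": args["OUTPUT_PATH"]},
    format="parquet",
)

job.commit()
```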
- Create Glue Workflow:
  - Define a workflow (`sales-data-workflow`) that orchestrates the crawlers and jobs in the correct sequence.
  - Include events/triggers between steps (e.g., Crawler 1 finishes -> Job 1 starts).
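A sketch of wiring one conditional trigger inside the workflow with boto3 (the trigger name is made up):

```python
import boto3

glue = boto3.client("glue")

glue.create_workflow(Name="sales-data-workflow")

# Start the CSV-to-Parquet job once the input crawler finishes successfully.
glue.create_trigger(
    Name="after-csv-crawler",          # hypothetical trigger name
    WorkflowName="sales-data-workflow",
    Type="CONDITIONAL",
    StartOnCreation=True,
    Predicate={
        "Conditions": [{
            "LogicalOperator": "EQUALS",
            "CrawlerName": "csv-data-reader",
            "CrawlState": "SUCCEEDED",
        }]
    },
    Actions=[{"JobName": "csv-to-parquet-job"}],
)
```

Similar conditional triggers connect the remaining steps, and the workflow also needs a starting trigger (e.g., an `ON_DEMAND` trigger that starts `csv-data-reader` when the Lambda function calls `start_workflow_run`).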
- Configure SNS Notifications:
  - Create SNS topics for Lambda status and for Glue crawler/job success and failure.
  - Configure Lambda, Glue jobs, and crawlers to publish to these SNS topics.
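A sketch of the topic setup (the topic name and e-mail address are placeholders):

```python
import boto3

sns = boto3.client("sns")

# Topic for pipeline status notifications (name is hypothetical).
topic = sns.create_topic(Name="sales-data-etl-notifications")

# Subscribe an e-mail endpoint; the recipient must confirm the subscription.
sns.subscribe(
    TopicArn=topic["TopicArn"],
    Protocol="email",
    Endpoint="you@example.com",
)
```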
- Create EventBridge Rules:
  - Create the EventBridge rules (`start-lambda-rule`, `crawler-01-rule`, `job-01-rule`, `crawler-02-rule`, `job-02-rule`) with the appropriate event pattern matching and the respective IAM role.
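For example, a rule that forwards `csv-to-parquet-job` state changes to the SNS topic might look like this (the target `Id` and topic ARN are placeholders):

```python
import json
import boto3

events = boto3.client("events")

# Match terminal states of the CSV-to-Parquet Glue job.
pattern = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
    "detail": {
        "jobName": ["csv-to-parquet-job"],
        "state": ["SUCCEEDED", "FAILED", "TIMEOUT"],
    },
}

events.put_rule(Name="job-01-rule", EventPattern=json.dumps(pattern))

# Send matching events to the notification topic (ARN is a placeholder).
events.put_targets(
    Rule="job-01-rule",
    Targets=[{
        "Id": "sns-notify",
        "Arn": "arn:aws:sns:us-east-1:123456789012:sales-data-etl-notifications",
    }],
)
```

For SNS targets, the topic's access policy must also allow `events.amazonaws.com` to publish.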
- Set up Redshift:
  - Ensure your Redshift cluster is running and reachable from Glue via the Glue connection (`SalesDataRedshift_connection`).
  - Create the target table (`sales_data_table`) in Redshift that the `redshift-loader-job` will load data into.
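The table can be created through the Redshift Data API; a sketch assuming a Redshift Serverless workgroup named `sales-data-workgroup` (for a provisioned cluster, use `ClusterIdentifier` instead), with placeholder column names matching the Glue job sketch above and an assumed `dev` database:

```python
import boto3

rsd = boto3.client("redshift-data")

# Column names/types are placeholders; adjust them to your actual CSV schema.
ddl = """
CREATE TABLE IF NOT EXISTS sales_data_table (
    order_id  INTEGER,
    product   VARCHAR(256),
    quantity  INTEGER,
    price     DECIMAL(10, 2)
);
"""

rsd.execute_statement(
    WorkgroupName="sales-data-workgroup",  # Redshift Serverless workgroup from the overview
    Database="dev",                        # database name is an assumption
    Sql=ddl,
)
```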
- Prepare a CSV File: Ensure your CSV file has the expected columns.
- Upload to S3: Upload your CSV file to the S3 input bucket's `input/` prefix: `s3://sales-data-etl-project/input/SampleSalesData.csv`
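For example, with boto3 (the local file name is assumed):

```python
import boto3

s3 = boto3.client("s3")

# Uploading to input/ emits the EventBridge event that starts the whole pipeline.
s3.upload_file(
    Filename="SampleSalesData.csv",
    Bucket="sales-data-etl-project",
    Key="input/SampleSalesData.csv",
)
```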
- Monitor Progress:
  - Check AWS CloudWatch Logs for your Lambda function.
  - Monitor the Glue Workflow execution in the AWS Glue console.
  - Check your subscribed SNS endpoints for notifications on job status.
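Besides the console, the workflow run status can also be polled programmatically; a small sketch:

```python
import boto3

glue = boto3.client("glue")

# List recent runs of the workflow and print their status and action counts.
runs = glue.get_workflow_runs(Name="sales-data-workflow", MaxResults=5)
for run in runs["Runs"]:
    stats = run["Statistics"]
    print(run["WorkflowRunId"], run["Status"],
          f"{stats['SucceededActions']}/{stats['TotalActions']} actions succeeded")
```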
- Verify Data in Redshift: Once the Glue workflow completes successfully, query your Redshift table to confirm the data has been loaded.

```sql
SELECT * FROM sales_data_table LIMIT 10;
```
