Automated serverless ETL pipeline built on AWS. In this project, CSV data ingested into S3 is transformed by an AWS Lambda function that triggers an AWS Glue Workflow containing Crawlers and Spark ETL jobs.
This project implements a robust and automated Extract, Transform, Load (ETL) pipeline on AWS. It is designed to process new CSV files uploaded to an S3 bucket, transform them into a standardised Parquet format with selected columns, and then load the refined data into an Amazon Redshift data warehouse for analytical querying. The entire process is event-driven, ensuring efficient and timely data processing.
The pipeline leverages several AWS services orchestrated into a seamless workflow:
- S3 Event Trigger: An S3 bucket (`sales-data-etl-project`) is configured to emit an event via EventBridge whenever a new CSV file is uploaded to its `input/` prefix.
- AWS Lambda Activation: This S3 event triggers an AWS Lambda function (`start-workflow-function`); a minimal handler sketch follows this list.
- Glue Workflow Initiation: The Lambda function starts an AWS Glue Workflow (`sales-data-workflow`).
- Glue Workflow Steps:
  - Crawler 1 (Input CSV): A Glue Crawler (`csv-data-reader`) crawls the `input/` folder of the S3 bucket to discover the schema of the newly added CSV file.
  - Glue Job 1 (CSV to Parquet): A Glue ETL Job (`csv-to-parquet-job`) reads the CSV data, checks for null values, selects specific columns, and transforms the data into Parquet format. The processed data is then saved into the `output/` folder of the same S3 bucket (`sales-data-etl-project`).
  - Crawler 2 (Output Parquet): Another Glue Crawler (`parquet-data-reader`) crawls the `output/` folder to infer the schema of the newly generated Parquet files.
  - Glue Job 2 (Parquet to Redshift): A second Glue ETL Job (`redshift-loader-job`) reads the Parquet data from the `output/` folder and loads it into a specified table in your Amazon Redshift cluster (`sales-data-workgroup`).
- SNS Notifications: Amazon SNS is integrated to provide notifications for the Lambda function execution status, as well as the completion and failure states of both Glue Crawlers and Glue Jobs within the workflow.
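The following is a minimal sketch of what `start-workflow-function` could look like, assuming the workflow name is supplied through a `GLUE_WORKFLOW_NAME` environment variable (the variable name is an assumption, not something defined by this project):

```python
import os
import boto3

glue = boto3.client("glue")


def lambda_handler(event, context):
    """Start the Glue workflow when EventBridge forwards the S3 'Object Created' event."""
    # GLUE_WORKFLOW_NAME is a hypothetical environment variable; adjust to your setup.
    workflow_name = os.environ.get("GLUE_WORKFLOW_NAME", "sales-data-workflow")

    # Log which object triggered the run (EventBridge S3 event detail).
    detail = event.get("detail", {})
    print(f"Triggered by s3://{detail.get('bucket', {}).get('name')}/{detail.get('object', {}).get('key')}")

    # Kick off the Glue workflow; the crawlers and jobs then run according to the workflow's triggers.
    run = glue.start_workflow_run(Name=workflow_name)
    return {"workflowRunId": run["RunId"]}
```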
- Amazon S3: Input and output data storage.
- EventBridge: Rules to invoke the Lambda function and publish to SNS topics.
- AWS Lambda: Event-driven function to trigger the Glue workflow.
- AWS Glue:
- Crawlers: Schema discovery for CSV and Parquet.
- Jobs: Data transformation (CSV to Parquet) and loading (Parquet to Redshift).
- Workflow: Orchestration of crawlers and jobs.
- Amazon Redshift: Data warehouse for final analytical data.
- Amazon SNS: Notification service for operational alerts.
- AWS IAM: For managing permissions between services.
- Amazon CloudWatch: Logging and monitoring.
Before deploying this project, ensure you have the following:
- An AWS Account with administrative access or appropriate IAM permissions.
- Basic understanding of AWS S3, EventBridge, VPC endpoints, SNS, IAM, CloudWatch, Lambda, Glue, and Redshift.
- Create S3 Buckets:
  - Project bucket: `sales-data-etl-project`
  - Input prefix: `sales-data-etl-project/input/`
  - Output prefix: `sales-data-etl-project/output/`
  - Glue scripts bucket: `aws-glue-assets/scripts/`
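A boto3 sketch for this step, assuming the `us-east-1` region (bucket names are globally unique, so yours may need a suffix):

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")  # region is an assumption

# Create the project bucket.
s3.create_bucket(Bucket="sales-data-etl-project")

# S3 has no real folders; empty objects ending in "/" act as prefix placeholders in the console.
for prefix in ("input/", "output/"):
    s3.put_object(Bucket="sales-data-etl-project", Key=prefix)
```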
- Create IAM Roles:
  - Role for Lambda (with Glue and CloudWatch permissions).
  - Role for the Crawlers (with S3 read/write, Glue, and CloudWatch permissions).
  - Role for Job 1 (with S3 read/write, CloudWatch, and specific Glue permissions).
  - Role for Job 2 (with S3 read/write, CloudWatch, Secrets Manager, KMS, VPC, and specific Glue permissions).
  - Role for Redshift (with S3 read/write and KMS permissions).
  - Role for the EventBridge rules (with Lambda and SNS permissions).
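As one example, the Lambda execution role could be created along these lines. The role name and the attached managed policies are assumptions chosen for brevity; scope them down to least privilege for a real deployment:

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy letting the Lambda service assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="Role_for_lambda",  # hypothetical role name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Broad managed policies for illustration; prefer scoped inline policies in production.
for policy_arn in (
    "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole",
    "arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess",
):
    iam.attach_role_policy(RoleName="Role_for_lambda", PolicyArn=policy_arn)
```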
- Configure Lambda Function:
  - Create the Lambda function (`start-workflow-function`).
  - Link it to the S3 EventBridge rule `start-lambda-rule` (input bucket `input/` prefix).
  - Set the runtime and handler, and assign the Lambda IAM role.
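A sketch of creating the function with boto3, assuming the handler shown in the overview is packaged locally as `lambda.zip` in a file named `lambda_function.py`; the runtime, role ARN, and file names are assumptions:

```python
import boto3

lam = boto3.client("lambda")

with open("lambda.zip", "rb") as f:  # zip containing the handler sketched earlier
    code_bytes = f.read()

lam.create_function(
    FunctionName="start-workflow-function",
    Runtime="python3.12",                                     # assumed runtime
    Role="arn:aws:iam::123456789012:role/Role_for_lambda",    # placeholder ARN
    Handler="lambda_function.lambda_handler",
    Code={"ZipFile": code_bytes},
    Environment={"Variables": {"GLUE_WORKFLOW_NAME": "sales-data-workflow"}},
)
```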
- Create Glue Crawlers:
  - Input CSV Crawler:
    - Create the crawler (`csv-data-reader`).
    - Data source: `s3://sales-data-etl-project/input/`. Assign the Crawler IAM role (`Role_for_crawler`).
    - Output and scheduling: target database (`db-01`), on-demand scheduling.
  - Output Parquet Crawler:
    - Create the crawler (`parquet-data-reader`).
    - Data source: `s3://sales-data-etl-project/output/`. Assign the Crawler IAM role (`Role_for_crawler`).
    - Output and scheduling: target database (`db-01`).
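A boto3 sketch for the input crawler (the role ARN is a placeholder; the output crawler is created the same way with the `output/` path):

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="csv-data-reader",
    Role="arn:aws:iam::123456789012:role/Role_for_crawler",  # placeholder ARN
    DatabaseName="db-01",
    Targets={"S3Targets": [{"Path": "s3://sales-data-etl-project/input/"}]},
    # No Schedule argument: the crawler runs on demand / via the workflow trigger.
)
```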
- Create Glue Jobs:
  - CSV to Parquet Job:
    - Create the job (`csv-to-parquet-job`). Assign the Glue IAM role. Configure arguments for the input/output paths.
  - Parquet to Redshift Job:
    - Create the job (`redshift-loader-job`). Assign the Glue IAM role. Configure arguments for the Redshift connection and table name.
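A minimal PySpark sketch of what `csv-to-parquet-job` might contain. The column names passed to `SelectFields` and the argument names `INPUT_PATH`/`OUTPUT_PATH` are placeholders, and the null-value check is approximated with `DropNullFields` (which removes columns that are entirely null); the real job may handle these details differently:

```python
import sys
from awsglue.transforms import DropNullFields, SelectFields
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Job arguments; INPUT_PATH/OUTPUT_PATH would be configured on the job (placeholder names).
args = getResolvedOptions(sys.argv, ["JOB_NAME", "INPUT_PATH", "OUTPUT_PATH"])

glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw CSV data from the input prefix.
source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": [args["INPUT_PATH"]]},
    format="csv",
    format_options={"withHeader": True},
)

# Keep only the needed columns (placeholder names) and drop all-null fields.
selected = SelectFields.apply(frame=source, paths=["order_id", "product", "quantity", "price"])
cleaned = DropNullFields.apply(frame=selected)

# Write the result back to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": args["OUTPUT_PATH"]},
    format="parquet",
)

job.commit()
```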
- Create Glue Workflow:
  - Define a workflow (`sales-data-workflow`) that orchestrates the crawlers and jobs in the correct sequence.
  - Include events/triggers between steps (e.g., Crawler 1 finishes -> Job 1 starts).
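A sketch of wiring one conditional trigger inside the workflow with boto3 (the trigger name is made up):

```python
import boto3

glue = boto3.client("glue")

glue.create_workflow(Name="sales-data-workflow")

# Start the CSV-to-Parquet job once the input crawler finishes successfully.
glue.create_trigger(
    Name="after-csv-crawler",          # hypothetical trigger name
    WorkflowName="sales-data-workflow",
    Type="CONDITIONAL",
    StartOnCreation=True,
    Predicate={
        "Conditions": [{
            "LogicalOperator": "EQUALS",
            "CrawlerName": "csv-data-reader",
            "CrawlState": "SUCCEEDED",
        }]
    },
    Actions=[{"JobName": "csv-to-parquet-job"}],
)
```

Similar conditional triggers connect the remaining steps, and the workflow also needs a starting trigger (e.g., an `ON_DEMAND` trigger that starts `csv-data-reader` when the Lambda function calls `start_workflow_run`).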
- Configure SNS Notifications:
  - Create SNS topics for Lambda status and for Glue crawler/job success and failure.
  - Configure Lambda, Glue jobs, and crawlers to publish to these SNS topics.
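A sketch of the topic setup (the topic name and e-mail address are placeholders):

```python
import boto3

sns = boto3.client("sns")

# Topic for pipeline status notifications (name is hypothetical).
topic = sns.create_topic(Name="sales-data-etl-notifications")

# Subscribe an e-mail endpoint; the recipient must confirm the subscription.
sns.subscribe(
    TopicArn=topic["TopicArn"],
    Protocol="email",
    Endpoint="you@example.com",
)
```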
- Create EventBridge Rules:
  - Create the EventBridge rules (`start-lambda-rule`, `crawler-01-rule`, `job-01-rule`, `crawler-02-rule`, `job-02-rule`) with the appropriate event pattern matching and the respective IAM role.
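For example, a rule that forwards `csv-to-parquet-job` state changes to the SNS topic might look like this (the target `Id` and topic ARN are placeholders):

```python
import json
import boto3

events = boto3.client("events")

# Match terminal states of the CSV-to-Parquet Glue job.
pattern = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
    "detail": {
        "jobName": ["csv-to-parquet-job"],
        "state": ["SUCCEEDED", "FAILED", "TIMEOUT"],
    },
}

events.put_rule(Name="job-01-rule", EventPattern=json.dumps(pattern))

# Send matching events to the notification topic (ARN is a placeholder).
events.put_targets(
    Rule="job-01-rule",
    Targets=[{
        "Id": "sns-notify",
        "Arn": "arn:aws:sns:us-east-1:123456789012:sales-data-etl-notifications",
    }],
)
```

For SNS targets, the topic's access policy must also allow `events.amazonaws.com` to publish.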
- Set up Redshift:
  - Ensure your Redshift cluster is running and reachable from Glue via the Glue connection (`SalesDataRedshift_connection`).
  - Create the target table (`sales_data_table`) in Redshift that the `redshift-loader-job` will load data into.
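The table can be created through the Redshift Data API; a sketch assuming a Redshift Serverless workgroup named `sales-data-workgroup` (for a provisioned cluster, use `ClusterIdentifier` instead), with placeholder column names matching the Glue job sketch above and an assumed `dev` database:

```python
import boto3

rsd = boto3.client("redshift-data")

# Column names/types are placeholders; adjust them to your actual CSV schema.
ddl = """
CREATE TABLE IF NOT EXISTS sales_data_table (
    order_id  INTEGER,
    product   VARCHAR(256),
    quantity  INTEGER,
    price     DECIMAL(10, 2)
);
"""

rsd.execute_statement(
    WorkgroupName="sales-data-workgroup",  # Redshift Serverless workgroup from the overview
    Database="dev",                        # database name is an assumption
    Sql=ddl,
)
```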
- Prepare a CSV File: Ensure your CSV file has the expected columns.
- Upload to S3: Upload your CSV file to the S3 input bucket's `input/` prefix: `s3://sales-data-etl-project/input/SampleSalesData.csv`
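For example, with boto3 (the local file name is assumed):

```python
import boto3

s3 = boto3.client("s3")

# Uploading to input/ emits the EventBridge event that starts the whole pipeline.
s3.upload_file(
    Filename="SampleSalesData.csv",
    Bucket="sales-data-etl-project",
    Key="input/SampleSalesData.csv",
)
```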
- Monitor Progress:
  - Check AWS CloudWatch Logs for your Lambda function.
  - Monitor the Glue Workflow execution in the AWS Glue console.
  - Check your subscribed SNS endpoints for notifications on job status.
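Besides the console, the workflow run status can also be polled programmatically; a small sketch:

```python
import boto3

glue = boto3.client("glue")

# List recent runs of the workflow and print their status and action counts.
runs = glue.get_workflow_runs(Name="sales-data-workflow", MaxResults=5)
for run in runs["Runs"]:
    stats = run["Statistics"]
    print(run["WorkflowRunId"], run["Status"],
          f"{stats['SucceededActions']}/{stats['TotalActions']} actions succeeded")
```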
- Verify Data in Redshift: Once the Glue workflow completes successfully, query your Redshift table to confirm the data has been loaded.

```sql
SELECT * FROM sales_data_table LIMIT 10;
```
