Sentiment Analysis: End-to-End ML Pipeline with Cloud-Native Deployment

A full-stack, production-grade machine learning project for sentiment analysis, featuring robust data pipelines, experiment tracking, cloud storage, containerized deployment, and full-stack monitoring.

Project Overview

A scalable, cloud-native ML pipeline for sentiment analysis, with:

Automated data versioning and pipelines (DVC)
Experiment tracking and model registry (MLflow + DagsHub)
Cloud storage (AWS S3)
Containerized web app (Flask + Docker)
Orchestrated deployment (Kubernetes on AWS EKS)
Monitoring and dashboards (Prometheus + Grafana)
CI/CD automation (GitHub Actions)

Architecture & Flow

Data Pipeline: DVC orchestrates ingestion, preprocessing, feature engineering, model training, evaluation, and registration.
Cloud Storage: Data and models are versioned and synced to AWS S3.
Experiment Tracking: MLflow (hosted on DagsHub) logs parameters, metrics, and artifacts.
Model Registry: MLflow Model Registry (via DagsHub) manages model lifecycle.
Web App: Flask app for real-time predictions, containerized with Docker.
CI/CD: GitHub Actions automates testing, Docker build, and deployment.
Deployment: Docker image pushed to AWS ECR, deployed to AWS EKS (Kubernetes).
Monitoring: Prometheus scrapes metrics from Flask app; Grafana visualizes them.

Key Technologies

Python, Flask, scikit-learn, pandas, nltk
DVC for pipeline/data versioning
MLflow + DagsHub for experiment tracking/model registry
AWS S3 for data/model storage
Docker for containerization
Kubernetes (EKS) for orchestration
Prometheus for metrics scraping
Grafana for dashboarding
GitHub Actions for CI/CD

Architecture Flowchart

flowchart TD
    subgraph Data_and_ML_Pipeline
        A["Data Ingestion (DVC, src/data/data_ingestion.py)"] --> B["Data Preprocessing (DVC, src/data/data_preprocessing.py)"]
        B --> C["Feature Engineering (DVC, src/features/feature_engineering.py)"]
        C --> D["Model Training (DVC, src/model/model_building.py)"]
        D --> E["Model Evaluation (DVC, src/model/model_evaluation.py)"]
        E --> F["Model Registration (MLflow/DagsHub, src/model/register_model.py)"]
        F --> G["Model Promotion (MLflow/DagsHub, scripts/promote_model.py)"]
    end
    G --> H["Docker Build & Push (GitHub Actions, Docker, ECR)"]
    H --> I["Kubernetes Deploy (EKS, deployment.yaml)"]
    I --> J["Flask Web App (/predict, /metrics)"]
    J --> K["Prometheus (Scrapes /metrics)"]
    K --> L["Grafana (Dashboards)"]
    J --> M["User Input (Web UI)"]
    M --> N["Prediction (MLflow Model)"]
    N --> O["Result Displayed"]
    A -.-> P["S3 Data Sync (Makefile, boto3)"]
    D -.-> Q["Experiment Tracking (MLflow/DagsHub)"]
    F -.-> R["Model Registry (MLflow/DagsHub)"]

Project Structure

Sentiment-Analysis/
├── flask_app/                # Flask web app for sentiment prediction
│   ├── app.py                # Main Flask application
│   ├── load_model_test.py    # Model loading test script
│   ├── preprocessing_utility.py # Text preprocessing utilities
│   ├── requirements.txt      # Flask app dependencies
│   └── templates/
│       └── index.html        # Web UI template
│
├── src/                      # Core Python package
│   ├── connections/          # Cloud and DB connection utilities
│   │   ├── s3_connection.py  # AWS S3 integration
│   │   ├── ssms_connection.py# SQL Server integration
│   │   └── config.json
│   ├── data/                 # Data pipeline scripts
│   ├── features/             # Feature engineering
│   ├── logger/               # Logging utilities
│   ├── model/                # Model training, evaluation, registration
│   └── visualization/        # Visualization scripts
│
├── notebooks/                # Jupyter notebooks and experiments
├── models/                   # Trained models and vectorizers
├── scripts/                  # Utility scripts (e.g., promote_model.py)
├── tests/                    # Unit and integration tests
├── docs/                     # Sphinx documentation
├── deployment.yaml           # Kubernetes deployment/service
├── dvc.yaml, dvc.lock        # DVC pipeline files
├── requirements.txt          # Project-wide dependencies
├── Dockerfile                # Docker build for Flask app
├── Makefile                  # Automation commands (S3 sync, etc.)
├── .dvc/config               # DVC remote config (S3)
├── .github/workflows/        # GitHub Actions CI/CD
└── README.md                 # (You are here!)

Setup & Installation

1. Clone the Repository

git clone <your-repo-url>
cd Sentiment-Analysis

2. Install Dependencies

For the full project:
```
pip install -r requirements.txt
```

For the Flask app only:

cd flask_app
pip install -r requirements.txt

3. Configure Cloud & Tracking

Set AWS credentials (for S3, ECR, EKS):
- AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION
Set DagsHub/MLflow token:
- CAPSTONE_TEST

DVC remote setup (S3):

dvc remote add -d myremote s3://<your-bucket>
dvc remote modify myremote access_key_id <AWS_ACCESS_KEY_ID>
dvc remote modify myremote secret_access_key <AWS_SECRET_ACCESS_KEY>

(Optional) Use a .env file for local secrets.

Data & Model Pipeline

DVC: Run the full pipeline with dvc repro.
S3: Data and models are versioned and synced (see Makefile for sync commands).
MLflow/DagsHub: All experiments, metrics, and models are logged and tracked.
Model Registry: Models are promoted and managed via MLflow Model Registry (DagsHub UI).

Web Application

Flask app: Real-time predictions, Prometheus metrics at /metrics.
Endpoints:
- / : Home page with input form.
- /predict : Handles prediction requests.
- /metrics : Exposes Prometheus metrics for scraping.
Docker:
- Build: docker build -t flask-app .
- Run: docker run -p 5000:5000 flask-app

Deployment

Docker: Image built and pushed to AWS ECR (see GitHub Actions workflow).
Kubernetes (EKS):
- Deploy with: kubectl apply -f deployment.yaml
- Service: LoadBalancer exposes the app on port 5000.
- Secrets: K8s secrets for tokens/keys (see workflow and deployment.yaml).
CI/CD: GitHub Actions automates build, test, push, and deploy.

Monitoring & Observability

Prometheus: Scrapes /metrics endpoint from Flask app.
Grafana: Visualizes metrics (set up Prometheus as a data source, import dashboards).

Sample Prometheus scrape config:

scrape_configs:
  - job_name: 'flask-app'
    static_configs:
      - targets: ['<flask-app-service-ip>:5000']

Grafana: Import Prometheus as a data source, create dashboards for app and model metrics.

CI/CD

GitHub Actions:
- Runs tests, builds Docker image, pushes to ECR, deploys to EKS.
- See .github/workflows/ci.yaml for full pipeline.

Testing

Unit/integration tests: pytest tests/
CI: Automated in GitHub Actions.

Documentation

Sphinx: Build with make html in docs/.
See docs/getting-started.rst for setup instructions.

Contributing & License

Fork the repo, create a branch, add/commit changes, ensure all tests pass, and submit a PR.
Licensed under the MIT License.

Appendix

Environment Variables

CAPSTONE_TEST (DagsHub/MLflow token)
AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION (for S3, ECR, EKS)
(Optional) SQL Server credentials for data ingestion

Prometheus & Grafana Setup (Local Dev)

Use Docker Compose or manual setup to run Prometheus and Grafana locally.
Example Prometheus config provided above.
Grafana: Add Prometheus as a data source, import dashboards (JSON available in docs/ or provide your own).

Troubleshooting

DVC S3 issues: Check AWS credentials and bucket permissions.
Kubernetes deployment: Ensure ECR image is public or node has pull access.
Prometheus/Grafana: Ensure network access to Flask app /metrics endpoint.

For any questions or issues, please open an issue or discussion in the repository.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Sentiment Analysis: End-to-End ML Pipeline with Cloud-Native Deployment

Table of Contents

Project Overview

Architecture & Flow

Key Technologies

Architecture Flowchart

Project Structure

Setup & Installation

1. Clone the Repository

2. Install Dependencies

3. Configure Cloud & Tracking

Data & Model Pipeline

Web Application

Deployment

Monitoring & Observability

CI/CD

Testing

Documentation

Contributing & License

Appendix

Environment Variables

Prometheus & Grafana Setup (Local Dev)

Troubleshooting

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
.dvc		.dvc
.github/workflows		.github/workflows
docs		docs
flask_app		flask_app
mlruns/0		mlruns/0
models		models
notebooks		notebooks
references		references
reports		reports
scripts		scripts
src		src
tests		tests
.dvcignore		.dvcignore
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
deployment.yaml		deployment.yaml
dvc.lock		dvc.lock
dvc.yaml		dvc.yaml
params.yaml		params.yaml
requirements.txt		requirements.txt
setup.py		setup.py
test_environment.py		test_environment.py
tox.ini		tox.ini

License

extremecoder-rgb/Sentiment-Analysis-End-to-End-data-pipeline

Folders and files

Latest commit

History

Repository files navigation

Sentiment Analysis: End-to-End ML Pipeline with Cloud-Native Deployment

Table of Contents

Project Overview

Architecture & Flow

Key Technologies

Architecture Flowchart

Project Structure

Setup & Installation

1. Clone the Repository

2. Install Dependencies

3. Configure Cloud & Tracking

Data & Model Pipeline

Web Application

Deployment

Monitoring & Observability

CI/CD

Testing

Documentation

Contributing & License

Appendix

Environment Variables

Prometheus & Grafana Setup (Local Dev)

Troubleshooting

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages