Predicting Diabetes Risk Using Machine Learning
This project analyzes patient health metrics to predict diabetes risk using a modular, production-style machine learning pipeline.
The workflow includes:
- Data preprocessing and cleaning
- Exploratory data analysis (EDA)
- Model training using a recall-optimized Random Forest
- Standardized model evaluation
- Optional command-line prediction generation
The objective is to demonstrate an analytics engineering approach that is reproducible, interpretable, and clinically meaningful, especially for use cases where reducing false negatives matters.
Source: Public diabetes dataset inspired by the Pima Indians dataset
Description: Includes features such as glucose, BMI, blood pressure, insulin, age, diabetes pedigree function, and an outcome label indicating diabetes status.
- Explore patient health metrics and identify meaningful predictors.
- Build an ML classifier optimized for high recall (minimizing false negatives).
- Use a modular, production-inspired project structure.
- Provide clear evaluation metrics and interpretable insights.
- Develop a scalable baseline for future healthcare analytics workflows.
- Converted non-physiological zero values to missing values
- Applied median imputation across relevant fields
- Centralized all cleaning logic in `src/data/clean_data.py`
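The two cleaning steps above can be sketched as follows. Column names follow the standard Pima-style schema and the function name is illustrative; the project's actual logic lives in `src/data/clean_data.py`:

```python
import numpy as np
import pandas as pd

# Columns where a literal 0 is non-physiological (assumed Pima-style names)
ZERO_INVALID = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Convert non-physiological zeros to NaN, then median-impute each field."""
    df = df.copy()
    df[ZERO_INVALID] = df[ZERO_INVALID].replace(0, np.nan)
    df[ZERO_INVALID] = df[ZERO_INVALID].fillna(df[ZERO_INVALID].median())
    return df
```

Replacing zeros with `NaN` before computing the medians matters: it keeps the invalid values from dragging the medians down.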
- Investigated variable distributions, correlations, and feature interactions
- Identified glucose, BMI, age, and pedigree function as key predictors
- Restricted notebooks to exploratory purposes only; production logic is modular
- Used a `RandomForestClassifier` with class weighting to address imbalance
- Performed hyperparameter tuning via GridSearchCV (scoring = recall)
- Implemented training and evaluation pipelines in `src/models/`
- Saved the final trained model to the `models/` directory
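A minimal sketch of this training setup, with synthetic data standing in for the cleaned features and an illustrative hyperparameter grid (not the project's actual grid):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for the cleaned diabetes features
X, y = make_classification(n_samples=300, weights=[0.7, 0.3], random_state=42)

param_grid = {"n_estimators": [100, 200], "max_depth": [4, 8]}  # illustrative values
search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="recall",  # select hyperparameters that catch the most positive cases
    cv=3,
)
search.fit(X, y)
```

Setting `scoring="recall"` makes the grid search rank candidates by their ability to avoid false negatives, which is the stated clinical priority.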
Tools provided in `src/visualization/` include:
- ROC curve generation
- Feature importance visualization
- Confusion matrix plotting
- Python
- pandas
- numpy
- scikit-learn
- matplotlib
- joblib
- argparse
- Jupyter Notebook
- Modular analytics engineering structure
- Accuracy: 0.747
- Precision: 0.632
- Recall: 0.667
- F1 Score: 0.649
- ROC AUC: 0.821
- Recall (0.667): The model captures most true diabetes cases, aligning with the priority of reducing false negatives.
- ROC AUC (0.821): Indicates strong class separability and reliable predictive discrimination.
- Precision and F1 remain balanced while recall is prioritized.
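As a quick sanity check, the reported F1 score is the harmonic mean of the precision and recall above:

```python
precision, recall = 0.632, 0.667
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(round(f1, 3))  # 0.649, matching the reported F1 score
```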
Earlier experimental approaches—threshold shifting, SMOTE oversampling, and initial hyperparameter tuning—helped guide the final implementation.
The structured pipeline improves clarity, reproducibility, and engineering quality without sacrificing clinical relevance.
Run the following command:
```shell
python main.py train
```
This handles data cleaning, splitting, hyperparameter tuning, model evaluation, and saving the final model.
To create predictions:
```shell
python main.py predict --input data/raw/diabetes.csv --output predictions.csv
```
The output file will contain a new column named `prediction`.
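The core of the predict step can be sketched as follows. This is a simplified stand-in for the logic in `main.py`, assuming a joblib-saved model; the function name and the label-column handling are illustrative:

```python
import joblib
import pandas as pd

def predict_to_csv(model_path: str, input_csv: str, output_csv: str) -> pd.DataFrame:
    """Load a saved model, score the input rows, and write a `prediction` column."""
    model = joblib.load(model_path)
    df = pd.read_csv(input_csv)
    features = df.drop(columns=["Outcome"], errors="ignore")  # drop label if present
    df["prediction"] = model.predict(features)
    df.to_csv(output_csv, index=False)
    return df
```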
- Decision threshold tuning for recall/precision optimization
- Optional SMOTE or alternative class balancing
- SHAP for interpretability
- FastAPI for real-time scoring
- MLflow for experiment tracking
- Unit testing and CI/CD pipeline integration
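For the first roadmap item, decision-threshold tuning could be sketched like this (the function name and the recall target are illustrative):

```python
from sklearn.metrics import precision_recall_curve

def pick_threshold(y_true, y_scores, min_recall=0.8):
    """Return the probability cutoff with the best precision among those meeting the recall target."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
    # thresholds is one element shorter than precision/recall, so align with [:-1]
    candidates = [
        (t, p) for p, r, t in zip(precision[:-1], recall[:-1], thresholds)
        if r >= min_recall
    ]
    # Fall back to the default 0.5 cutoff if no threshold meets the recall target
    return max(candidates, key=lambda tp: tp[1])[0] if candidates else 0.5
```

Lowering the cutoff below 0.5 trades precision for recall, which matches the project's goal of minimizing false negatives.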
Julian Charlan Kelly
Analytics Engineer / Data Engineer
Los Angeles, CA