Predicting Diabetes Risk Using Machine Learning
This project analyzes patient health metrics to predict diabetes risk using a modular, production-style machine learning pipeline.
The workflow includes:
- Data preprocessing and cleaning
- Exploratory data analysis (EDA)
- Model training using a recall-optimized Random Forest
- Standardized model evaluation
- Optional command-line prediction generation
The objective is to demonstrate an analytics engineering approach that is reproducible, interpretable, and clinically meaningful, especially for use cases where reducing false negatives matters.
Source: Public diabetes dataset inspired by the Pima Indians dataset
Description: Includes features such as glucose, BMI, blood pressure, insulin, age, diabetes pedigree function, and an outcome label indicating diabetes status.
- Explore patient health metrics and identify meaningful predictors.
- Build an ML classifier optimized for high recall (minimizing false negatives).
- Use a modular, production-inspired project structure.
- Provide clear evaluation metrics and interpretable insights.
- Develop a scalable baseline for future healthcare analytics workflows.
- Converted non-physiological zero values to missing values
- Applied median imputation across relevant fields
- Centralized all cleaning logic in `src/data/clean_data.py`
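The two cleaning steps above can be sketched as follows. Column names follow the standard Pima-style schema and the function name is illustrative; the project's actual logic lives in `src/data/clean_data.py`:

```python
import numpy as np
import pandas as pd

# Columns where a literal 0 is non-physiological (assumed Pima-style names)
ZERO_INVALID = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Convert non-physiological zeros to NaN, then median-impute each field."""
    df = df.copy()
    df[ZERO_INVALID] = df[ZERO_INVALID].replace(0, np.nan)
    df[ZERO_INVALID] = df[ZERO_INVALID].fillna(df[ZERO_INVALID].median())
    return df
```

Replacing zeros with `NaN` before computing the medians matters: it keeps the invalid values from dragging the medians down.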
- Investigated variable distributions, correlations, and feature interactions
- Identified glucose, BMI, age, and pedigree function as key predictors
- Restricted notebooks to exploratory purposes only; production logic is modular
- Used a `RandomForestClassifier` with class weighting to address imbalance
- Performed hyperparameter tuning via GridSearchCV (scoring = recall)
- Implemented training and evaluation pipelines in `src/models/`
- Saved the final trained model to the `models/` directory
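A minimal sketch of this training setup, with synthetic data standing in for the cleaned features and an illustrative hyperparameter grid (not the project's actual grid):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for the cleaned diabetes features
X, y = make_classification(n_samples=300, weights=[0.7, 0.3], random_state=42)

param_grid = {"n_estimators": [100, 200], "max_depth": [4, 8]}  # illustrative values
search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="recall",  # select hyperparameters that catch the most positive cases
    cv=3,
)
search.fit(X, y)
```

Setting `scoring="recall"` makes the grid search rank candidates by their ability to avoid false negatives, which is the stated clinical priority.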
Tools provided in `src/visualization/` include:
- ROC curve generation
- Feature importance visualization
- Confusion matrix plotting
- Python
- pandas
- numpy
- scikit-learn
- matplotlib
- joblib
- argparse
- Jupyter Notebook
- Modular analytics engineering structure
- Accuracy: 0.747
- Precision: 0.632
- Recall: 0.667
- F1 Score: 0.649
- ROC AUC: 0.821
- Recall (0.667): The model captures most true diabetes cases, aligning with the priority of reducing false negatives.
- ROC AUC (0.821): Indicates strong class separability and reliable predictive discrimination.
- Precision and F1 remain balanced while recall is prioritized.
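As a quick sanity check, the reported F1 score is the harmonic mean of the precision and recall above:

```python
precision, recall = 0.632, 0.667
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(round(f1, 3))  # 0.649, matching the reported F1 score
```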
Earlier experimental approaches—threshold shifting, SMOTE oversampling, and initial hyperparameter tuning—helped guide the final implementation.
The structured pipeline improves clarity, reproducibility, and engineering quality without sacrificing clinical relevance.
Run the following command:
```shell
python main.py train
```
This handles data cleaning, splitting, hyperparameter tuning, model evaluation, and saving the final model.
To create predictions:
```shell
python main.py predict --input data/raw/diabetes.csv --output predictions.csv
```
The output file will contain a new column named `prediction`.
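The core of the predict step can be sketched as follows. This is a simplified stand-in for the logic in `main.py`, assuming a joblib-saved model; the function name and the label-column handling are illustrative:

```python
import joblib
import pandas as pd

def predict_to_csv(model_path: str, input_csv: str, output_csv: str) -> pd.DataFrame:
    """Load a saved model, score the input rows, and write a `prediction` column."""
    model = joblib.load(model_path)
    df = pd.read_csv(input_csv)
    features = df.drop(columns=["Outcome"], errors="ignore")  # drop label if present
    df["prediction"] = model.predict(features)
    df.to_csv(output_csv, index=False)
    return df
```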
- Decision threshold tuning for recall/precision optimization
- Optional SMOTE or alternative class balancing
- SHAP for interpretability
- FastAPI for real-time scoring
- MLflow for experiment tracking
- Unit testing and CI/CD pipeline integration
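For the first roadmap item, decision-threshold tuning could be sketched like this (the function name and the recall target are illustrative):

```python
from sklearn.metrics import precision_recall_curve

def pick_threshold(y_true, y_scores, min_recall=0.8):
    """Return the probability cutoff with the best precision among those meeting the recall target."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
    # thresholds is one element shorter than precision/recall, so align with [:-1]
    candidates = [
        (t, p) for p, r, t in zip(precision[:-1], recall[:-1], thresholds)
        if r >= min_recall
    ]
    # Fall back to the default 0.5 cutoff if no threshold meets the recall target
    return max(candidates, key=lambda tp: tp[1])[0] if candidates else 0.5
```

Lowering the cutoff below 0.5 trades precision for recall, which matches the project's goal of minimizing false negatives.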
Julian Charlan Kelly
Analytics Engineer / Data Engineer
Los Angeles, CA