A modular machine learning pipeline designed to predict loan default risk using structured tabular data. The system reflects industry-standard approaches used in credit scoring and financial risk assessment.
- Objective: Binary classification of loan default risk
- Dataset: Kaggle Credit Risk Dataset
- Target Variable: `loan_status`
- Features: Borrower demographics, loan characteristics, credit history
Modularized for clarity and extensibility:
- `load_data.py`: Downloads and loads the dataset
- `preprocess.py`: Handles missing values, encodes features, removes outliers
- `train.py`: Trains models using Scikit-learn Pipelines with scaling
- `evaluate.py`: Generates performance metrics and visualizations
- `inference.py`: Loads saved models for prediction
- `scripts/run_pipeline.py`: Executes the full end-to-end workflow
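The hand-off between training and inference can be illustrated with a save/load round trip. This is only a sketch: it assumes the fitted pipeline is serialized with `joblib`, and the file name `model.joblib` and the synthetic data are hypothetical stand-ins, not taken from the repository.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed loan features (illustration only).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Scaling and the classifier travel together in one Pipeline object.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.joblib"  # hypothetical file name
    joblib.dump(pipe, model_path)            # what a training step might do
    loaded = joblib.load(model_path)         # what an inference step might do

# The reloaded pipeline applies the same scaling before predicting.
predictions = loaded.predict(X[:5])
default_probability = loaded.predict_proba(X[:5])[:, 1]
print(predictions, default_probability.round(3))
```

Persisting the whole pipeline, rather than the bare estimator, keeps the scaler's fitted statistics attached to the model, so inference does not need to repeat or re-fit the scaling step.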
Key data cleaning and feature engineering steps include:
- Missing Values:
  - Imputed missing employment length with `0` and flagged with a separate binary indicator
  - Dropped rows missing loan interest rates due to contextual inconsistency
- Outlier Removal:
  - Removed unrealistic ages (>122)
  - Excluded records indicating employment before age 13
- Feature Encoding:
  - One-hot encoded nominal features (`person_home_ownership`, `loan_intent`, `cb_person_default_on_file`)
  - Ordinally encoded `loan_grade` (`A`–`G` → 1–7)
- Stratified Sampling:
  - Ensured class distribution was preserved in training and testing sets
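The cleaning and encoding steps above might look roughly like the following pandas sketch. Several details are assumptions rather than the repository's code: the column names `person_age`, `person_emp_length`, and `loan_int_rate`, the indicator name `emp_length_missing`, and the reading of "employment before age 13" as `person_age - person_emp_length < 13`. The toy rows are made up for illustration.

```python
import numpy as np
import pandas as pd

NOMINAL_COLS = ["person_home_ownership", "loan_intent", "cb_person_default_on_file"]


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning and encoding steps listed above (illustrative sketch)."""
    df = df.copy()

    # Missing values: flag missing employment length, then impute with 0.
    df["emp_length_missing"] = df["person_emp_length"].isna().astype(int)
    df["person_emp_length"] = df["person_emp_length"].fillna(0)

    # Missing values: drop rows with no interest rate.
    df = df.dropna(subset=["loan_int_rate"])

    # Outliers: drop ages above 122 and rows implying work began before age 13
    # (assumed interpretation: age minus employment length below 13).
    df = df[df["person_age"] <= 122]
    df = df[(df["person_age"] - df["person_emp_length"]) >= 13]

    # Ordinal encoding: loan_grade A..G -> 1..7.
    grade_order = {grade: rank for rank, grade in enumerate("ABCDEFG", start=1)}
    df["loan_grade"] = df["loan_grade"].map(grade_order)

    # One-hot encoding of the nominal features.
    df = pd.get_dummies(df, columns=NOMINAL_COLS, dtype=int)
    return df.reset_index(drop=True)


# Made-up rows: one valid, one age outlier, one missing employment length,
# one implausible employment length, one missing interest rate.
raw = pd.DataFrame({
    "person_age": [25, 144, 30, 22, 40],
    "person_emp_length": [3.0, 4.0, np.nan, 15.0, 10.0],
    "loan_int_rate": [11.5, 9.0, 13.2, 10.0, np.nan],
    "loan_grade": ["A", "B", "G", "C", "D"],
    "person_home_ownership": ["RENT", "OWN", "MORTGAGE", "RENT", "OWN"],
    "loan_intent": ["EDUCATION", "MEDICAL", "PERSONAL", "VENTURE", "MEDICAL"],
    "cb_person_default_on_file": ["N", "Y", "N", "N", "Y"],
    "loan_status": [0, 1, 0, 1, 0],
})

clean = preprocess(raw)
print(clean.T)
```

On the toy frame only the first and third rows survive. Note that `pd.get_dummies` creates columns only for categories present in the data it sees, so a production pipeline would likely fit the encoder on the training split and reuse it at inference time.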
Models trained:
- Logistic Regression
- Random Forest
- Gradient Boosting
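A minimal sketch of training these three model types inside Scikit-learn Pipelines with scaling, after a stratified split. The synthetic data from `make_classification`, the 80/20 split, and every hyperparameter shown are assumptions for illustration, not the project's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the preprocessed credit data.
X, y = make_classification(
    n_samples=1000, n_features=12, weights=[0.8, 0.2], random_state=42
)

# stratify=y keeps the default/non-default ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

fitted = {}
for name, model in models.items():
    # The scaler is fit on the training split only, inside the pipeline.
    pipe = Pipeline([("scaler", StandardScaler()), ("model", model)])
    pipe.fit(X_train, y_train)
    fitted[name] = pipe
    print(f"{name}: test accuracy = {pipe.score(X_test, y_test):.3f}")
```

Tree-based models do not need scaled inputs, but keeping one pipeline shape for all three makes the models interchangeable in the later evaluation and inference steps.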
Evaluation approach:
- Stratified train/test split to preserve class balance
- Metrics reported: Accuracy, Precision, Recall, F1 Score, ROC AUC
- Visual outputs: Confusion matrix, ROC curve, Precision-Recall curve
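The reported metrics can be computed with `sklearn.metrics`, as in the sketch below. The helper name `evaluate`, the synthetic data, and the single logistic regression model are hypothetical stand-ins; the default 0.5 decision threshold of `predict` is also an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split


def evaluate(model, X_test, y_test) -> dict:
    """Return the metrics listed above for a fitted binary classifier."""
    y_pred = model.predict(X_test)               # hard labels (0.5 threshold)
    y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
        "roc_auc": roc_auc_score(y_test, y_score),  # uses scores, not labels
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }


# Synthetic, imbalanced stand-in for the credit data.
X, y = make_classification(
    n_samples=1000, n_features=12, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
report = evaluate(model, X_test, y_test)
for metric, value in report.items():
    print(metric, value, sep=": ")
```

For the visual outputs, Scikit-learn 1.0 and later provide `ConfusionMatrixDisplay.from_estimator`, `RocCurveDisplay.from_estimator`, and `PrecisionRecallDisplay.from_estimator`, which draw the three plots with matplotlib. Whether the repository uses these helpers is not stated.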
Key characteristics:
- Fully modular and reproducible implementation using Scikit-learn Pipelines
- Incorporates best practices for preprocessing, tuning, and model evaluation
- Suited for credit risk prediction and similar tabular classification tasks
```bash
# Run the complete training and evaluation pipeline
python -m scripts.run_pipeline
```