Automating and optimizing the loan recovery lifecycle by modeling repayment behavior using diverse data
Springboard Infosys Virtual Internship Program
- Project Overview
- Academic Information
- Objectives
- Dataset Features
- Dataset Sources
- Project Structure
- Technologies Used
- Installation & Setup
- Using the Application
- Business Impact
- Model Performance
- Key Insights
- Project Workflow
- Future Enhancements
- Troubleshooting
- Contributing
- License
- Contact
CreditPathAI is an end-to-end machine learning project that predicts loan defaults to support and optimize the loan recovery process. It includes an interactive web application built with Streamlit, in which users can assess loan default risk in real time using any of seven trained models. By analyzing customer demographics, financial history, and loan characteristics, the system helps financial institutions make informed lending decisions and reduce credit risk.
- Developer: Rajath Raj K T 👨‍💻
- Program: Springboard Infosys Virtual Internship
- Mentor: Dr. N Jagan Mohan
- Organization: Springboard Infosys
- Year: 2025
- Predict loan defaults with high accuracy using 7 different machine learning models
- Provide an interactive web application for real-time loan default risk assessment
- Automate risk assessment in the lending process
- Compare model performance with comprehensive metrics (Precision, Recall, F1-Score)
- Optimize loan recovery strategies based on customer behavior patterns
- Reduce financial losses by identifying high-risk borrowers early
- Improve decision-making through data-driven insights and visual analytics
The project uses a comprehensive loan dataset with 24+ key features:
- Gender, Age Bracket
- Region (North, South, Central, North-East)
- Credit Score, Credit Worthiness
- Debt-to-Income Ratio (DTI)
- Annual Income
- Credit Type (EQUI, CRIF, CIB, EXP)
- Loan Amount, Interest Rate, Loan Term
- Loan Purpose, Loan Type
- Loan Limit (Conforming/Non-Conforming)
- Pre-approval Status
- Loan-to-Value Ratio (LTV)
- Property Value
- Occupancy Type (Primary, Secondary, Investment)
- Total Units
- Business or Commercial Property
- Negative Amortization
- Interest Only
- Lump Sum Payment
- Submission of Application
- Status: Binary classification (0 = No Default, 1 = Default)
This project leverages two comprehensive datasets for robust loan default prediction:
- Source: Loan Default Dataset on Kaggle
- Description: Contains detailed loan application data with 24+ features including borrower demographics, financial metrics, and loan characteristics
- Usage: Primary dataset for model training and validation
- Source: Microsoft R Server Loan Credit Risk
- Description: Enterprise-grade dataset with borrower and loan information used for credit risk analysis
- Usage: Additional validation and comparative analysis
Both datasets provide complementary perspectives on loan default patterns, enabling more robust model development and validation.
CreditPathAI/
├── streamlit_app/                               # Interactive Web Application
│   ├── app.py                                   # Main Streamlit application
│   ├── utils.py                                 # Utility functions for model loading & prediction
│   ├── requirements.txt                         # App-specific dependencies
│   ├── models/                                  # Trained model pipelines (generated locally)
│   │   ├── logistic_regression_pipeline.joblib
│   │   ├── random_forest_pipeline.joblib
│   │   ├── xgboost_pipeline.joblib
│   │   ├── decision_tree_pipeline.joblib
│   │   ├── k-nearest_neighbors_pipeline.joblib
│   │   ├── gaussian_naive_bayes_pipeline.joblib
│   │   └── bernoulli_naive_bayes_pipeline.joblib
│   └── __pycache__/                             # Python cache files
│
├── notebooks/                                   # Jupyter Notebooks
│   ├── eda_report.ipynb                         # Exploratory Data Analysis (Kaggle dataset)
│   ├── loan_default_main.ipynb                  # Main ML pipeline development
│   ├── pre_processing__methods_2.ipynb          # Data preprocessing experiments
│   └── pre_processing__methods_2_updated.ipynb  # Final preprocessing & model training
│
├── microsoft_notebooks/                         # Microsoft Dataset Analysis
│   ├── eda_report.ipynb                         # EDA for Microsoft dataset
│   └── microsoft_loan_default.ipynb             # Microsoft dataset modeling
│
├── Loan_Default.csv                             # Primary dataset (Kaggle)
├── Loan.txt                                     # Loan data (Microsoft dataset)
├── Loan_Prod.txt                                # Production loan data
├── Borrower.txt                                 # Borrower data (Microsoft dataset)
├── Borrower_Prod.txt                            # Production borrower data
├── Model_comparison.xlsx                        # Model performance comparison results
│
├── requirements.txt                             # Project dependencies
├── .gitignore                                   # Git ignore rules
├── README.md                                    # Project documentation
└── LICENSE                                      # MIT License
Note: Model files (*.joblib) are excluded from version control due to GitHub's file size limits. Users need to train models locally using the provided notebooks.
- Python 3.13 - Primary programming language
- Streamlit - Interactive web application framework
- Pandas - Data manipulation and analysis
- NumPy - Numerical computations
- Scikit-learn 1.7.2 - Machine learning algorithms and preprocessing
- XGBoost - Gradient boosting framework
- Logistic Regression - Baseline linear classifier
- Random Forest - Ensemble learning with decision trees
- XGBoost - Gradient boosting classifier
- Decision Tree - Single tree classifier
- K-Nearest Neighbors (KNN) - Instance-based learning
- Gaussian Naive Bayes - Probabilistic classifier
- Bernoulli Naive Bayes - Binary feature classifier
- Matplotlib - Static data visualization
- Seaborn - Statistical graphics
- Jupyter Notebook - Interactive development environment
- Joblib - Model serialization and persistence
- Git - Version control
- Virtual Environment (.venv) - Isolated Python environment
- Git LFS - Large file storage (for model files)
- Python 3.13 or higher
- Git
- Virtual environment support
git clone https://github.com/springboardmentor891v/CreditPathAI.git
cd CreditPathAI

# Windows
python -m venv .venv
.venv\Scripts\activate

# macOS/Linux
python3 -m venv .venv
source .venv/bin/activate

# Install main project dependencies
pip install -r requirements.txt

# Install Streamlit app dependencies
pip install -r streamlit_app/requirements.txt

Since model files are not included in the repository (due to size constraints), you need to train them locally:
- Open the training notebook: `jupyter notebook notebooks/pre_processing__methods_2_updated.ipynb`
- Run all cells in the notebook, especially the final cell that trains and saves all 7 model pipelines
- Verify that `.joblib` files are created in the `streamlit_app/models/` directory
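The verification step can also be scripted. The following is a minimal sketch, assuming the seven file names shown in the project structure and that it is run from the repository root; the helper name `missing_models` is hypothetical and not part of the project.

```python
from pathlib import Path

# File names taken from the project structure listing; adjust if the
# training notebook saves the pipelines under different names.
EXPECTED_MODELS = [
    "logistic_regression_pipeline.joblib",
    "random_forest_pipeline.joblib",
    "xgboost_pipeline.joblib",
    "decision_tree_pipeline.joblib",
    "k-nearest_neighbors_pipeline.joblib",
    "gaussian_naive_bayes_pipeline.joblib",
    "bernoulli_naive_bayes_pipeline.joblib",
]


def missing_models(models_dir="streamlit_app/models"):
    """Return the expected pipeline files that are absent from models_dir."""
    directory = Path(models_dir)
    return [name for name in EXPECTED_MODELS if not (directory / name).is_file()]


if __name__ == "__main__":
    missing = missing_models()
    if missing:
        print("Missing model files:", ", ".join(missing))
    else:
        print("All 7 model pipelines found.")
```

If any file is reported missing, rerunning the final cell of the training notebook should regenerate it.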
streamlit run streamlit_app/app.py

The app will open in your browser at http://localhost:8501
- Model Selection - Choose from 7 different ML models
- Interactive Input Form - Enter loan applicant details via intuitive sliders and dropdowns
- Real-time Prediction - Get instant default risk predictions
- Risk Assessment - Color-coded risk levels (Low/Medium/High)
- Model Performance Metrics - View Precision, Recall, and F1-Score for each model
- Comparative Analysis - Compare performance across all models
The app accepts 24 different features including:
- Demographics: Gender, Age, Region
- Financial: Credit Score, Income, DTI Ratio
- Loan Details: Amount, Interest Rate, Term, Purpose
- Property: Value, Occupancy Type, Total Units
- And more...
- Prediction: Default or No Default
- Probability: Confidence score (0-100%)
- Risk Level: Low (🟢), Medium (🟡), or High (🔴) risk assessment
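As an illustration of how the three outputs relate, here is a small sketch that turns a predicted default probability into a prediction, a percentage, and a risk level. The cutoffs (0.33 and 0.66) and the 0.5 decision threshold are assumptions chosen for the example; the README does not state the values the app uses, and the function names are hypothetical.

```python
def risk_level(default_probability, low_cutoff=0.33, high_cutoff=0.66):
    """Map a default probability in [0, 1] to Low / Medium / High.

    The cutoffs are illustrative assumptions, not the app's actual values.
    """
    if not 0.0 <= default_probability <= 1.0:
        raise ValueError("default_probability must be between 0 and 1")
    if default_probability < low_cutoff:
        return "Low"
    if default_probability < high_cutoff:
        return "Medium"
    return "High"


def summarize(default_probability, decision_threshold=0.5):
    """Bundle the three outputs shown to the user into one dictionary."""
    is_default = default_probability >= decision_threshold
    return {
        "prediction": "Default" if is_default else "No Default",
        "probability_pct": round(default_probability * 100, 1),
        "risk_level": risk_level(default_probability),
    }


print(summarize(0.72))
```

In practice the probability would come from a loaded pipeline's `predict_proba` output for the positive class (Status = 1).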
- Risk Reduction: Early identification of potential defaulters with 7 different model approaches
- Cost Optimization: Reduced loan recovery costs through targeted strategies and accurate predictions
- Revenue Enhancement: Better loan approval decisions leading to improved profitability
- Process Automation: Streamlined lending workflow with AI-driven insights via web interface
- Real-time Decision Making: Instant risk assessment through interactive web application
- Model Flexibility: Choose the best model based on specific business requirements (precision vs recall)
- Scalability: Web-based deployment ready for enterprise integration
The project implements and compares 7 machine learning models:
| Model | Use Case | Key Strength |
|---|---|---|
| Logistic Regression | Baseline & interpretability | Simple, fast, interpretable coefficients |
| Random Forest | High accuracy | Ensemble learning, feature importance |
| XGBoost | Best performance | Advanced boosting, handles imbalanced data |
| Decision Tree | Rule extraction | Easy to visualize and explain |
| K-Nearest Neighbors | Pattern matching | Instance-based learning |
| Gaussian Naive Bayes | Fast prediction | Probabilistic, works well with continuous features |
| Bernoulli Naive Bayes | Binary features | Efficient for binary (e.g. one-hot encoded) features |
All models are evaluated using:
- Precision: Share of predicted defaults that are actual defaults
- Recall: Share of actual defaults that the model identifies
- F1-Score: Harmonic mean of precision and recall
- Class Balancing: Handled through `class_weight='balanced'` and `scale_pos_weight`
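To make the metrics and the two balancing settings concrete, the sketch below computes them by hand on a tiny, made-up label set. The weight formula follows scikit-learn's documented `balanced` heuristic, `n_samples / (n_classes * count)`, and the ratio of negatives to positives is the typical `scale_pos_weight` value suggested in the XGBoost documentation. The function names and the sample labels are illustrative only.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def balanced_class_weights(y_true):
    """Weights as in class_weight='balanced': n_samples / (n_classes * count)."""
    n = len(y_true)
    counts = {label: y_true.count(label) for label in sorted(set(y_true))}
    return {label: n / (len(counts) * c) for label, c in counts.items()}


def scale_pos_weight(y_true):
    """Ratio of negative to positive samples, a common XGBoost setting."""
    positives = sum(1 for t in y_true if t == 1)
    return (len(y_true) - positives) / positives


# Made-up labels: 1 = Default, 0 = No Default.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]

print(precision_recall_f1(y_true, y_pred))
print(balanced_class_weights(y_true))
print(scale_pos_weight(y_true))
```

Both settings give the minority (default) class more influence during training, which tends to raise recall at some cost in precision.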
- Analysis of customer demographics impact on default rates across two different datasets
- Correlation between credit scores and repayment behavior patterns
- Impact of loan-to-value ratio (LTV) and debt-to-income ratio (DTI) on default risk
- Regional variations in loan default patterns
- Optimal loan terms and interest rates for different customer segments
- Pre-approval status as a significant predictor
- Property occupancy type correlation with default probability
- Comparative analysis between Kaggle and Microsoft datasets for model validation
- Import datasets from Kaggle and Microsoft sources
- Exploratory Data Analysis (EDA) in dedicated notebooks
- Feature analysis and statistical summaries
- Handle missing values using appropriate imputation strategies
- Encode categorical variables (One-Hot Encoding)
- Scale numerical features (StandardScaler)
- Address class imbalance issues
- Create preprocessing pipelines using `ColumnTransformer`
- Separate numerical and categorical transformers
- Ensure consistent preprocessing across train/test splits
- Train 7 different classification models
- Use stratified train-test split
- Apply class balancing techniques
- Hyperparameter optimization for key models
- Compare models using Precision, Recall, and F1-Score
- Generate performance comparison reports
- Save trained model pipelines using Joblib
- Load trained models in Streamlit application
- Create interactive UI for real-time predictions
- Implement model selection and comparison features
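The workflow above can be sketched end to end in a few lines. This is a minimal illustration on synthetic data, not the notebooks' actual code: the column names, the choice of median imputation, and the single Logistic Regression model are assumptions made for the example. It shows a `ColumnTransformer` with separate numerical and categorical transformers, a stratified split, class balancing, the three evaluation metrics, and a Joblib round trip.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; column names are illustrative, not the real schema.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "credit_score": rng.integers(500, 900, n).astype(float),
    "loan_amount": rng.normal(300_000, 80_000, n),
    "region": rng.choice(["North", "South", "Central", "North-East"], n),
    "status": rng.choice([0, 1], n, p=[0.75, 0.25]),
})
df.loc[df.index[::25], "credit_score"] = np.nan  # inject some missing values

numeric_cols = ["credit_score", "loan_amount"]
categorical_cols = ["region"]

# Separate transformers: impute + scale numbers, one-hot encode categories.
numeric_tf = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric_tf, numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Preprocessing and classifier live in one pipeline, so the same
# transformations are applied at training and prediction time.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

X, y = df.drop(columns="status"), df["status"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
model.fit(X_train, y_train)

precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, model.predict(X_test), average="binary", zero_division=0
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# Save and reload the full pipeline, as the Streamlit app would.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "logistic_regression_pipeline.joblib"
    joblib.dump(model, path)
    reloaded = joblib.load(path)

same_predictions = bool((reloaded.predict(X_test) == model.predict(X_test)).all())
print("Reloaded pipeline matches:", same_predictions)
```

Saving the whole pipeline rather than only the classifier means the app can pass raw form inputs straight to `predict` without repeating any preprocessing code.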
- Advanced Feature Engineering: Derive new features from existing data
- Hyperparameter Tuning: GridSearchCV/RandomizedSearchCV for all models
- Deep Learning: Experiment with neural networks
- Model Explainability: SHAP values and LIME for model interpretation
- API Development: RESTful API for programmatic access
- Database Integration: Connect to production databases
- A/B Testing: Compare model performance in production
- Monitoring Dashboard: Track model performance over time
- Auto-retraining Pipeline: Automated model updates with new data
This project is part of the Springboard Infosys Virtual Internship program.
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
For major changes, please open an issue first to discuss proposed changes.
1. Model files not found error
Solution: Run the training notebook (pre_processing__methods_2_updated.ipynb)
to generate model files locally. Model files are excluded from Git due to size.
2. Package version conflicts
Solution: Use the exact versions specified in requirements.txt:
- scikit-learn==1.7.2
- xgboost (latest compatible version)
3. Streamlit app won't start
Solution: Ensure virtual environment is activated and all dependencies are installed:
pip install -r streamlit_app/requirements.txt
4. Import errors in notebooks
Solution: Install all project dependencies:
pip install -r requirements.txt
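For version conflicts, it can help to see what is actually installed in the active environment. The sketch below uses only the standard library; the helper name `check_pin` is hypothetical, and the package list is an assumption based on the technologies named in this README.

```python
from importlib.metadata import PackageNotFoundError, version


def check_pin(package, expected=None):
    """Report the installed version of a package and compare it to a pin."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return f"{package}: not installed"
    if expected is not None and installed != expected:
        return f"{package}: {installed} installed, {expected} expected"
    return f"{package}: {installed} OK"


# scikit-learn is pinned in this README; the others are checked for presence only.
for pkg, pin in [("scikit-learn", "1.7.2"), ("xgboost", None), ("streamlit", None)]:
    print(check_pin(pkg, pin))
```

A mismatch on scikit-learn is worth fixing first, since pipelines saved with one version may not load cleanly under another.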
Rajath Raj K T 👨‍💻
Springboard Infosys Virtual Intern
- Mentor: Dr. N Jagan Mohan
- Organization: Springboard Infosys
- Repository: GitHub - CreditPathAI
For questions, issues, or suggestions, please open an issue on GitHub.
This project is licensed under the MIT License - see the LICENSE file for details.
- Rajath Raj K T - Lead Developer & Machine Learning Engineer
- Dr. N Jagan Mohan - Project mentor for guidance and technical support
- Springboard Infosys - For providing the internship opportunity and resources
- Kaggle Community - For the comprehensive loan default dataset
- Microsoft - For the enterprise-grade credit risk dataset
- Open Source Community - For the amazing tools and libraries (Scikit-learn, XGBoost, Streamlit)
- Total Models: 7 Machine Learning Algorithms
- Dataset Features: 24+ loan and borrower attributes
- Data Sources: 2 (Kaggle + Microsoft)
- Notebooks: 6 comprehensive analysis and training notebooks
- Lines of Code: 2000+ (excluding libraries)
- Model Accuracy: Varies by model (see performance metrics in app)
⭐ If you find this project helpful, please consider giving it a star on GitHub!
💼 Developed by Rajath Raj K T | Springboard Infosys Virtual Internship 2025
Note: This project is developed as part of an educational internship program and is intended for learning and demonstration purposes. It showcases end-to-end machine learning workflow from data exploration to deployment.