Forecasting Rare Events: Credit Scoring
The Objective
The main question of this project is: Can machine learning techniques predict whether a customer will churn, file a claim, or repay a loan? And are these methods more accurate in their predictions than traditional statistical techniques such as logistic regression? This use case explores these questions within the framework of binary classification, comparing the forecasting performance of various methods, including CatBoost, logistic regression with and without regularization, neural networks, LightGBM, and XGBoost. The analysis also covers topics such as data preprocessing, model interpretability, overfitting and underfitting, and hyperparameter tuning. The goal is to provide an introductory guide to the application of actuarial data science methods to a supervised learning problem.
The Notebook
The analysis is conducted in a Python Jupyter notebook, which is publicly accessible. While this report provides a condensed summary of the key aspects and findings, the notebook contains more detailed examinations, comments, tables, and graphics. Interested readers can comment on, copy, modify, and extend the notebook with their own approaches. The notebook is available at the following link:
https://www.kaggle.com/floser/binary-classification-credit-scoring
Note: The DAV (German Actuarial Association) assumes no responsibility for the code and data hosted on Kaggle and referenced here. These reflect the individual opinions of the respective Kaggle users.
The Content
This actuarial data science use case is based on the application_train dataset from the Kaggle competition “Home Credit Default Risk.” This dataset, from an international bank, contains over 300,000 approved loan applications, with a label indicating whether there were payment difficulties. It includes 120 features that can be used for modeling, such as demographic information, credit and income details, and data from external sources. Since this dataset is quite imbalanced, with only about 8% of applications showing payment difficulties, the forecasting performance is evaluated using the AUC (area under the ROC curve) metric.
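The AUC metric used throughout the analysis has a useful probabilistic reading: it is the probability that a randomly chosen event (payment difficulty) receives a higher model score than a randomly chosen non-event, which is why it remains informative even at an 8% event rate. A minimal, library-free sketch of this definition:

```python
def auc(labels, scores):
    """AUC as the probability that a random event outranks a random
    non-event (Mann-Whitney formulation); ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# toy example with one event among four applications (hypothetical scores):
# the event outranks 2 of the 3 non-events, so AUC = 2/3
print(auc([1, 0, 0, 0], [0.5, 0.4, 0.6, 0.1]))
```

A model assigning random scores would score around 0.5 on this metric, and a perfect ranking would score 1.0, regardless of class imbalance.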
Part A
of the notebook demonstrates how, with minimal data preparation, a high-quality forecasting model can be built using the CatBoost gradient boosting method with default parameters. Additionally, the built-in feature importances of the CatBoost model provide initial insights into the predictive power of individual features.
Part B
aims to gain deeper insights into the data and modeling. This part first covers classical logistic regression, followed by a brief exploratory data analysis, feature engineering (creating new features from existing ones), and model interpretability using the Explainable AI method “SHAP.” It also discusses data preprocessing steps such as encoding, scaling, and subsampling of imbalanced data and examines their effects on the forecasting performance of the CatBoost standard model.
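The preprocessing steps mentioned above can be illustrated in a few lines of pandas. This is a hedged sketch, not the notebook's code: the column names mimic the Home Credit schema, and the ratio feature and min-max scaling are just one possible choice each.

```python
import pandas as pd

# illustrative rows with Home-Credit-style column names (hypothetical values)
df = pd.DataFrame({
    "AMT_CREDIT": [300000.0, 450000.0, 250000.0],
    "AMT_INCOME_TOTAL": [120000.0, 90000.0, 200000.0],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans"],
})

# feature engineering: how large is the requested loan relative to income?
df["CREDIT_INCOME_RATIO"] = df["AMT_CREDIT"] / df["AMT_INCOME_TOTAL"]

# encoding: one-hot encode categoricals, as logistic regression requires
# numeric inputs (CatBoost would not need this step)
df = pd.get_dummies(df, columns=["NAME_CONTRACT_TYPE"])

# scaling: min-max scale the new feature to [0, 1]
col = "CREDIT_INCOME_RATIO"
df[col + "_SCALED"] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
print(df.head())
```

Gradient boosting methods are largely insensitive to scaling, but for logistic regression and neural networks these steps can matter considerably.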
Part C
focuses on the optimization and practical application of machine learning models. This part addresses overfitting using regularized logistic regression and hyperparameter tuning in neural networks and gradient boosting methods such as CatBoost, LightGBM, and XGBoost. After comprehensive model evaluation using validation and test data, it concludes by discussing applications in high-risk areas.
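Regularization and cross-validated hyperparameter tuning, as used in Part C, can be sketched with scikit-learn. The data here is synthetic and imbalanced to mimic the setting; the parameter grid and CV scheme are illustrative assumptions, not the notebook's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# synthetic imbalanced stand-in (~92% non-events) for the credit features
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.92], random_state=0)

# L2-regularized logistic regression; C is the inverse regularization
# strength, tuned by 5-fold cross-validation against AUC
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Smaller values of `C` shrink the coefficients more aggressively, which counters overfitting; the same grid-search pattern extends to the tree depth, learning rate, and iteration counts of LightGBM and XGBoost.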
The Findings
To conclude, here are the key insights for such machine learning problems:
- CatBoost: This machine learning method stands out due to its minimal data preprocessing requirements, ability to natively handle categorical features, and its quick and easy generation of high-quality results in binary classification problems.
- Feature Engineering: Enhancing existing data by generating new features (e.g., the ratio of credit to income) is a crucial step—if not the most crucial step—to significantly improve forecasting accuracy. This emphasizes the importance of domain-specific expertise in optimizing predictive models.
- Hyperparameter Tuning: While feature engineering has a more significant impact, hyperparameter tuning using cross-validation on powerful hardware (e.g., GPU-supported) can noticeably improve the performance of gradient boosting tools like LightGBM and XGBoost.
- Subsampling: Increasing the event rate from 8% to 25% by removing numerous non-events from the training data helps models balance the consideration of events and non-events and slightly improves forecasting accuracy. The reduced size of the training data also shortens computation times, enabling additional tuning strategies to further enhance prediction quality.
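The subsampling idea from the last bullet can be sketched as follows: keep all events and draw only as many non-events as the target event rate allows (at a 25% target, three non-events per event). The helper name and column name are illustrative.

```python
import pandas as pd

def downsample_non_events(df, label="TARGET", target_rate=0.25, seed=0):
    """Keep all events; sample non-events so events make up target_rate
    of the resulting training data (simple undersampling sketch)."""
    events = df[df[label] == 1]
    non_events = df[df[label] == 0]
    # events / (events + non_events) = target_rate  =>  solve for non_events
    n_keep = int(len(events) * (1 - target_rate) / target_rate)
    sampled = non_events.sample(n=min(n_keep, len(non_events)), random_state=seed)
    return pd.concat([events, sampled]).sample(frac=1, random_state=seed)

# 8% event rate on 1,000 rows -> 80 events; a 25% target keeps 240 non-events,
# so the training set shrinks to 320 rows with an event rate of 0.25
df = pd.DataFrame({"TARGET": [1] * 80 + [0] * 920, "x": range(1000)})
train = downsample_non_events(df)
print(len(train), train["TARGET"].mean())
```

Note that subsampling is applied to the training data only; validation and test data keep the original event rate so that the evaluation reflects reality.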
These results clearly demonstrate how the combination of machine learning methods and domain-specific knowledge can pave the way for more effective modeling.