Simon Hatzesberger and Friedrich Loser | 02/06/2024 | Actuarial Data Science
3 min reading time

Forecasting Rare Events: Credit Scoring

This use case shows how machine learning methods such as CatBoost and XGBoost can achieve higher predictive accuracy than traditional methods such as logistic regression when forecasting customer behavior. It covers data preprocessing, feature engineering, and hyperparameter tuning, and shows how each contributes to accurate and efficient modeling.

Committee for Actuarial Data Science

The Objective

The central question of this project is whether machine learning techniques can predict if a customer will churn, file a claim, or repay a loan, and whether their predictions are more accurate than those of traditional statistical techniques such as logistic regression. This use case treats the question as a binary classification problem and compares the forecasting performance of several methods: logistic regression with and without regularization, neural networks, and the gradient boosting methods CatBoost, LightGBM, and XGBoost. The analysis also covers data preprocessing, model interpretability, overfitting and underfitting, and hyperparameter tuning. The goal is to provide an introductory guide to applying actuarial data science methods to a supervised learning problem.

The Notebook

The analysis is conducted in a publicly accessible Python Jupyter notebook. This report gives a condensed summary of the key aspects and findings; the notebook contains more detailed examinations, comments, tables, and graphics. Interested readers can comment on, copy, modify, and extend the notebook with their own approaches. The notebook is available at the following link:

https://www.kaggle.com/floser/binary-classification-credit-scoring

Note: The DAV (German Actuarial Association) is not responsible for the code and data hosted on Kaggle and referenced here. They reflect the individual views of the respective Kaggle users.

The Content

This actuarial data science use case is based on the application_train dataset from the Kaggle competition “Home Credit Default Risk.” The dataset, from an international bank, contains over 300,000 approved loan applications, each with a label indicating whether there were payment difficulties. It includes 120 features that can be used for modeling, such as demographic information, credit and income details, and data from external sources. Because the dataset is highly imbalanced, with only about eight percent of applications showing payment difficulties, forecasting performance is evaluated using the AUC (area under the ROC curve), which measures how well a model ranks events above non-events.
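To make the choice of metric concrete, the following minimal sketch computes the AUC as the share of (event, non-event) pairs in which the event receives the higher score, with ties counting half. The function name `auc_score` and the toy labels and scores are illustrative assumptions, not code or data from the notebook. The pairwise loop is quadratic and only meant for illustration; a library routine would be used on the full dataset.

```python
def auc_score(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney U).

    labels: iterable of 0/1 (1 = payment difficulties)
    scores: iterable of predicted scores, higher = more likely an event
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one event and one non-event")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # event ranked above non-event
            elif p == n:
                wins += 0.5   # ties count half
    return wins / (len(pos) * len(neg))


# Toy example: 2 events among 10 applications (illustrative values only)
labels = [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
scores = [0.02, 0.10, 0.05, 0.30, 0.40, 0.07, 0.03, 0.12, 0.25, 0.01]
print(auc_score(labels, scores))  # 0.9375 (15 of 16 pairs ranked correctly)
```

Because the AUC depends only on the ranking of the scores, it does not reward a model for simply predicting the majority class. Accuracy, by contrast, would already be about 92% for a model that never predicts payment difficulties.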

Part A of the notebook demonstrates how, with minimal data preparation, a high-quality forecasting model can be built using the CatBoost gradient boosting method with default parameters. Additionally, the built-in feature importances of the CatBoost model provide initial insights into the predictive power of individual features.
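A rough sketch of what such a "minimal preparation" workflow might look like is given below. It assumes the `catboost` package is installed and that the data is held as plain Python rows; the helper names `prepare_rows` and `fit_default_catboost` and the two sample rows are hypothetical and not taken from the notebook, which may organize these steps differently.

```python
import math


def prepare_rows(rows, cat_idx):
    """Make raw rows acceptable to CatBoost without further preprocessing.

    Missing categorical values (None) become the string "missing", because
    CatBoost expects categorical values to be strings or integers.
    Missing numeric values (None) become NaN, which CatBoost handles natively.
    """
    cat_idx = set(cat_idx)
    prepared = []
    for row in rows:
        new_row = []
        for j, value in enumerate(row):
            if j in cat_idx:
                new_row.append("missing" if value is None else str(value))
            else:
                new_row.append(math.nan if value is None else float(value))
        prepared.append(new_row)
    return prepared


def fit_default_catboost(rows, labels, cat_idx, seed=42):
    """Fit CatBoost with default hyperparameters; return model and importances."""
    # Imported here so the data helper above works without catboost installed.
    from catboost import CatBoostClassifier

    model = CatBoostClassifier(random_seed=seed, verbose=0)
    model.fit(prepare_rows(rows, cat_idx), labels, cat_features=sorted(set(cat_idx)))
    # One importance value per input column, in column order.
    return model, model.get_feature_importance()


# Hypothetical mini-sample: [contract type, income, credit amount]
raw = [["Cash loans", 202500.0, 406597.5], [None, 270000.0, None]]
print(prepare_rows(raw, cat_idx=[0]))
```

Calling `fit_default_catboost(rows, labels, cat_idx)` on the full training data would return a fitted model together with its feature importances, which can then be sorted to see which features the model relies on most.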

Part B aims to gain deeper insights into the data and modeling. This part first covers classical logistic regression, followed by a brief exploratory data analysis, feature engineering (creating new features from existing ones), and model interpretability using the Explainable AI method “SHAP.” It also discusses data preprocessing steps such as encoding, scaling, and subsampling of imbalanced data and examines their effects on the forecasting performance of the CatBoost standard model.
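As an illustration of creating a new feature from existing ones, the sketch below derives a credit-to-income ratio for a single application record. `AMT_CREDIT` and `AMT_INCOME_TOTAL` are columns of the application_train dataset; the function name and the new feature name `CREDIT_INCOME_RATIO` are assumptions made for this example, and the notebook may define its features differently.

```python
def add_credit_income_ratio(record):
    """Return a copy of the record with a credit-to-income ratio added.

    The ratio is left missing (None) when an input is missing or the
    income is not positive, so no misleading value is introduced.
    """
    enriched = dict(record)
    income = record.get("AMT_INCOME_TOTAL")
    credit = record.get("AMT_CREDIT")
    if income is None or credit is None or income <= 0:
        enriched["CREDIT_INCOME_RATIO"] = None
    else:
        enriched["CREDIT_INCOME_RATIO"] = credit / income
    return enriched


# Illustrative record: a loan of 2.5 times the applicant's annual income
example = {"AMT_INCOME_TOTAL": 100000.0, "AMT_CREDIT": 250000.0}
print(add_credit_income_ratio(example)["CREDIT_INCOME_RATIO"])  # 2.5
```

A ratio like this encodes domain knowledge directly: the same credit amount means something different for applicants with very different incomes, which a model would otherwise have to learn from the two raw columns.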

Part C focuses on the optimization and practical application of machine learning models. It addresses overfitting with regularized logistic regression and covers hyperparameter tuning for neural networks and for gradient boosting methods such as CatBoost, LightGBM, and XGBoost. After a comprehensive model evaluation using validation and test data, it concludes by discussing applications in high-risk areas.
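For orientation, regularized logistic regression can be written as a penalized log-loss minimization. The formulation below uses an L2 (ridge) penalty as an assumption; the notebook may use a different penalty or parameterization. Here $y_i \in \{0, 1\}$ is the label, $x_i$ the feature vector of application $i$, and $\lambda \ge 0$ the regularization strength.

```latex
(\hat{\beta}_0, \hat{\beta})
  = \arg\min_{\beta_0,\,\beta}\;
    -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i \log p_i + (1-y_i)\log(1-p_i) \Big]
    + \lambda \lVert \beta \rVert_2^2,
\qquad
p_i = \frac{1}{1+\exp\!\big(-(\beta_0 + x_i^{\top}\beta)\big)}
```

Setting $\lambda = 0$ recovers classical logistic regression. Larger values shrink the coefficients toward zero, which reduces overfitting at the cost of some bias, so $\lambda$ is itself a hyperparameter that would be chosen on validation data.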

The Findings

To conclude, here are the key insights for such machine learning problems:

  1. CatBoost: This machine learning method stands out because it requires minimal data preprocessing, handles categorical features natively, and produces high-quality results quickly and easily in binary classification problems.
  2. Feature Engineering: Enhancing existing data by generating new features (e.g., the ratio of credit to income) is a crucial step, if not the most crucial one, for significantly improving forecasting accuracy. This underlines the importance of domain-specific expertise in optimizing predictive models.
  3. Hyperparameter Tuning: While feature engineering has a more significant impact, hyperparameter tuning using cross-validation on powerful hardware (e.g., with GPU support) can noticeably improve the performance of gradient boosting tools like LightGBM and XGBoost.
  4. Subsampling: Increasing the event rate from 8% to 25% by removing numerous non-events from the training data helps models give balanced weight to events and non-events and slightly improves forecasting accuracy. The reduced size of the training data also shortens computation times, enabling additional tuning strategies to further enhance prediction quality.
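The subsampling step in the last point can be sketched as follows. To reach a target event rate r while keeping all events, the number of non-events to keep is (number of events) × (1 − r) / r. The function name `subsample_indices`, the seed, and the toy labels are illustrative assumptions rather than the notebook's code.

```python
import random


def subsample_indices(labels, target_rate, seed=0):
    """Keep all events and a random subset of non-events so that the
    event rate in the returned sample is approximately target_rate."""
    if not 0 < target_rate < 1:
        raise ValueError("target_rate must be in (0, 1)")
    events = [i for i, y in enumerate(labels) if y == 1]
    non_events = [i for i, y in enumerate(labels) if y == 0]
    # Non-events needed for the target rate, capped at what is available.
    n_keep = min(len(non_events),
                 round(len(events) * (1 - target_rate) / target_rate))
    rng = random.Random(seed)  # fixed seed for reproducibility
    kept = rng.sample(non_events, n_keep)
    return sorted(events + kept)


labels = [1] * 8 + [0] * 92            # toy data with an 8% event rate
idx = subsample_indices(labels, target_rate=0.25)
rate = sum(labels[i] for i in idx) / len(idx)
print(len(idx), rate)                  # 32 0.25
```

Only the training data should be subsampled in this way; validation and test data keep the original event rate. Because subsampling shifts predicted probabilities upward, they would need recalibration if used as default probabilities, whereas a rank-based metric such as the AUC is not affected by this shift.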

These results clearly demonstrate how the combination of machine learning methods and domain-specific knowledge can pave the way for more effective modeling.

Contact: Sinem Sarma-Günes | sinem.sarma-guenes@aktuar.de | +49 (0) 221 912 554-226
