Credit Scoring

Description

Credit scoring is an evaluation of a bank customer's creditworthiness: how able and willing they are to pay off their debt. According to the statistics, less than half of the banked population is regarded as eligible for lending, which is why the need for smarter credit scoring solutions is evident.

AI credit scoring decisions are based on a wide range of data, such as total income, credit history, transaction analysis, and work experience. In essence, scoring is a mathematical model that is built with statistical methods and accounts for a large amount of information. As a result, AI-based credit scoring provides more sensitive, individualized credit score assessments based on an array of additional real-time factors, giving access to finance to more people with income potential. Here we demonstrate a credit scoring assessment in the Datrics platform, based on open-source data.

Problem Statement

Create a Machine Learning model to evaluate the Credit Default Risk and build a Scorecard for assessing the creditworthiness of new Customers.

Dataset

A sample of the Lending Club public dataset, which contains 19894 records of bank customer information: demographic data, financial data provided with the loan request, loan details, credit history, and the current payment status.

Features

Feature - Description

acc_now_delinq

acc_open_past_24mths - Number of trades opened in past 24 months.

accept_d - The date which the borrower accepted the offer

addr_city - The address city provided by the borrower during loan application.

addr_state - The address state provided by the borrower during loan application

annual_inc - The annual income provided by the borrower during registration.

Target Variable

loan_failed - a binary variable derived from loan_status

Scoring Model Training Scenario

The general schema of Customer Credit Scoring can be depicted as a sequence of steps:

1. Gather the data from which the Customer's factors can be extracted

2. Describe the target variable and exclude the cases with unclear results

3. Generate extra features that can be useful for the creditworthiness assessment

4. Select the features that are informative with respect to the target variable - this is required for the Scorecard creation

5. Represent all features as binary variables by using encoding and binning

6. Train and assess the Credit Default Risk model

7. Transform the model into the Scorecard

Datrics Pipeline

Pipeline Description

Data Preprocessing and Target Definition

BRICKS

Filter Columns, Create New Column, Join Data

1. Remove forward-looking features, that is, information about the current loan's payment statistics that is not available before the credit is granted:

2. Upload the Lending Club data (lending_data.csv) together with the Fair Isaac Corporation scores (member_raiting.csv)

3. Define the binary target variable loan_failed:

Customers that have the status "Current", "Late (13-30 days)" or "In Grace Period" are considered open cases (cases with uncertain results) and will be excluded from the analysis
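The target definition can be sketched in plain Python as follows. This is only an illustration, not the Datrics brick configuration: the open statuses are copied from the text above, while the rule that "Fully Paid" is the only successful outcome (and every other closed status counts as a failure) is an assumption, as are the function name and the toy records.

```python
# Status labels treated as open cases, copied from the text above.
OPEN_STATUSES = {"Current", "Late (13-30 days)", "In Grace Period"}

# Assumption: "Fully Paid" is the only successful outcome; any other
# closed status (for example "Charged Off") counts as a failed loan.
GOOD_STATUSES = {"Fully Paid"}


def add_loan_failed(records):
    """Drop open cases and add a binary loan_failed field to the rest."""
    labelled = []
    for record in records:
        status = record["loan_status"]
        if status in OPEN_STATUSES:
            continue  # outcome is not known yet, so the loan is excluded
        failed = 0 if status in GOOD_STATUSES else 1
        labelled.append({**record, "loan_failed": failed})
    return labelled


# Toy records, for illustration only.
loans = [
    {"id": 1, "loan_status": "Fully Paid"},
    {"id": 2, "loan_status": "Charged Off"},
    {"id": 3, "loan_status": "Current"},
    {"id": 4, "loan_status": "In Grace Period"},
]
labelled_loans = add_loan_failed(loans)
print([(r["id"], r["loan_failed"]) for r in labelled_loans])  # [(1, 0), (2, 1)]
```

The status labels should be checked against the values that actually occur in loan_status before a rule like this is relied on.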

Resulting dataset:

Extract Numerical Features for modelling

BRICKS

Typization, Missing Value Treatment, Apply Function, Filter Rows

1. Remove all uninformative features

2. Remove Customers with uncertain results:

3. Treat the missing values and extract the numerical information from the full-text descriptions (emp_length and term)

4. Introduce the Loan to Income ratio:
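Steps 3 and 4 can be illustrated with a short sketch. It assumes that emp_length and term hold text such as "10+ years" and " 36 months", that the loan amount is stored in a column named loan_amnt, and that a missing value is replaced by a caller-supplied default; none of these details is stated above, and the helper names are hypothetical.

```python
import re


def first_number(text, default=None):
    """Return the first integer found in a text value, or default if none."""
    match = re.search(r"\d+", text or "")
    return int(match.group()) if match else default


def loan_to_income(loan_amnt, annual_inc):
    """Loan amount divided by annual income; None if income is missing or zero."""
    if not annual_inc:
        return None
    return loan_amnt / annual_inc


print(first_number("10+ years"))        # 10
print(first_number(" 36 months"))       # 36
print(first_number("n/a", default=0))   # 0, the missing value gets the default
print(loan_to_income(12000, 48000))     # 0.25
```

Under this simple rule a value such as "< 1 year" maps to 1; whether that or 0 is the better encoding is a modelling choice.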

Informative Features Selection and Quantization

BRICKS

Variable Selection, Binning, Encoding

Binning

We apply the Binning procedure to transform each numerical variable into a categorical one and then into a binary vector. This representation is required for the Scorecard creation.

Each feature is divided into three categories via quantile-based auto-binning, and the less informative ones should be removed to avoid the multicollinearity problem. The feature fico_range_low was excluded because of its extremely high correlation with fico_range_high. The features total_rev_hi_lim, fico_range_low, installment and open_acc were not transformed to the categorical view because their bins were non-informative.
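A minimal sketch of quantile-based binning into three categories followed by one-hot encoding is shown below. It uses the standard-library statistics.quantiles function rather than the Datrics Binning brick, whose exact cut-point rule is not described here, and the income values are made up for illustration.

```python
import statistics


def tertile_edges(values):
    """Two cut points that split the values into three quantile-based bins."""
    return statistics.quantiles(values, n=3)


def bin_index(value, edges):
    """Index of the bin (0, 1 or 2) that a value falls into."""
    return sum(value > edge for edge in edges)


def one_hot(index, n_bins=3):
    """Binary vector with a single 1 at the position of the bin."""
    return [1 if i == index else 0 for i in range(n_bins)]


# Made-up annual incomes, for illustration only.
annual_inc = [20000, 35000, 42000, 58000, 61000, 75000, 90000, 120000, 150000]
edges = tertile_edges(annual_inc)
bins = [bin_index(v, edges) for v in annual_inc]
encoded = [one_hot(b) for b in bins]

print(bins)        # three values in each of the bins 0, 1 and 2
print(encoded[0])  # [1, 0, 0], the lowest income falls into the first bin
```

Each numerical feature becomes three binary columns in this form, which is the representation that the scorecard needs.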

Create the Credit Default Assessment model

BRICKS

Union Data, Classes Weighting, Split Data, Logistic Regression

1. Select the categorical features only - these features were removed before the Binning stage.

2. Union the categorical features with the informative binning categories

3. Assign weights to the classes to address the class imbalance problem

4. Split the dataset with stratification into two subsamples, train and test, for the model training and validation.

5. Train the Default Risk Assessment model (Logistic Regression) with the previously defined class weights
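Steps 3 and 4 can be sketched without any library. The weighting scheme used by the Datrics brick is not stated above, so the example assumes the common "balanced" heuristic, n_samples / (n_classes * class_count); the 25% test share, the seed and the toy labels are also assumptions.

```python
import random
from collections import Counter, defaultdict


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def stratified_split(labels, test_share=0.25, seed=0):
    """Return (train_idx, test_idx) with the class ratio kept in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_share)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


loan_failed = [0] * 16 + [1] * 4  # toy labels: 80% repaid, 20% failed
weights = balanced_class_weights(loan_failed)
train_idx, test_idx = stratified_split(loan_failed)

print(weights)                             # {0: 0.625, 1: 2.5}
print(len(train_idx), len(test_idx))       # 15 5
print(sum(loan_failed[i] for i in test_idx))  # 1 failed loan in the test part
```

The rarer failed class receives the larger weight, so errors on it cost more during training; the resulting dictionary is the kind of input a weighted logistic regression expects.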

Model Performance

The trained model demonstrates good predictive abilities - Area Under Curve is equal to 0.7 with high Precision and sufficient Recall. The most important features are Grade, Annual Income, DTI rate, FICO rate, and Credit Purpose. Annual Income has a strong negative influence on the Default Risk (the higher the income, the lower the risk), as does the credit purpose "credit card". The influence of the Assigned Grade behaves as expected.
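For readers unfamiliar with the metric, Area Under Curve can be read as the probability that a randomly chosen defaulted loan receives a higher predicted risk than a randomly chosen repaid one. The sketch below computes it by direct pair counting on made-up labels and probabilities; it illustrates the metric only and does not reproduce the 0.7 reported above.

```python
def roc_auc(labels, scores):
    """Share of (positive, negative) pairs in which the positive scores higher."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Made-up ground truth (1 = loan failed) and predicted default probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
auc = roc_auc(y_true, y_prob)
print(auc)  # 0.75
```

Pair counting is quadratic in the number of records, so it suits small illustrations rather than a dataset of this size.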

Model transformation to the Scorecards

BRICKS

Scorecard, Predict

The trained model can be applied to new data to assess the Credit Default Risk of new Customers. Moreover, we may transform it into Scorecards that provide a clear understanding of the different factors' contributions to the final Credit Score. In our case, we transformed the coefficients of the Logistic Regression to the factors' scores via Odds Scaling:

where pdo is Points to Double the Odds: a Target Score of 600 corresponds to odds of 20 to 1 (Target Odds) of good customers to bad ones, and an increase of 50 points means a doubling of the odds
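With these parameters, the conventional form of odds scaling is Score = Offset + Factor * ln(odds), where Factor = pdo / ln(2) and Offset = Target Score - Factor * ln(Target Odds). The sketch below assumes this standard form; the exact implementation in the Scorecard brick may differ, and the intercept and coefficients are hypothetical values chosen only for illustration.

```python
import math

PDO = 50            # points to double the odds
TARGET_SCORE = 600  # score assigned at the target odds
TARGET_ODDS = 20    # good-to-bad odds at the target score

FACTOR = PDO / math.log(2)
OFFSET = TARGET_SCORE - FACTOR * math.log(TARGET_ODDS)


def score_from_odds(good_to_bad_odds):
    """Credit score for the given good-to-bad odds."""
    return OFFSET + FACTOR * math.log(good_to_bad_odds)


def scorecard_points(intercept, coefficients):
    """Convert default-risk log-odds coefficients into scorecard points.

    The model predicts ln(bad/good), so the sign is flipped to score ln(good/bad).
    """
    base_points = OFFSET - FACTOR * intercept
    points = {name: -FACTOR * coef for name, coef in coefficients.items()}
    return base_points, points


# Hypothetical intercept and coefficients for two binary features.
base, points = scorecard_points(-1.5, {"grade_A": -0.8, "grade_G": 0.9})

print(round(score_from_odds(20)))  # 600
print(round(score_from_odds(40)))  # 650, doubling the odds adds 50 points
print(points["grade_A"] > 0, points["grade_G"] < 0)  # True True
```

A customer's score is the base points plus the points of every binary feature that equals 1, so a feature that lowers the default risk adds points and a feature that raises it subtracts them.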

The final Scorecard looks like:

The scores have the opposite sign to the LogReg coefficients because the two assessments have opposite purposes: the LogReg model assesses the Credit Default Risk, whereas the Scorecard assesses creditworthiness.

Visualizations

The pipeline contains Chart components that visualize the dependencies between the predictors and the target variable, as well as the quality of the Risk Assessment model. The charts below show how the credit default probability depends on:

ground truth groups - we can see that the model is able to differentiate the at-risk and not-at-risk groups

grade A / not grade A - the Customers who belong to the Grade A group are very likely to be assessed as creditworthy clients

grade G / not grade G - the Customers who belong to the Grade G group are very likely to get a low Credit Score

Model Save and Deployment