2/19/2023

Bank loan risk engine

While the percentage of customers defaulting on loans is generally low, the financial impact can be disproportionately large. In this case, defaulting on a loan means failure to make loan payments for consecutive months. This makes it a key problem for banks and other financial institutions to solve: by using machine learning models to predict default accounts, they can proactively handle problematic accounts and minimize risk. Classifying loan accounts that are likely to default within 12 months provides enough lead time to handle the at-risk accounts and mitigate losses, while not worrying over accounts which may still recover. By applying a logistic regression to customer transaction data, we were able to increase the loan default prediction F1-score from 36% to 54%, significantly reducing potential losses.

The project was solved and implemented in IBM's Cloud Pak for Data (CP4D) and its analytics projects, which are collaborative spaces for organizing and managing project resources and can be used for the full data science process. Analytics projects provided shared, direct access to data sources, which simplified the data exploration and preparation process and accelerated modeling by enabling easily customizable and scalable Jupyter notebook environments. The final model could also be productionalized by scheduling a versioned Jupyter notebook or by wrapping the data preparation and model inside a Watson Machine Learning pipeline, which would allow the model to be called as an API.

![]()

This solution focused on determining how transactional banking information could be used to predict loan account default. Bank account transaction data provides insight into customer behavior and the patterns that lead to loan default.

The data consisted of account information, including the types, sources, locations, and amounts of transactions. Additionally, account balance and the age and type of account were included. Combining this information with the age and type of loan provided a rich feature set to work from.

One of the first issues to tackle was the large imbalance between the default and non-default classifications. As seen in the graph below, the default rate of accounts is 2% or less per quarter. Without additional sampling methods, this imbalance would cause the model to almost exclusively learn the patterns of non-default accounts and perform poorly on positive (default) accounts.

![]()

To counter this, the training data consisted of 100% of the defaults and a randomly selected 30% of the non-defaults. This changes the default rate from ~0.7% to ~2%. In addition, different weights are assigned to default vs non-default in the model, which helps the model learn despite the imbalance. Additional sampling steps were performed on the test data to ensure that it had the same default rate as the training data, for fair evaluation of the model. The training and test data were from 2012–2017, and the validation data was from 2018 Q1–Q3. While sampling was useful for training and testing the model, no sampling was done on the validation data – 100% of defaults and non-defaults were used – to show how the model would perform in real life.

Feature Engineering

Another issue to handle was data aggregation. Since the data was at transaction level and the target was on a quarterly scale, aggregations were applied to create useful account-level features which could be used by a model to create one classification per account. These features included monthly and yearly transactional summaries. The following two graphs show how some of the average transaction amounts by transaction type and the total sum of debits and credits could be used to differentiate default vs non-default.

![]()
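The rebalancing scheme described above (keep every default, randomly sample 30% of non-defaults) can be sketched in pandas. The `default` column name and the `rebalance` helper are illustrative, not from the original project:

```python
import pandas as pd

def rebalance(accounts: pd.DataFrame, non_default_frac: float = 0.30,
              seed: int = 42) -> pd.DataFrame:
    """Keep all defaults and a random fraction of non-defaults.

    Assumes a binary `default` column (1 = defaulted). The column name
    and fraction are illustrative.
    """
    defaults = accounts[accounts["default"] == 1]
    non_defaults = accounts[accounts["default"] == 0].sample(
        frac=non_default_frac, random_state=seed)
    # Shuffle so defaults are not grouped at the top of the frame.
    return pd.concat([defaults, non_defaults]).sample(frac=1.0, random_state=seed)

# With a ~0.7% base rate, keeping 30% of non-defaults gives roughly
# 0.7 / (0.7 + 99.3 * 0.3) ≈ 2.3%, consistent with the ~0.7% → ~2% shift.
```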
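The transaction-to-account aggregation described above (rolling transaction-level rows up to one row per account and quarter, with per-type averages and debit/credit totals) can be illustrated with a pandas `pivot_table`. The column names and sample rows here are hypothetical:

```python
import pandas as pd

# Hypothetical transaction-level data; column names are illustrative.
tx = pd.DataFrame({
    "account_id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(["2016-01-05", "2016-02-10", "2016-02-20",
                            "2016-01-15", "2016-03-01"]),
    "tx_type": ["debit", "credit", "debit", "debit", "credit"],
    "amount": [120.0, 500.0, 80.0, 60.0, 700.0],
})

# One row per (account, quarter): average amount per transaction type
# plus total debits and credits, mirroring the summaries described above.
tx["quarter"] = tx["date"].dt.to_period("Q")
features = tx.pivot_table(index=["account_id", "quarter"],
                          columns="tx_type", values="amount",
                          aggfunc=["mean", "sum"], fill_value=0.0)
features.columns = [f"{agg}_{t}" for agg, t in features.columns]
```

The same pattern extends to monthly or yearly windows by changing the period frequency passed to `to_period`.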
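A minimal sketch of the modeling step, using scikit-learn's `LogisticRegression` with class weighting on synthetic imbalanced data. The real project trained on the engineered account-level features, and its exact weighting scheme is not specified; `class_weight="balanced"` is one common choice:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic features standing in for the account-level summaries.
rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 4))
# Rare positive class (~9%) playing the role of defaults.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.5, size=n) > 2.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" upweights the minority (default) class so the
# model does not learn to predict non-default for everything.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_tr, y_tr)
print(f"F1: {f1_score(y_te, model.predict(X_te)):.2f}")
```

F1 is the metric quoted in the results above; it balances precision and recall, which plain accuracy would hide at a ~2% default rate.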