Explaining a fraud detection model with Cloud AI Platform

Codelabs Practice series 1

Posted by Bhagya C on May 7, 2021

Hello people! As you all know, MLOps is one of the most interesting areas of AI/ML work nowadays. In this blog series, we will try to understand what exactly the term means and work through some hands-on exercises around it. Let us start.

I was hoping we could practice using Google Qwiklabs, but it seems it is no longer free. I remember starting some courses there and never completing them out of laziness, and I deeply regret that. So if you get a chance like that, please do take advantage of it.


That leaves us with Google Codelabs, which is also not a bad option for learning this stuff.

Let us look into the problem statement. If you want to try these things out yourself, feel free to check out Google Codelabs as well.

In this blog we will see how to use AI Platform Notebooks to build and train a model, and then understand the model's predictions with the Explainable AI SDK. Along the way, we will also learn how to address data imbalance, which comes up in many real-life problems. I have written a previous blog on that topic; if you have time, check that one out too 🔖.

Defining our problem statement

Fraud detection is a subset of anomaly detection, where we have to find the odd cases in a sea of normal data. ML comes into play because hand-written rules never end: there will always be new outliers (or zero-day cases) that the rules do not cover. The big challenges of this problem statement are

  1. Data imbalance: we have a huge number of genuine (non-fraud) examples and very few fraud examples
  2. Explainability: whenever we flag something as an anomaly, we should be able to explain why

I am trying this out in Colab, and hence can't use the Explainable AI SDK directly. But I will still cover how to use it, and what it gives you, at the end of the blog.

The steps that I am following in the code are as follows:

  1. Import all the necessary packages
  2. Download the dataset (we are using the synthetic financial fraud dataset available on Kaggle), as sketched below
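If you want to follow along in Colab, a minimal setup might look like this. The filename `fraud_data.csv` is a hypothetical placeholder; `isFraud` is the label column in the Kaggle synthetic financial fraud (PaySim) dataset.

```python
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 'fraud_data.csv' is a hypothetical local filename -- download the
# Kaggle synthetic financial fraud (PaySim) dataset and point this at it.
data = pd.read_csv('fraud_data.csv')

# 'isFraud' is the label column in that dataset: roughly 1% positives.
print(data['isFraud'].value_counts(normalize=True))
```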

Addressing the data imbalance issue

Once you have downloaded the data, you can see that only about 1% of the examples are fraud.
There are many methods available to address this (which I briefly mentioned in my previous blog); one of them is downsampling. In downsampling, instead of taking the entire majority class, we randomly sample a part of it, so that the fraud to non-fraud ratio moves from 1:99 to 25:75. A sketch follows.
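A minimal downsampling sketch, reusing the `data` frame loaded above. Hitting the 25:75 target ratio means keeping three non-fraud rows per fraud row:

```python
fraud = data[data['isFraud'] == 1]
non_fraud = data[data['isFraud'] == 0]

# Keep 3 non-fraud rows per fraud row so fraud ends up ~25% of the data.
non_fraud_sample = non_fraud.sample(n=len(fraud) * 3, random_state=42)

# Recombine the two classes and shuffle the rows.
balanced = pd.concat([fraud, non_fraud_sample]).sample(frac=1, random_state=42)
print(balanced['isFraud'].value_counts(normalize=True))  # ~0.25 / 0.75
```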

After addressing the data imbalance issue, split the data into train and test sets. The different features (columns) of the data live on different ranges, so we have to normalise them into a common range (if we were using a tree-based algorithm, this step would not be necessary). For example:
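A sketch of the split and normalisation, keeping only numeric columns for simplicity; a real pipeline would encode the categorical columns instead of dropping them:

```python
# Keep only numeric feature columns for simplicity; a real pipeline
# would encode the categorical columns instead of dropping them.
X = balanced.drop(columns=['isFraud']).select_dtypes(include=np.number)
y = balanced['isFraud']

train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Fit the scaler on the training set only, to avoid leaking test
# statistics into training. Tree-based models could skip this step.
scaler = StandardScaler()
train_x = scaler.fit_transform(train_x)
test_x = scaler.transform(test_x)
```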

The next step is also crucial in terms of data imbalance. Our data still has more non-fraud examples: even though we downsampled, we did not go for an equal 50:50 split because we wanted to preserve information, so some imbalance remains. However, our primary goal is to find the fraud examples, so we assign a higher weight to the fraud class so that the model understands this is the class we care most about.

In Keras, the `fit()` method has a `class_weight` parameter that lets us specify exactly how much weight to give examples from each class, based on how often they occur in the dataset.
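One common way to derive those weights is inverse class frequency (a sketch; other weighting schemes exist):

```python
# Weight each class inversely to its frequency, so a fraud example
# contributes as much to the loss as several non-fraud examples.
num_total = len(train_y)
num_fraud = int(train_y.sum())
num_non_fraud = num_total - num_fraud

class_weight = {
    0: num_total / (2.0 * num_non_fraud),  # ~0.67 at a 25:75 ratio
    1: num_total / (2.0 * num_fraud),      # ~2.0 at a 25:75 ratio
}
print(class_weight)
```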

Then we move to the big part: training and evaluating the model.

Here we are using the Keras Sequential model API, which lets us define our model as a stack of layers. There are several metrics that we need to keep track of while training.

We define our model along with some global parameters and stopping conditions. We set up the model by calling `make_model()`, then fit our data to it, which is the training step. Keep an eye on the training output to understand which metrics are being calculated and what results we are getting. A sketch of what this can look like follows.
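A sketch of `make_model()` and the training call; the layer sizes and hyperparameters here are illustrative assumptions, not the codelab's exact values:

```python
METRICS = [
    tf.keras.metrics.BinaryAccuracy(name='accuracy'),
    tf.keras.metrics.Precision(name='precision'),
    tf.keras.metrics.Recall(name='recall'),
    tf.keras.metrics.AUC(name='auc'),
]

def make_model():
    # A small fully connected binary classifier (illustrative sizes).
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(train_x.shape[1],)),
        tf.keras.layers.Dense(16, activation='relu'),
        tf.keras.layers.Dropout(0.25),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=METRICS)
    return model

# Stop training once validation loss stops improving.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=5, restore_best_weights=True)

model = make_model()
history = model.fit(train_x, train_y,
                    validation_split=0.2,
                    epochs=100,
                    batch_size=2048,
                    class_weight=class_weight,
                    callbacks=[early_stopping])
```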


Once the training is done, we can visualise the model metrics. The main metrics we use to understand the model are precision, accuracy, recall, AUC and loss. I will soon write another blog on interpreting these values (at least for me, I need a better understanding of what exactly is happening). For instance, we can plot them from the training history, as shown below.
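A quick way to plot those metrics from the Keras training history:

```python
import matplotlib.pyplot as plt

# history.history maps each metric name to one value per epoch.
for metric in ['loss', 'precision', 'recall', 'auc']:
    plt.plot(history.history[metric], label=metric)
plt.xlabel('epoch')
plt.legend()
plt.show()
```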

From the results, we can see that the model does reasonably well, with 85% accuracy. If you feel like something is missing, or you are not satisfied with what is happening inside, hopefully the upcoming blog will address that.

In a later blog, I will also continue from here and look at how to tune the parameters we defined earlier to improve model performance.

Explainable AI SDK

As I said, I could not test this out myself, but I am adding the code snippets here and we will explore them in the future.

First, we will save our model. Then we have to create metadata for the saved model to pass into the SDK.
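A sketch of that step, assuming the `explainable_ai_sdk` package is available (it comes preinstalled on AI Platform Notebooks); the export path is a hypothetical local directory:

```python
from explainable_ai_sdk.metadata.tf.v2 import SavedModelMetadataBuilder

export_path = 'fraud_model/1'  # hypothetical local export directory
model.save(export_path)        # export as a TF 2.x SavedModel

# Inspect the SavedModel's inputs and outputs, and write an
# explanation_metadata.json next to the model for the SDK to use.
builder = SavedModelMetadataBuilder(export_path)
builder.save_model_with_metadata(export_path)
```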

Check out the code for more details. That is a wrap for this blog; as promised, I will be back with more explanations.