While I was working at Impress, I dealt with an interesting project. In simple terms, it was a text classification system in which English text paragraphs were classified into one of two classes (hence a binary classification problem). The most challenging part was the imbalance of the dataset I had to work with. Data imbalance means that, of the two classes, one has significantly more examples than the other, which can induce bias in the model we are building.

Naturally, my research turned to mathematical models I could use for this kind of problem. An important point to note here is that model selection is not the first step. The first and foremost step is to understand the data and select the features; I will discuss those in another blog. Here I am jumping directly to the final part, where I started looking into models. There were two concerns I wanted to address:
1. The explainability of the model
2. The complexity of recreating the algorithm
Anyway, I started reading some research papers for more insight and will discuss them in detail. Let us dive in.
Firstly, as I mentioned earlier, we can address data imbalance during feature engineering with oversampling and/or undersampling techniques. In this blog, however, I will concentrate on how we can address it at the modelling stage.
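To make the oversampling idea concrete, here is a minimal sketch of the simplest variant, random oversampling, where minority examples are duplicated until the classes are balanced. This is a toy stand-in (the function name and data are my own, not from any library), and it is a naive alternative to SMOTE, which creates synthetic points instead of repeating existing ones:

```python
import random

def random_oversample(X, y, minority_label=1, seed=42):
    """Duplicate randomly chosen minority examples until classes balance.

    Naive alternative to SMOTE: no synthetic points, just repetition.
    """
    rng = random.Random(seed)
    minority = [(x, lbl) for x, lbl in zip(X, y) if lbl == minority_label]
    majority = [(x, lbl) for x, lbl in zip(X, y) if lbl != minority_label]
    resampled = minority[:]
    while len(resampled) < len(majority):
        resampled.append(rng.choice(minority))
    combined = majority + resampled
    rng.shuffle(combined)
    X_new, y_new = zip(*combined)
    return list(X_new), list(y_new)

# A 1:4 imbalance becomes balanced after oversampling.
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y)
print(sum(y_bal), len(y_bal) - sum(y_bal))  # 4 4
```

Undersampling would do the opposite: drop majority examples instead, at the cost of throwing away data.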
As a side note, classification problems are also known as predictive modelling: in layman's terms, we predict the probability that an incoming input belongs to one class or the other, based on the features present in the training set. The probabilities of all the classes sum to 1. Furthermore, imbalanced classification is also known as rare event prediction or extreme event prediction.
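In the binary case this "sum to 1" property falls out directly: a model produces one probability for the positive class, and the other class gets the complement. A small sketch (the score value here is hypothetical, just for illustration):

```python
import math

def sigmoid(z):
    """Map a real-valued model score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# In binary classification the model outputs P(class=1);
# P(class=0) is its complement, so the two always sum to 1.
score = 0.7  # hypothetical raw model score for one paragraph
p_positive = sigmoid(score)
p_negative = 1.0 - p_positive
print(round(p_positive + p_negative, 10))  # 1.0
```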
Imbalance can occur in many ways: it could be due to errors in data collection, or to biased sample collection. In our case, however, the imbalance was a characteristic of the problem statement itself.
Challenges of Imbalanced classification
There are two types of imbalance: slightly skewed and severely imbalanced. Slight imbalances are not normally a concern; in problems like fraud detection, however, the imbalance ratio can be in the range of 1:1000. Another interesting factor to note is that the minority class (the class with fewer examples) is usually our area of concentration. This means that the model's skill in correctly predicting the class label or probability for the minority class matters more than for the majority class. The major challenge is that most machine learning algorithms for classification are designed and demonstrated on problems that assume an equal distribution of classes. A naive application of such a model may therefore learn the characteristics of the majority class only and neglect the observations from the minority class.
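A quick sketch with made-up numbers shows why this is dangerous: at a 1:1000 ratio, a model that always predicts the majority class looks excellent on accuracy while being useless on the class we actually care about:

```python
# A 1:1000 imbalance: 10 minority (1) and 10,000 majority (0) labels.
y_true = [1] * 10 + [0] * 10_000
# A naive model that always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
minority_recall = sum(
    1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1
) / sum(y_true)

print(f"accuracy: {accuracy:.4f}")            # ~0.9990
print(f"minority recall: {minority_recall}")  # 0.0
```

Near-perfect accuracy, zero minority recall: this is exactly why metric choice matters for imbalanced problems.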
This is a very open problem: in most cases we have to identify and address it specifically for each dataset, and there is a plethora of examples of imbalanced data classification out there.
So let us see how we can deal with our imbalanced classification problem. Normally, there are a few methods, like SMOTE (which we can discuss in another blog), to reduce the imbalance in the dataset, or we can run trial and error on the popular general-purpose algorithms. But as engineers, we know the importance of reusing an existing algorithm, so another option is to find a research paper that deals with a problem statement more or less similar to ours, recreate it, and apply it.
The trial-and-error method is time-consuming, so we have to be more organised about what we plan to do.
So let us dive into the most awaited question of this blog: which algorithm should we choose?
The funny and intriguing answer is that there is no single method for this. According to Machine Learning Mastery (which is a very good blog, by the way, check it out 🔖), we can approach it in two ways: apply your favourite algorithm, or apply an algorithm that has previously worked.
So to get the best result, we have to try the available algorithms in a systematic fashion, and for that we need to set up a framework: a systematic way to analyse the data, select the features, throw both linear and non-linear algorithms at the problem, and then tune the hyperparameters for all of them. (Actually, all the possible ways to do these things are quite interesting, and I think I should create separate blogs for them as a continuation of this one.) Moreover, we have to pay close attention to the metric we choose to evaluate these algorithms, because it is around that metric value that we will tune the hyperparameters.
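The skeleton of such a framework can be sketched in a few lines: run every candidate model on the same data and rank them by a single chosen metric. This is a toy illustration with made-up threshold "models" standing in for real estimators; the function names and data are hypothetical, not from any library:

```python
def recall_on_minority(y_true, y_pred, minority=1):
    """Evaluation metric: fraction of minority examples correctly found."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == minority and p == minority)
    total = sum(1 for t in y_true if t == minority)
    return hits / total if total else 0.0

def spot_check(models, X, y, metric):
    """Run every candidate model on the same data and rank by the metric."""
    scores = {}
    for name, model in models.items():
        y_pred = [model(x) for x in X]
        scores[name] = metric(y, y_pred)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy data: one feature, minority class (1) sits above 0.8.
X = [0.1, 0.2, 0.3, 0.85, 0.9]
y = [0, 0, 0, 1, 1]

# Hypothetical "models": simple threshold rules standing in for estimators.
models = {
    "always_majority": lambda x: 0,
    "threshold_0.5": lambda x: int(x > 0.5),
    "threshold_0.8": lambda x: int(x > 0.8),
}
ranking = spot_check(models, X, y, recall_on_minority)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In practice the candidates would be real linear and non-linear estimators, and the metric evaluated with cross-validation, but the shape of the loop stays the same: one metric, many models, an honest ranking.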
I guess that is enough for an introduction to imbalanced data classification. In the next blog, I will discuss the metrics we can select for evaluating the models, the algorithms for classification, and methods to tune the parameters and improve their performance.