Thoughts on Machine Learning – Dealing with Skewed Classes

A challenge which machine learning practitioners often face is how to deal with skewed classes in classification problems. This tricky situation occurs when one class is heavily over-represented in the data set. A common example is fraud detection: a very large part of the data set, usually 9x%, describes normal activities, and only a small fraction of the records should be classified as fraud. In such a case, if the model always predicts “normal”, it is correct 9x% of the time. At first, the accuracy and thereby the quality of such a model might seem surprisingly good, but as soon as you perform some analysis and dig a little deeper, it becomes obvious that the model is completely useless. Just recently I stumbled upon this issue in the context of multiclass classification, which compounds the original problem. After trying different approaches, doing research, and giving it some thought, I was able to figure out a couple of workarounds and solutions which I would like to share with you in this post.

1. Precision & Recall

This isn’t really a solution to the problem, but it helps with evaluating the final model. As described in the fraud example, “accuracy” might not be the best metric for determining the quality of the model. Fortunately, the metrics “precision” and “recall” can help us out. Precision describes how many of the data records that were classified as fraud actually represent fraudulent activities. Recall, on the other hand, refers to the percentage of correctly classified frauds relative to the overall number of frauds in the data set. In case you are attracted to formulas:

    \[\text{Precision} = {\text{\#correctly\_classified\_fraud} \over \text{\#classified\_fraud}} = {\text{true\_positives} \over {\text{true\_positives}+\text{false\_positives}}}\]

    \[\text{Recall} = {\text{\#correctly\_classified\_fraud} \over \text{\#actual\_fraud}} = {\text{true\_positives} \over {\text{true\_positives}+\text{false\_negatives}}}\]

For multiclass classification, the usefulness of precision and recall as metrics is limited, as you would have to compute and analyze them for every class. In my opinion, it is better to simply take a quick look at the confusion matrix for the classified test set and determine whether at least some of the non-dominant classes got classified correctly.
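
To make this a bit more tangible, here is a minimal R sketch that derives a confusion matrix and per-class precision and recall from two label vectors. The vectors “actual” and “predicted” are assumptions and merely stand in for your own test-set labels and model output.

    # Minimal sketch: confusion matrix plus per-class precision and recall.
    # Assumes `actual` and `predicted` are factors sharing the same levels.
    conf_mat <- table(actual = actual, predicted = predicted)
    print(conf_mat)

    # Per class: precision = TP / (TP + FP), recall = TP / (TP + FN)
    precision <- diag(conf_mat) / colSums(conf_mat)   # columns = predicted class
    recall    <- diag(conf_mat) / rowSums(conf_mat)   # rows    = actual class
    round(cbind(precision, recall), 3)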

2. Stratified Sampling

Once you split up the data into train, validation and test set, chances are close to 100% that your already skewed data becomes even more unbalanced in at least one of the three resulting sets. Think about it: let’s say your data set contains 1000 records and 20 of those are labelled as “fraud”. As soon as you split up the data set into train, validation and test set, it is possible that the train set contains the majority of the “fraud” records or maybe even all of them. Although not as likely, the same could happen for the validation or test set, which is even worse, because then the machine learning algorithm has no chance at all to learn the hidden patterns of the “fraud” records. This is why you shouldn’t use random sampling when constructing the three different data sets, but stratified sampling instead. It ensures that the class proportions in the train, validation and test sets match those of the full data set, so the already existing problem of skewed classes is not intensified, which is what you want when creating high quality models. For R, the function “strata” of the package “sampling” provides functionality for stratified sampling.
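
If you prefer not to pull in the “sampling” package, the idea can also be sketched in a few lines of base R. The data frame “df”, its label column “label”, and the split fractions are assumptions chosen purely for illustration.

    # Rough base-R sketch of a stratified train/validation/test split.
    # `df` with a factor column `label` is an assumption; fractions are arbitrary.
    set.seed(42)
    stratified_split <- function(df, label_col, train_frac = 0.6, val_frac = 0.2) {
      by_class <- split(seq_len(nrow(df)), df[[label_col]])   # row indices per class
      parts <- lapply(by_class, function(rows) {
        rows    <- rows[sample.int(length(rows))]             # shuffle within the class
        n_train <- round(train_frac * length(rows))
        n_val   <- round(val_frac * length(rows))
        list(train = rows[seq_len(n_train)],
             val   = rows[n_train + seq_len(n_val)],
             test  = rows[-seq_len(n_train + n_val)])
      })
      list(train = df[unlist(lapply(parts, `[[`, "train")), ],
           val   = df[unlist(lapply(parts, `[[`, "val")), ],
           test  = df[unlist(lapply(parts, `[[`, "test")), ])
    }
    splits <- stratified_split(df, "label")
    prop.table(table(splits$train$label))   # ratios should mirror the full data set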

3. Limit the Over-Represented Class

If you have a lot of data, then simply limit the allowed number of “not fraud” samples in your data set. Unfortunately, in machine learning you can never have enough data, and you usually have less than you desperately need. However, this method is also useful for smaller data sets, as in most cases the left-out data records wouldn’t add a substantial amount of information to the ones already included.
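
As a rough illustration, undersampling the dominant class could look like the following sketch; the data frame “df”, the label values, and the 10:1 cap are assumptions, not recommendations.

    # Sketch of limiting (undersampling) the over-represented class.
    # `df`, the label values, and the 10:1 cap are assumptions for illustration.
    set.seed(42)
    normal_idx <- which(df$label == "normal")
    fraud_idx  <- which(df$label == "fraud")
    cap <- 10 * length(fraud_idx)                      # keep at most a 10:1 ratio
    keep_normal <- sample(normal_idx, min(cap, length(normal_idx)))
    df_limited  <- df[c(keep_normal, fraud_idx), ]
    table(df_limited$label)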

4. Weights/Cost

For most machine learning algorithms you can assign higher costs to “false positives” and “false negatives”. The higher the cost of a certain type of error, the more the model is steered away from making it. This way, you can influence either precision or recall and enhance the overall performance of the model. However, as one of the two metrics gets better, the other one inevitably gets worse. It’s a tradeoff. This principle of costs can also be applied to a multiclass classification scenario, where it is possible to assign different weights to specific classes. The higher the weight of a class, the more likely records are to be assigned to it. For SVM in R, the argument “class.weights” of the “svm” function in the e1071 package allows for such an adjustment of the importance of each class. One more hint to ease understanding: to my knowledge, a high weight for a class acts like a high cost for misclassifying records of that class.
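
A minimal sketch of how such class weights could be passed to the “svm” function is shown below; the column names, data sets and the concrete weight values are assumptions picked only for illustration.

    # Sketch: class weights with e1071::svm. Data and weight values are assumptions.
    library(e1071)
    w <- c(normal = 1, fraud = 50)                     # up-weight the rare class
    model <- svm(label ~ ., data = train_set, class.weights = w)
    pred  <- predict(model, test_set)
    table(actual = test_set$label, predicted = pred)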

5. Treat it as an Anomaly Detection Problem

In anomaly detection, the basic idea is to predict the probability of every record being an anomaly, e.g. fraud. This is done by looking at the feature values of an unseen data record and comparing them to those of all other data records which are known to be normal. If the probability turns out to be below a certain threshold \(\boldsymbol{\epsilon}\), for example 5%, then the data record is classified as an anomaly. Especially anomaly detection using a multivariate Gaussian distribution looks promising to me. However, anomaly detection cannot be applied to multiclass classification settings with skewed classes. The algorithm would only be able to tell which of the data records don’t belong to any of the labeled classes and therefore should be classified as something like “other”. Unfortunately, I don’t have any practical experience with anomaly detection algorithms yet and know neither the performance nor the pitfalls of such an approach. But for further information, I can recommend the anomaly detection chapter of the Machine Learning class at Coursera.
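
For completeness, a rough, untested sketch of what such an approach could look like with a multivariate Gaussian is given below. It relies on the “mvtnorm” package; the objects “normal_data” and “new_data” as well as the way the threshold is picked are assumptions, not a proven recipe.

    # Rough, untested sketch of anomaly detection with a multivariate Gaussian.
    # `normal_data` (numeric features of known-normal records) and `new_data`
    # are assumptions; epsilon would normally be tuned on a labelled validation set.
    library(mvtnorm)
    mu    <- colMeans(normal_data)
    sigma <- cov(normal_data)
    p <- dmvnorm(as.matrix(new_data), mean = mu, sigma = sigma)   # density per record
    epsilon    <- quantile(p, 0.05)        # placeholder threshold: 5th percentile
    is_anomaly <- p < epsilon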

Hopefully this little list helps you once you have to deal with skewed classes. For me, applying the 2nd and 3rd approach did the trick. In case you know other effective ways of tackling this problem, I would be more than happy to read about them in the comments below.

Update: A lot of insightful comments were also posted on hackernews.

10 Comments

  1. Hi There,

    Great article. I came across this class imbalance problem this summer. There are the undersampling and oversampling methods you’ve mentioned above, which unfortunately get rid of useful training data. One technique I found particularly nice is Ensemble Learning, which relates to the weight/cost you’ve mentioned above. In this case, you don’t get rid of useful data, and you actually do what you really want to do in the first place, which is to boost the presence of the minority class. This is done through an ‘ensemble’ of weak classifiers/learners with their associated weights.

  2. Hi Peter,
    specificity is not quite the same as precision because it focuses on the true negatives instead of the true positives. Other than that you’re right. It’s simply a different terminology in information retrieval. All of that is even mentioned at the end of the Wikipedia article.

    Chris, Bruno,
    thanks for the feedback. On hackernews, Ensemble Learning was one of the mentioned solutions as well (Link: http://news.ycombinator.com/item?id=4440560). Seems to be worth a try!

  3. Be careful; if a bank used stratified data sets for their fraud detection, they’d get a lot of false positives.

    Only use stratified data if you consider the classes to be a priori uniformly distributed (that is, each class is equally likely) in the real world as well. Otherwise you are bending the real world to your classifier, not the other way around.

  4. I would suggest you look at Hellinger distance based decision trees (HDDT) for class imbalance. HDDT is a powerful technique for class imbalance in binary classification and turns out to be moderately useful with multi-class problems.

    And with respect to sampling-based techniques for class imbalance, you can look at SMOTE with data cleansing.

  5. Hi Florian!

    I see that your post is quite old, yet I hope you’ll read my comment anyway.
    I’m dealing right now with a fraud classification problem on a large dataset of skewed data.
    I’m trying to follow their method: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4338292/pdf/pone.0117844.pdf
    that partially matches the philosophy of your list of suggestions.

    I have one question/doubt about your point #2: I understand the need to re-balance the training set, but I’m wondering if it’s really correct to also re-balance the test set. My doubt comes from considering that in reality one always has skewed data to predict. So why test the performance of the predictive model on a re-balanced test set?

  6. Hey Flaminia, stratified sampling is different from re-balancing (e.g. oversampling or undersampling). It makes sure that the ratio between the classes in the train, validation, and test sets is the same as in the whole dataset, which is a desirable property. By the way, over the years I’ve learned that if you’re using a properly regularized probability estimator like Logistic Regression, then imbalanced datasets aren’t an issue as long as you adjust the classification threshold accordingly.

  7. I agree with Justin Bayer (http://j.bitwiese.de/) that altering the distribution of data is repulsive.
    I think we must try harder to overcome this with analytical techniques…
    Florian, this is a very useful reflection, and TY.
