What is a model?

A model represents a real-world scenario up to some error term, epsilon.

Y = f(X) + epsilon
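As a rough illustration of the equation above, the following Python sketch generates a few observations from an assumed f(X) plus random error. The function `f`, its coefficients, and the Gaussian noise level are hypothetical choices made only for this illustration; they do not come from any real data set.

```python
import random

def f(x):
    # Hypothetical "true" relationship, chosen only for illustration.
    return 2.0 * x + 1.0

random.seed(42)  # fixed seed so the sketch is repeatable

xs = [float(i) for i in range(5)]
# epsilon: random error, assumed here to be Gaussian with mean 0 and std. dev. 0.5
epsilons = [random.gauss(0.0, 0.5) for _ in xs]
# Y = f(X) + epsilon
ys = [f(x) + e for x, e in zip(xs, epsilons)]

for x, y, e in zip(xs, ys, epsilons):
    print(f"X={x:.1f}  f(X)={f(x):.1f}  epsilon={e:+.3f}  Y={y:.3f}")
```

Each printed Y differs from f(X) by exactly the epsilon drawn for that observation, which is the part a model cannot explain.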

What is an Imbalanced Target Variable?

Let us first go through a few real-world examples:

  • Telecom Domain:

In telecom, subscribers frequently move from one mobile operator to another for better service or offers. This phenomenon, known as customer churn, ranges from 5 to 10%. To model it, every customer in the database is coded as 1 (CHURN) or 0 (ACTIVE). Since the number of active customers far outweighs the number of churn customers, the distribution of the target is far from uniform and the data set is called imbalanced.

Target Variable (Binary)    # of Observations    Target Variable
1                           50,000               1 = CHURN customers
0                           1,000,000            0 = ACTIVE customers
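To make the imbalance in the telecom table concrete, here is a minimal Python sketch that turns the two counts into class shares and a majority-to-minority ratio. It reads the ACTIVE count as one million (the source writes it with Indian digit grouping); the variable names are illustrative only.

```python
# Counts taken from the telecom table above.
counts = {"CHURN (1)": 50_000, "ACTIVE (0)": 1_000_000}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.2%}")

# Imbalance ratio: majority count divided by minority count.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"Imbalance ratio: {imbalance_ratio:.0f} : 1")
```

With these counts there are 20 active customers for every churn customer.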

 

  • Healthcare Domain:

A multi-specialty hospital wanted to predict whether a patient is prone to diabetes, now or in the near future. Modern conveniences have led to a more sedentary lifestyle globally, causing an explosion in the rate of diabetes. Recent studies have shown that close to 92.5% of all the patients were diabetic or prone to diabetes, and only 7.5% were found to be healthy.

Target Variable (Binary)    # of Observations    Target Variable
1                           925                  1 = Patients prone to Diabetes
0                           75                   0 = Patients without Diabetes symptoms

 

What is a Rare Event?

An event is said to be rare if it occurs very infrequently.

In both scenarios above, Telecom and Healthcare, management was interested in predicting (modelling) CHURN customers and patients without Diabetes symptoms, respectively. These two events are called rare events, since each occurs far less often than the other level of the target variable (Y).
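Note that the rare event is not always the level coded 1: in the telecom example it is 1 (CHURN), while in the healthcare example it is 0 (patients without Diabetes symptoms). A small Python sketch, using a hypothetical helper named `rare_level`, picks out the least frequent level from the counts given earlier (reading the telecom ACTIVE count as one million):

```python
def rare_level(counts):
    """Return the target level with the fewest observations and its share."""
    total = sum(counts.values())
    level = min(counts, key=counts.get)
    return level, counts[level] / total

telecom = {1: 50_000, 0: 1_000_000}   # 1 = CHURN, 0 = ACTIVE
healthcare = {1: 925, 0: 75}          # 1 = prone to Diabetes, 0 = without symptoms

print(rare_level(telecom))      # rare level is 1 (CHURN)
print(rare_level(healthcare))   # rare level is 0 (without symptoms)
```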

How will you statistically evaluate whether the Target Variable is imbalanced / skewed?

Perform a Chi-Square goodness-of-fit test against a uniform distribution. The results below were obtained with R (open-source software) and with Minitab.

Chi-Square Test conducted using R

                    Patient.Count
Diabetes                      925
Without Diabetes               75

Null Hypothesis: Data is uniformly distributed

Alternative Hypothesis: Data is not uniformly distributed

Chi-squared test for given probabilities

data:  Clinical.Test[, 1]
X-squared = 722.5, df = 1, p-value < 0.00000000000000022

 

Chi-Square Test conducted using Minitab

Chi-Square Goodness-of-Fit Test for Observed Counts in Variable: Count

Using category names in Disease

Category  Observed  Test Proportion  Expected  Contribution to Chi-Sq
Y              925              0.5       500                  361.25
N               75              0.5       500                  361.25

   N  DF  Chi-Sq  P-Value
1000   1   722.5    0.000

 

Since the p-value is below 0.05 (a commonly chosen alpha value), we reject the null hypothesis and conclude that the data is not uniformly distributed.
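The test statistic reported by R and Minitab can be checked by hand. The Python sketch below recomputes it from the observed counts using only the standard library; it is not the command used to produce the outputs above. The p-value line relies on a closed form that is valid only for one degree of freedom, which is the case here.

```python
import math

observed = [925, 75]                                  # Diabetes, Without Diabetes
total = sum(observed)
expected = [total / len(observed)] * len(observed)    # uniform: 500 in each category

# Each category's contribution: (observed - expected)^2 / expected
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_sq = sum(contributions)
df = len(observed) - 1

# For df = 1 only, the upper-tail probability is P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_sq / 2))

print(contributions)      # contribution of each category
print(chi_sq, df)         # test statistic and degrees of freedom
print(p_value < 0.05)     # True means the null hypothesis is rejected at alpha = 0.05
```

The two contributions (361.25 each) and their sum (722.5) match the Minitab table, and the p-value is far below 0.05.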

How to overcome this problem?

This problem can be overcome by two main methods:

  • Sampling methods

✓  Over Sampling techniques

✓  Under Sampling techniques

  • Algorithms

✓  Penalized Likelihood Algorithms
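The two sampling ideas listed above can be sketched in their simplest, purely random form. This is only an assumption about what such techniques might look like; the source does not say which specific methods it has in mind, and penalized likelihood algorithms are not covered here. The function names and the toy rows are hypothetical.

```python
import random

def random_over_sample(minority, majority, rng):
    """Duplicate randomly chosen minority rows until both classes are equal in size.

    Assumes len(majority) >= len(minority).
    """
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return minority + extra, majority

def random_under_sample(minority, majority, rng):
    """Keep a random subset of majority rows the same size as the minority class.

    Assumes len(majority) >= len(minority).
    """
    return minority, rng.sample(majority, len(minority))

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Toy rows mirroring the healthcare counts: (row id, target level)
minority = [(i, 0) for i in range(75)]          # 0 = without Diabetes symptoms
majority = [(i, 1) for i in range(75, 1000)]    # 1 = prone to Diabetes

over_min, over_maj = random_over_sample(minority, majority, rng)
under_min, under_maj = random_under_sample(minority, majority, rng)

print(len(over_min), len(over_maj))     # both classes at the majority size
print(len(under_min), len(under_maj))   # both classes at the minority size
```

Over-sampling keeps every original row but repeats minority rows; under-sampling discards majority rows, so it loses information in exchange for a smaller data set.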

Disclaimer:

This blog provides a high-level explanation of imbalanced targets (Y). It is very important to employ sound countermeasures against imbalanced targets prior to any modeling activity.

A detailed blog on over-sampling will be published next.