Data Preparation and Pre-Modeling: What & Why?

Incorrect or incomplete data is among the most frequently cited reasons for analytics project failures, whether the data is big or small. To mitigate the problem, data-driven companies are investing in preparing and curating their data so that it is ready for analysis. It is commonly estimated that 60-70% of the time in an analytics project is spent on data capture and preparation, so robust data management tools are important for efficiency and time savings. In a predictive modeling environment, data preparation is closely associated with the pre-modeling phase.

In addition to creating metadata to describe the data, data preparation tools also perform the following steps:

  • Identify missing values and decide how they should be handled

  • Convert nominal data into indicator variables (dummy variables)

  • Transform data (in its original units) to meet model assumptions

  • Formulate derived variables from direct measures

The accuracy of a predictive model ultimately depends on the completeness and correctness of the data and on the algorithms chosen to construct the model.

Highlights of Serendio's PREMOD Package

R practitioners and users typically have to download multiple packages to perform the full range of pre-modeling steps. Our "PREMOD" package brings these functions together in one unified package for ease of use and increased productivity.

The following are the key functions in our PREMOD package:

Standardize:

Standardizes data from its original units ("X-scale") to standard scores ("Z-scale").
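
As a rough illustration of what a move to the Z-scale involves, here is a minimal sketch in Python rather than R. The function name standardize is hypothetical and is not PREMOD's API. It assumes the sample standard deviation (n - 1 denominator), which is also what R's scale() uses.

```python
import statistics


def standardize(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # assumption: sample SD, n - 1 denominator
    if sd == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mean) / sd for v in values]


z_scores = standardize([2.0, 4.0, 6.0])
```

After standardizing, the values have mean 0 and standard deviation 1, so columns measured in different units become directly comparable.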

Transformations:

Transforms counts and proportions to help meet model assumptions, for example by stabilizing the variance.

  • Count transformations

  • Proportion transformations (for proportions derived from count data)
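
The text does not say which transformations PREMOD applies. As an assumption for illustration, the sketch below uses two classical variance-stabilizing choices: the square root for counts and the arcsine square root for proportions derived from counts. The function names are hypothetical, and the code is Python rather than R.

```python
import math


def sqrt_count_transform(counts):
    """Square-root transform, a classical variance stabilizer for counts."""
    if any(c < 0 for c in counts):
        raise ValueError("counts must be non-negative")
    return [math.sqrt(c) for c in counts]


def arcsine_proportion_transform(successes, trials):
    """Arcsine square-root transform for proportions built from counts."""
    result = []
    for s, n in zip(successes, trials):
        if n <= 0 or not 0 <= s <= n:
            raise ValueError("need 0 <= successes <= trials and trials > 0")
        result.append(math.asin(math.sqrt(s / n)))
    return result


counts_t = sqrt_count_transform([0, 4, 9])
props_t = arcsine_proportion_transform([0, 5, 10], [10, 10, 10])
```

The proportion version takes the underlying counts (successes and trials) rather than a bare ratio, which mirrors the note that the proportions should come from count data.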

Optimal Lambda:

The power (lambda) needed to transform non-normal data toward normality. The transformation is applied by raising every value in the data set to the power lambda.
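
How PREMOD searches for the optimal value is not described here. One common approach, assumed in this sketch, is to scan a grid of candidate powers and keep the one that maximizes the Box-Cox profile log-likelihood. The function names, the grid range of -2 to 2, and the step count are all illustrative choices, and the code is Python rather than R.

```python
import math


def boxcox_transform(values, lam):
    """Box-Cox power transform; lambda = 0 is the log transform."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def boxcox_loglik(values, lam):
    """Profile log-likelihood of lambda under a normal model."""
    n = len(values)
    y = boxcox_transform(values, lam)
    mean_y = sum(y) / n
    var_y = sum((v - mean_y) ** 2 for v in y) / n  # maximum-likelihood variance
    log_sum = sum(math.log(v) for v in values)
    return (lam - 1.0) * log_sum - 0.5 * n * math.log(var_y)


def optimal_lambda(values, low=-2.0, high=2.0, steps=40):
    """Grid search for the lambda with the highest log-likelihood."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    grid = [low + (high - low) * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda lam: boxcox_loglik(values, lam))


# Values whose logarithms are symmetric around zero.
sample = [math.exp(z) for z in (-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0)]
best = optimal_lambda(sample)
```

A lambda close to 1 suggests the data needs no transformation, while a lambda close to 0 points to a log transform.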

Creating Indicator Variables:

Converts nominal data into indicator variables (also known as dummy variables) so that it can be used in modeling, for example through reference coding or effect coding.
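
The difference between the two coding schemes is easiest to see side by side. The sketch below is a plain-Python illustration with hypothetical function names, not PREMOD's interface. Both schemes produce one column per non-reference level. Reference coding gives the reference level all zeros, while effect coding gives it all -1.

```python
def _other_levels(values, reference):
    """Sorted levels excluding the reference level."""
    if reference not in values:
        raise ValueError("reference level does not occur in the data")
    return sorted(set(values) - {reference})


def reference_coding(values, reference):
    """Return (levels, rows); the reference level is coded as all zeros."""
    levels = _other_levels(values, reference)
    rows = [[1 if v == lvl else 0 for lvl in levels] for v in values]
    return levels, rows


def effect_coding(values, reference):
    """Return (levels, rows); the reference level is coded as all -1."""
    levels = _other_levels(values, reference)
    rows = []
    for v in values:
        if v == reference:
            rows.append([-1] * len(levels))
        else:
            rows.append([1 if v == lvl else 0 for lvl in levels])
    return levels, rows


colors = ["red", "green", "blue", "red"]
ref_levels, ref_rows = reference_coding(colors, reference="red")
eff_levels, eff_rows = effect_coding(colors, reference="red")
```

With reference coding, model coefficients are read as differences from the reference level. With effect coding, they are read as differences from the overall mean across levels.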

Graphical Summary:

Produces a graphical summary of univariate data for visual inspection, including a histogram, a box-and-whisker plot, a run chart, and an autocorrelation chart.
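
Of these charts, the autocorrelation chart is the one that depends on a computed statistic rather than the raw values. As a hedged sketch of the numbers behind such a chart, the Python function below computes the usual sample autocorrelation at a given lag. The name autocorrelation is hypothetical, and the exact estimator PREMOD plots is not stated.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(series)")
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("series is constant")
    # Sum of products of deviations that are 'lag' steps apart.
    num = sum(dev[t] * dev[t + lag] for t in range(n - lag))
    return num / denom


# One bar of an autocorrelation chart per lag.
bars = [autocorrelation([1, 2, 3, 4, 5], k) for k in range(3)]
```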

Mean Absolute Deviation (MAD) & Mean Square Deviation (MSD):

Calculates MAD and MSD for the specified column. These measures are widely used in time-series analysis.
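
The following Python sketch shows the two formulas. It assumes the deviations are measured from the column mean, since only a single column is mentioned. In forecasting work the same formulas are commonly applied to the differences between actual and fitted values instead. The function names are hypothetical.

```python
def mean_absolute_deviation(values):
    """Mean of absolute deviations from the column mean (assumed center)."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)


def mean_square_deviation(values):
    """Mean of squared deviations from the column mean (assumed center)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


column = [2.0, 4.0, 6.0, 8.0]
mad = mean_absolute_deviation(column)
msd = mean_square_deviation(column)
```

MSD penalizes large deviations more heavily than MAD because the deviations are squared before averaging.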

Normality Test:

Tests whether a set of values is normally distributed.
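
Which normality test PREMOD implements is not stated. Purely as an illustration, the Python sketch below uses the Jarque-Bera test, which compares the sample skewness and kurtosis with the values expected under normality (0 and 3). The function names and the 0.05 significance level are assumptions. The p-value relies on a large-sample chi-square approximation with 2 degrees of freedom, so it should be treated with caution for small samples.

```python
import math


def jarque_bera(values):
    """Return (statistic, p_value) for the Jarque-Bera normality test."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three values")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("values are constant")
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    stat = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    # Chi-square survival function with 2 degrees of freedom is exp(-x / 2).
    p_value = math.exp(-stat / 2.0)
    return stat, p_value


def looks_normal(values, alpha=0.05):
    """True when the test does not reject normality at level alpha."""
    return jarque_bera(values)[1] > alpha
```

A True result means only that the test found no evidence against normality, not that the data is proven to be normal.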

Descriptive Measures:

Computes Skewness & Kurtosis for a set of values.
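
Several conventions exist for both measures. The Python sketch below assumes the simple moment-based definitions, with kurtosis reported on the scale where a normal distribution scores 3 (not excess kurtosis). PREMOD may use a different convention, and the function name is hypothetical.

```python
def skewness_kurtosis(values):
    """Return (skewness, kurtosis) from population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("values are constant")
    skew = m3 / m2 ** 1.5   # 0 for perfectly symmetric data
    kurt = m4 / m2 ** 2     # 3 for a normal distribution
    return skew, kurt


skew, kurt = skewness_kurtosis([1.0, 2.0, 3.0, 4.0, 5.0])
```

Positive skewness indicates a longer right tail, and kurtosis above 3 indicates heavier tails than a normal distribution.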

Imputation:

Imputation is not provided as a function. To replace missing values in a data set, first test the existing values for normality. If the data is normally distributed, replace the missing values with the mean; otherwise, replace them with the median.
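
The rule above can be sketched in a few lines of Python. The function impute_missing is hypothetical, missing values are assumed to be represented as None, and the normality check is passed in as a callable so that any normality test can be plugged in.

```python
import statistics


def impute_missing(values, is_normal):
    """Fill None entries with the mean if the data looks normal, else the median.

    is_normal is a caller-supplied function that takes the observed
    values and returns True or False.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    if is_normal(observed):
        fill = statistics.fmean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


# Placeholder checks stand in for a real normality test.
filled_mean = impute_missing([1.0, 2.0, None, 3.0], is_normal=lambda obs: True)
filled_median = impute_missing([1.0, None, 2.0, 100.0], is_normal=lambda obs: False)
```

The median is preferred for non-normal data because it is not pulled toward extreme values the way the mean is.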

To learn more and download the PREMOD package, go to

https://bitbucket.org/rajeshp_serendio/premod/downloads