

  • 308
    What features of Sony PS3 do users care about the most?
  • 304
    Dell XPS 14z Vs Toshiba Portege Z835 Vs HP Envy 4 Sleek book Vs Acer Aspire S5
  • 301
    Windows 8 Smartphones – Nokia Lumia 920 vs HTC Windows Phone 8X
  • 298
    Lexicon based Sentiment Analysis in Social Media
  • 295
    Big Data Infrastructure Management – Created new Revenue Streams while Reducing costs by 60%
  • 292
    Next Generation Analytics Architecture for Business Advantage
  • 288
    A Scalable Data Transformation Framework Using the Hadoop Ecosystem

    Download (PDF, 1.1MB)

  • 476
    Predicting Customer Churn in Telecom

    Problem Description
    Consumers today go through a complex decision-making process before subscribing to any one of the numerous Telecom service options – Voice (Prepaid, Post-Paid), Data (DSL, 3G, 4G), Voice+Data, etc. Since the services provided by the Telecom vendors are not highly differentiated, and number portability is commonplace, customer loyalty becomes an issue. Hence, it is becoming increasingly important for telecommunications companies to proactively identify customers who tend to unsubscribe and take preventive measures to retain them.

    The aim of this blog post is to introduce a predictive model to identify the set of customers who have a high probability of unsubscribing from the service now or in the near future using Personal Details, Demographic information, Pricing and the Plans they have subscribed to. A secondary objective is to identify the Independent Variables (aka ‘Predictors’) that have the greatest impact on the Dependent Variable (Y), i.e., on whether a customer unsubscribes.

    Data Description
    Input data:
    6 months of data with 3 million transactions

    Predictors / Independent Variables (IV) considered:
    – Customer Demographics (Age, Gender, Marital Status, Location, etc.)
    – Call Statistics (Length of calls like Local, National & International, etc.)
    – Billing Information (what the customer paid for)
    – Voice and Data Product (Broadband services, Special Data Tariffs, etc.)
    – Complaints and Disputes (customer satisfaction issues and the remedial steps taken)
    – Credit History

    On the output:
    – Target / Response considered for the model:
    – The value ‘1’ indicates UNSUBSCRIBED or CHURN customers
    – The value ‘0’ indicates ACTIVE customers

    Note: For the sake of brevity, I am ignoring the steps taken to clean, transform, and impute the data.

    Partitioning the Data
    In any Predictive Model work, the data set has to be partitioned appropriately so as to avoid overfitting/underfitting issues among other things.


    Prediction Accuracy & Model Selection
    Models built on the TRAINING DATA set are validated using the VALIDATION DATA set. It is common to build multiple models, including ensembles, and compare their performance. The model that eventually gets deployed is the one that benefits the business the most, while keeping the error rate within acceptable limits.

    Here are the 2 common error types in Churn Prediction:

    False Negative – Failing to identify a customer who has a high propensity to unsubscribe. (With churn as the positive class, this is a Type II error in standard statistical terminology.)

    From a business perspective, this is the least desirable error as the customer is very likely to quit/cancel/abandon the business, thus adversely affecting its revenue.

    False Positive – Classifying a good, satisfied customer as one likely to Churn. (In standard terminology, a Type I error.)

    From a business perspective, this is acceptable as it does not directly impact revenue.

    Any Predictive Algorithm that goes into Production will have to be the one with the lowest False Negative rate.

    In our case, we used multiple algorithms on a Test data set of 300k transactions to predict Churn. Shown below are the results from the top 2 performing algorithms:

    Algorithm 1: Decision Tree



    Algorithm 2: Neural Networks



    Though the overall Error Rate of the Neural Network was lower than that of the Decision Tree, the Decision Tree model was chosen for deployment because the Neural Network had the higher False Negative rate.

    The model was chosen not only based on Prediction Accuracy, but also based on the impact of False Negatives.
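
    The selection rule above can be sketched from a confusion matrix. The result tables of the original post did not survive, so every count below is a hypothetical placeholder (chosen only to add up to 300k test rows), and the class and method names are made up for illustration. The sketch shows how a model with a lower overall error rate can still miss a larger share of actual churners.

```java
// Illustrative only: the counts are hypothetical, not the post's results.
class ChurnModelSelection {
    /** Share of actual churners the model missed: FN / (FN + TP). */
    static double falseNegativeRate(long tp, long fn) {
        return (double) fn / (fn + tp);
    }

    /** Share of active customers wrongly flagged as churners: FP / (FP + TN). */
    static double falsePositiveRate(long fp, long tn) {
        return (double) fp / (fp + tn);
    }

    /** Overall error rate: (FP + FN) / all predictions. */
    static double errorRate(long tp, long fp, long tn, long fn) {
        return (double) (fp + fn) / (tp + fp + tn + fn);
    }

    public static void main(String[] args) {
        // Order of counts: tp, fp, tn, fn (hypothetical, 300k rows each)
        long[] tree = {24_000, 21_000, 249_000, 6_000};
        long[] net = {18_000, 9_000, 261_000, 12_000};

        double treeFnr = falseNegativeRate(tree[0], tree[3]);
        double netFnr = falseNegativeRate(net[0], net[3]);

        System.out.println("Decision Tree  error=" + errorRate(tree[0], tree[1], tree[2], tree[3])
                + " missed-churn rate=" + treeFnr);
        System.out.println("Neural Network error=" + errorRate(net[0], net[1], net[2], net[3])
                + " missed-churn rate=" + netFnr);

        // Deploy the model that misses the fewest churners, not the one with the lowest overall error.
        System.out.println("Chosen: " + (treeFnr <= netFnr ? "Decision Tree" : "Neural Network"));
    }
}
```

    With these made-up counts the Neural Network has the lower overall error rate, yet the Decision Tree misses fewer churners, which is the trade-off described above.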

    R was used to build, validate, and test the models with the 3 million transaction data set.
    It was re-implemented in Spark/MLlib/Scikit-Learn/HDFS to deal with larger data sets.

    – The model predicts the likelihood of Customer Churn with high accuracy.
    – Key variables that were impacting Customer Churn or causing significant impact on the ‘Y’ were:
    – Age (age groups 21-40)
    – Salary (lower salaries)
    – Data Usage (those who used more data services)

  • 195
    Increasing Your Online Sales Conversions

    The Internet is teeming with buyers, consumers and advertisers! In fact, reports indicate that the United States alone has over 52,000 active online retailers. An average individual receives over 20,000 marketing messages per day, all while the general consumer attention span is just about 5 seconds!

    As a seller you have those 5 seconds to get the consumer’s attention. For this you need to understand their behavior, interests and preferences: what exactly are they looking for? Will they buy now? What will they pay for it? You then need to suggest the right product or service at the right time. This is where Big Data Science comes into play. Revealing the valuable secrets hidden in the zillion terabytes of data created every day can enable you to create a personalized experience for the consumer.

    Data-driven techniques have ushered in an era of extreme targeting and personalization, especially in the areas of advertising and eCommerce. Every digital cookie crumb, every action (or inaction) is recorded, aggregated, and analyzed to the minutest of details. But do businesses actually see any improvement in conversions, revenues or loyalty through this extreme data-centric approach?

    To answer this question, let us look at the performance of two contrasting loyalty programs:

    • The Legacy Loyalty program: A generic, untargeted, one-size-fits-all technique. It was used to send emails, and text messages with links to coupons. In fact, it only used the consumer’s email and phone number; it did not take into account customer demographics, past purchase history, browsing habits or any other digital information. While this program was easy to deploy, it only improved sales minimally.
    • The Targeted Loyalty program: This program takes into account every aspect of the consumer profile and his/her behavior over time. Since the data necessary for analysis and targeting can only be collected over time, the program started with the one-size-fits-all, legacy approach and evolved to be smarter and more personalized. The emails and the text messages were targeted based on customer profile, browsing/purchase/return history, and product inventory. It also integrated many disparate data sources, crunched through millions of records every day, and used a lot more computing resources and special targeting algorithms.

    A mid-sized eCommerce company deployed both the Loyalty programs simultaneously for three months (A/B testing) on a select set of product categories to measure and compare the outcomes of the two approaches.

    Not surprisingly, the company’s existing Legacy Loyalty program indicated a meager 3% increase (it was considered meager only because the baseline was small), while the Targeted program showed a higher conversion rate of 9%.


    The Legacy Loyalty program was non-personalized; the marketing messages were the same for every consumer, whether they were looking for apparel, toys, cosmetics, or consumer electronics. On the other hand, the Targeted Loyalty program provided personalized recommendations based on Big Data Science techniques resulting in higher conversion rates.

    Today’s consumers have absolutely no time for clutter. Irrelevant information is cast aside even before the first glance. A more refined approach, as presented by the Targeted Loyalty program, connects the Consumer with the goods and services they need, thus increasing sales, customer satisfaction, and overall business efficiency.

  • 177
    Managed Services

    We provide on-going Support, Maintenance, Migration, Benchmarking, Infrastructure Management, and Training services for all our solutions.

    You are in reliable hands when it comes to Big Data Science!

    Starting at $3500 per month.

    Learn More

  • 175
    Implementation Services

    Using a team of highly experienced Data Engineers and Data Scientists, we build modern data analytics platforms that extract insights and business value from data.

    Our agile methodology and pre-built solution accelerators ensure that key technology milestones are delivered quickly and within budget.

    We innovate with you and for you!

    Starting at $25,000.

    Learn More

  • 173
    Strategy and Advisory Services

    In collaboration with your executive and technical stakeholders, we evaluate the data-driven business objectives and recommend an implementation roadmap.

    Highlights include a prioritized list of business use cases; key technologies, architecture and infrastructure; and skills gap analysis.

    We help transform your data-driven business vision to reality!

    Starting at $10,000.

    Learn More

  • 1
    Hello world!

    Welcome to WordPress. This is your first post. Edit or delete it, then start blogging!

  • 198
    Finding the Right Social Influencer for Your Brand

    Social media is an integral part of our lives. Every consumer you want to target is online making a purchase, checking into a restaurant, planning a holiday or hiring a cab. Facebook alone has over 1 billion users and on any given day Twitter sees over 500 million tweets! Reports suggest that 73% of millennials feel that it is their responsibility to help friends and family make smart purchase decisions.

    Consumers are strongly opinionated and the way they are influencing each other online is creating a whole new type of marketing, which we now call ‘Social Influence’! There is a very clear and binding relationship between social influencers, the brand and the buying decision.

    Traditionally, Brand awareness was built through a combination of targeted ads and celebrity endorsements. However, a fast-emerging trend today is to identify influencers who are already talking about your brand and to use them to promote products and services within their social circles. Though this technique is cost-effective and highly targeted, the challenge is to identify the key influencers who talk specifically about topics related to your brand.

    What we need is a reliable and transparent quantitative method. A good indicator of a valuable influencer is his/her ‘Klout Score’. This score is based on the overall social activity of an individual and not on specific topics (e.g., Obama has a Klout score of 94 out of 100, regardless of which topic he is talking about).

    But, we need to have an Influence scoring algorithm that will address the shortcoming of Klout by scoring the influencer by content sources and specific topics. This approach gives content marketing companies the maximum flexibility to discover and engage with social influencers, and through them promote goods and services.


    The influence score can be computed through a statistical technique called Principal Component Analysis (PCA). Simply put, PCA is a technique that enables a large number of variables/dimensions (i.e., reach and engagement metrics such as likes, followers, re-tweets, favorites, etc.) to be described adequately by a smaller set of dominant dimensions (the principal components, computed for the specific topics/subjects that we are looking at) with minimal loss of information.

    With PCA you can compute the influence score for a specific social media site and category/topic combination. Here is a step-by-step example to spot social influencers on Twitter:

    • Step 1: Get the metadata (e.g., Barack Obama: Followers: 55 million, Following: 650k, Tweets: 13.1k).
    • Step 2: Collect metadata for a set period (e.g., tweets from the last week, the last month, or a particular period such as 3 months).
    • Step 3: Merge the collected metadata with ‘Post’ level metadata as shown below:

    Twitter Dataset used for Politicians

    Personality Tweets Followers Following Net Followers Re-Tweets
    1 Obama 13100 53700000 645000 53055000 1309 2144
    2 Francois Hollande 4362 858000 1563 856437 10120 4920
    3 Vladimir Putin 636 194000 9 193991 59.2 82.4
    4 Al Gore 1492 2790000 27 2789973 122.8 90.8
    • Step 4: Compress the dimensionality (i.e., Facebook likes, tweets, etc.) and compute weightages using PCA
    • Step 5: Get down to the granular level of data. This will help you understand exactly what the data represents and why the individual you have chosen ranks as a social influencer for your brand.
    • Step 6:  Explore the granular level of data individually for all members for the set category and rank them as shown in the table below:


    Politicians Weightage_pc1 Weightage_pc2 Distance Rank using PCA Klout
    Obama 1.668673 -1.77429135 2.435689 1 99
    Francois Hollande 1.235698 2.06373893 2.405404 2 88
    Vladimir Putin -1.532274 -0.08591237 1.537942 3 90.8
    Al Gore -1.372097 -0.20353522 1.384622 4 86

    PCA helps you derive the ‘distance’ values, which can then be transformed to a scaled score from 1–100. In the above tables the Klout score is shown for comparison purposes only. By repeating the above process for different social media sites and categories/topics, we can compute the Influence Score for each topic.
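
    The ranking step can be sketched from the two component weightages in the table. The post does not define ‘Distance’; the sketch below assumes it is the Euclidean norm of (Weightage_pc1, Weightage_pc2), which reproduces the first two rows of the table closely and the last two only approximately, so treat the definition as an assumption. The class and method names are made up for illustration, and the PCA itself is not reimplemented here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class InfluenceRanking {
    /** One table row: a name plus its weightages on the first two principal components. */
    static final class Scored {
        final String name;
        final double pc1;
        final double pc2;

        Scored(String name, double pc1, double pc2) {
            this.name = name;
            this.pc1 = pc1;
            this.pc2 = pc2;
        }

        /** Assumed definition of "Distance": Euclidean norm of (pc1, pc2). */
        double distance() {
            return Math.hypot(pc1, pc2);
        }
    }

    /** Returns the names ordered from largest to smallest distance (rank 1 first). */
    static List<String> rank(List<Scored> rows) {
        List<Scored> sorted = new ArrayList<>(rows);
        sorted.sort((a, b) -> Double.compare(b.distance(), a.distance()));
        List<String> names = new ArrayList<>();
        for (Scored s : sorted) {
            names.add(s.name);
        }
        return names;
    }

    public static void main(String[] args) {
        // Weightages copied from the table above
        List<Scored> rows = Arrays.asList(
                new Scored("Al Gore", -1.372097, -0.20353522),
                new Scored("Obama", 1.668673, -1.77429135),
                new Scored("Vladimir Putin", -1.532274, -0.08591237),
                new Scored("Francois Hollande", 1.235698, 2.06373893));
        for (String name : rank(rows)) {
            System.out.println(name);
        }
    }
}
```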

    Whatever your niche, you can engage with highly ranked influencers to create/promote brand awareness and foster loyalty by giving your audience exactly what they expect from your brand.

  • 202
    Modelling Imbalanced Target Variable

    What is a model?

    A model represents a real-world scenario with some Epsilon, where Epsilon represents the Error factor.

    Y = f(X) + epsilon

    What is an Imbalanced Target Variable?

    Let us first go through a few real-world examples:

    • Telecom Domain:

    In Telecom, subscribers tend to move frequently from one mobile operator to another for better service or offers. This phenomenon, known as Customer Churn, ranges from 5 to 10%. In order to model this, the entire customer database is coded into 1 – CHURN customers or 0 – ACTIVE customers. Since the number of Active customers far outnumbers the Churn customers, the distribution of the target is not uniform and the data set is called Imbalanced.

    Target Variable (Binary):
    1 = CHURN customers
    0 = ACTIVE customers


    • Healthcare Domain:

    A multi-specialty Hospital wanted to predict whether a patient is prone to Diabetes now or in the near future.  Modern conveniences have resulted in a more sedentary lifestyle globally thus causing an explosion in the rate of Diabetes affliction. Recent studies have shown that close to 92.5% of all the patients were Diabetic or prone to Diabetes, and only 7.5% of the total patients were found to be healthy.

    Target Variable (Binary):
    1 = Patients prone to Diabetes
    0 = Patients without Diabetes symptoms


    What is a Rare Event?

    An event is said to be rare if the number of times it occurs is very low.

    In both the scenarios mentioned above (Telecom and Healthcare), the management was interested in predicting (modelling) CHURN customers and PATIENTS without Diabetes symptoms. These two events are called RARE EVENTs, since their overall presence is relatively small when compared to the other level of the TARGET VARIABLE (Y).

    How will you statistically evaluate whether the Target Variable is imbalanced / skewed?

    Perform a Chi-Square Test as below (here it is evaluated using R, the open-source statistical software).

    Chi-Square Test conducted using R-Software


    Diabetes                   925

    Without Diabetes            75

    Chi-squared test for given probabilities

    Null Hypothesis : Data is uniformly distributed

    Alternative Hypothesis: Data is not uniformly distributed

    data:  Clinical.Test[, 1]

    X-squared = 722.5, df = 1, p-value < 0.00000000000000022


    Chi-Square Test conducted using Minitab

    Chi-Square Goodness-of-Fit Test for Observed Counts in Variable: Count


    Using category names in Disease

    Category Observed Test Proportion Expected Contribution to Chi-Sq
    Y 925 0.5 500 361.25
    N 75 0.5 500 361.25

    N  DF  Chi-Sq  P-Value

    1000   1   722.5    0.000


    As the ‘p-value’ < 0.05 (a commonly chosen Alpha value), we can reject the Null Hypothesis and conclude that the data is not uniformly distributed.
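
    The statistic reported by both tools can be reproduced by hand. The sketch below computes the chi-square goodness-of-fit statistic against equal expected counts for the 925 / 75 split and compares it with 3.841, the standard critical value for 1 degree of freedom at Alpha = 0.05. The class and method names are made up for illustration; it does not compute the p-value itself.

```java
class ChiSquareUniform {
    /** Pearson chi-square statistic against equal expected counts in every category. */
    static double statistic(long[] observed) {
        long total = 0;
        for (long o : observed) {
            total += o;
        }
        double expected = (double) total / observed.length;
        double chiSq = 0.0;
        for (long o : observed) {
            double diff = o - expected;
            chiSq += diff * diff / expected; // contribution of this category
        }
        return chiSq;
    }

    public static void main(String[] args) {
        long[] counts = {925, 75};   // Diabetes, Without Diabetes
        double chiSq = statistic(counts);
        int df = counts.length - 1;
        double critical = 3.841;     // chi-square critical value for df = 1, alpha = 0.05

        System.out.println("X-squared = " + chiSq + ", df = " + df);
        System.out.println(chiSq > critical
                ? "Reject the Null Hypothesis: data is not uniformly distributed"
                : "Fail to reject the Null Hypothesis");
    }
}
```

    Each category contributes (425 × 425) / 500 = 361.25, giving the 722.5 shown in the R and Minitab output.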

    How to overcome this problem?

    This problem can be overcome by two main methods:

    • Sampling methods

    – Over Sampling techniques

    – Under Sampling techniques

    • Algorithms

    – Penalized Likelihood Algorithms


    This blog provides a Macro Level explanation on Imbalanced Targets (Y).  It is very important to employ sound countermeasures against imbalanced targets prior to any modeling activity.

    A detailed post on OVER SAMPLING will be published next.

  • 200
    Pre-Modeling Routines in R
    Data Preparation and Pre-Modeling – What & Why?

    Not having correct and complete data is the most often cited reason for analytics project failures, regardless of Big or Small data. To mitigate the problem, data-driven companies are giving importance to preparing and curating the data, and making it ready for analysis. It is a well-established fact that typically 60-70% of the time in any analytics project is spent on data capture and preparation, and hence robust data management tools are important to drive efficiency and time savings. In a Predictive Modeling environment, data preparation is closely associated with the Pre-modeling phase.

    In addition to creating metadata to describe the data, data preparation tools also perform the following steps:

    • Identify and understand the need for Missing Values

    • Convert Ordinal Data into Indicator Variable (Dummy Variables)

    • Transform data (Original Unit) to meet model assumptions

    • Formulation of Derived Variables from Direct Measures

    The accuracy of the Predictive models ultimately resides with data completeness, correctness, and the algorithms chosen to construct the model.

    Highlights of Serendio’s

    PREMOD package

    Practitioners & Users of R are expected to download multiple packages to perform the full gamut of pre-modeling steps. Our ‘PREMOD’ package brings all the functions in one unified package for ease of use and increased productivity.

    Following are the key functions in our PREMOD package:


    To standardize data (original units) from “X-Scale” to “Z-Scale”.
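
    As a rough illustration of what moving from the “X-Scale” to the “Z-Scale” means, the sketch below subtracts the mean from each value and divides by the standard deviation. It is not the PREMOD function (whose name and signature are not given here); the class name is made up, and the use of the sample standard deviation (n − 1) is an assumption.

```java
class ZScale {
    /** Returns (x - mean) / sd for every value, using the sample standard deviation (n - 1). */
    static double[] standardize(double[] x) {
        int n = x.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two values");
        }
        double sum = 0.0;
        for (double v : x) {
            sum += v;
        }
        double mean = sum / n;

        double sumSquares = 0.0;
        for (double v : x) {
            sumSquares += (v - mean) * (v - mean);
        }
        double sd = Math.sqrt(sumSquares / (n - 1));
        if (sd == 0.0) {
            throw new IllegalArgumentException("all values are identical");
        }

        double[] z = new double[n];
        for (int i = 0; i < n; i++) {
            z[i] = (x[i] - mean) / sd;
        }
        return z;
    }

    public static void main(String[] args) {
        // Mean is 4 and sample sd is 2, so the z-values are -1, 0 and 1
        for (double z : standardize(new double[] {2, 4, 6})) {
            System.out.println(z);
        }
    }
}
```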


    Transformations of Counts & Proportions in order to meet model assumptions like VARIANCE STABILIZATION.

    • Count Transformations

    • Proportion Transformation (*Proportions that were arrived from Count Data)

    Optimal Lambda:

    A value that is required to transform data from Non-Normal to Normal. This is done by raising every value in the data set to the power of Lambda.

    Creating Indicator Variables:

    Converting NOMINAL data into Indicator Variables (*also known as Dummy Variables) in order to perform modeling. (E.g.) Reference Coding & Effect Coding

    Graphical Summary:

    Graphical Summary of Uni-variate data can be performed, which gives Visual Inspections like Histogram, Box-Whisker Plot, Run Chart & Auto-Correlation Chart.

    Mean Absolute Deviation (MAD) & Mean Square Deviation (MSD):

    Calculates MAD & MSD for the specified column. These techniques are widely used in Time-Series analysis.

    Normality Test:

    Tests whether a set of values is Normally Distributed.

    Descriptive Measures:

    Computes Skewness & Kurtosis for a set of values.


    Though IMPUTATION is not given in a form of function, in order to replace the MISSING VALUES in a data set, the set of existing values can be tested for NORMALITY. If data is NORMALLY DISTRIBUTED, replace the missing values with MEAN, else with MEDIAN.

    To learn more and download the PREMOD package, go to

  • 204
    The Art of Big Data Science Hiring…Starts with Training

    Hiring good quality talent for Big Data Science is a challenge regardless of location – Silicon Valley, Bangalore, Beijing…it doesn’t matter. As a Big Data Science startup, we realized unless we took control of the hiring situation, our raison d’etre would be in jeopardy. So we have decided to launch our Training program – a program with emphasis on hands-on programming, taught by our in-house Architects and Developers who will focus on imparting skills we, at Serendio, care about. We did explore numerous options including working with Training institutes. But their focus was always on giving out certificates, based on a very simplistic and often outdated syllabus, more theory less projects, and above all the quality of teaching was very specious.

    How will our training be different?

    The Big Data Science technology stack is fast evolving with new and better techniques coming in every day and we at Serendio create and exploit such techniques on a daily basis. We want to emphasize the Lab experience in Big Data Science training where the focus is on Learning by Experiment. Knowing when to use what technologies is becoming an important trait for Big Data engineers. And our team’s collective experience is going to help in imparting this skill to our potential hires.

    We will be using our soon to be launched Training as a way to filter and identify quality talent for our own hiring needs. Post the training, we will be conducting screening tests and making job offers to those who meet our criteria. We are flipping the model of Hiring first, Training next pursued by major IT companies like Infosys, Cognizant etc. For a start up to indulge in Training the way we plan to do is unheard of…but we are not afraid to take the Hiring bull by its horns. We want to Train and Hire highly motivated Big data engineers to help us grow.

    Are you ready to join our team?
  • 206
    A Storm in a big cup

    All Hindu gods pose with a weapon in their hands. Quite an array of intriguing weapons is used in a variety of wars/battles in Hindu mythology. One such weapon is the trishul (trident – the root tri means the same in both the English and the Sanskrit word, because of the Indo-European heritage of Sanskrit). In order to win the Big Data war you need a Trishul – if Hadoop MapReduce is one prong of this trident and the Spark framework the other, the third one I am going to discuss today is Storm – real-time stream processing on a distributed scale.

    Hadoop MapReduce is all about distributed computing on data on disks/file system. Spark is all about MR programming on data which can reside totally in memory. Storm is about distributed computing on data that is streaming in, probably at a high velocity and volume.

    Storm considers a stream as a never-ending sequence of tuples. While in Hadoop data resides in files across nodes in a cluster, the data for Storm comes from sources of tuples called Spouts. The Spouts can send the stream to tuple processors called Bolts. The Bolts can send the tuples, same or modified, to one or more other Bolts. It is the sequencing of the Bolts and the shuffling of tuples across them that lets you accomplish an analytics computation on the incoming Stream. This combination of Spouts and Bolts is called a Topology in Storm (like a Job in Hadoop/MR).


    One good example to understand MapReduce was the wordcount example. Let’s see how this wordcount example would work in Storm. Spouts would just keep sending sentences. The first stage of Bolts (say SplitBolt) would split each sentence into words. Just like shuffling in Hadoop, you can have Storm key off one of the fields in the tuple to send it to a specific Bolt. So one field of the tuple output by the SplitBolt would be used just like the key in Hadoop/MapReduce, and instances of the same word would be sent to the same Bolt (say CountBolt).
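
    The routing idea can be sketched in plain Java without the Storm API. The sketch below splits sentences into words and sends each word to a counting task chosen from the word’s hash, so every instance of a word lands on the same task. This only illustrates the concept: the class and method names are made up, and the hash-modulo rule is an assumption rather than Storm’s actual internal partitioning.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FieldsGroupingSketch {
    /** Picks the counting task for a word; equal words always get the same index. */
    static int taskFor(String word, int numTasks) {
        return Math.floorMod(word.hashCode(), numTasks);
    }

    /** Splits sentences into words and counts them in per-task maps, one map per counting task. */
    static List<Map<String, Integer>> run(List<String> sentences, int numTasks) {
        List<Map<String, Integer>> tasks = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            tasks.add(new HashMap<>());
        }
        for (String sentence : sentences) {                       // what the spout emits
            for (String word : sentence.trim().split("\\s+")) {   // the split stage
                if (word.isEmpty()) {
                    continue;
                }
                // the count stage: route by the "word" field, then increment
                tasks.get(taskFor(word, numTasks)).merge(word, 1, Integer::sum);
            }
        }
        return tasks;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> tasks =
                run(Arrays.asList("the cow jumped over the moon", "the moon is big"), 3);
        for (int i = 0; i < tasks.size(); i++) {
            System.out.println("count task " + i + ": " + tasks.get(i));
        }
    }
}
```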

    It is easy to visualize parallelism in the Hadoop world, because a large set of data can be split into blocks that are processed individually. In the case of Storm, it is the number of Spout and Bolt instances you specify that provides the parallelism and distributed computing power. Spouts take on the actual load and the Bolts provide the processing power. Most likely, you will use Storm with something like Flume to collect thousands of sources of data and concentrate them into a few sinks which can talk to a Storm spout.

    In Storm, the cluster master is called Nimbus and the worker nodes are called Supervisors. Just like in MapReduce, the Supervisors are configured with slots, i.e., the number of worker processes they can run on a node. You may want to tune this based on the number of cores and network load (vis-a-vis disk I/O for MapReduce). The Nimbus and Supervisors use a Zookeeper cluster for co-ordination.












    In Storm, while a Topology (job) is running, you can change the number of workers and executor threads using the rebalance command. It looks like the number you set in the code is the maximum and you can only set it lower. So I commented out that code and it ran with the max on my machine/supervisor.

            TopologyBuilder builder = new TopologyBuilder();
            // Parallelism hints: 5 spout executors, 8 split executors, 12 count executors
            builder.setSpout("spout", new RandomSentenceSpout(), 5);
            builder.setBolt("split", new SplitSentence(), 8)
                     .shuffleGrouping("spout");   // sentences can go to any split task
            builder.setBolt("count", new WordCount(), 12)
                     .fieldsGrouping("split", new Fields("word"));   // same word, same count task
            Config conf = new Config();
            if (args != null && args.length > 0) {
                System.out.println("Remote Cluster");
                StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
            } else {
                System.out.println("Local Cluster");
                LocalCluster cluster = new LocalCluster();
                cluster.submitTopology("word-count", conf, builder.createTopology());
            }

    If you run your storm-starter example jar packaged using maven like this

    kiru@kiru-N53SV:~/storm-0.8.2$ bin/storm jar ~/storm-starter/target/storm-starter-0.0.1-SNAPSHOT-jar-with-dependencies.jar storm.starter.WordCountTopology word-count
    Running: java -client -Dstorm.options= -Dstorm.home=/home/kiru/storm-0.8.2 -Djava.library.path=/usr/local/lib:/opt/local/lib:/usr/lib -Dstorm.conf.file= -cp /home/kiru/storm-0.8.2/storm-0.8.2.jar:/home/kiru/storm-0.8.2/lib/jetty-6.1.26.jar:

    You will see the following in the Storm UI (default – http://localhost:8080 after you run bin/storm ui)
    Note that we specified the parallelism hints to Storm with 5 for the Spout, 8 for the Split bolt and 12 for the Count bolt.

    Spouts (All time)

    Id Executors Tasks Emitted Transferred Complete latency (ms) Acked Failed Last error
    spout 5 5 760 760 0.000 0 0

    Bolts (All time)

    Id Executors Tasks Emitted Transferred Capacity (last 10m) Execute latency (ms) Executed Process latency (ms) Acked Failed Last error
    count 12 12 4840 0 0.021 0.368 4840 0.361 4820 0
    split 8 8 4880 4880 0.001 0.077 780 8.590 780 0

    And there will be 8 word-count processes running and if you rebalanced it like this –
    kiru@kiru-N53SV:~/storm-0.8.2$ bin/storm rebalance word-count -n 6
    You can see only 6 worker processes running after rebalancing is complete.

    You can for example change the number of Spout executors like this –

    kiru@kiru-N53SV:~/storm-0.8.2$ bin/storm rebalance word-count -e spout=4

    And your Storm UI will report as below. Similarly, you can change the instances of the Split and Count bolts as well.

    Spouts (All time)

    Id Executors Tasks Emitted Transferred Complete latency (ms) Acked Failed Last error
    spout 4 5 0

    Some installation notes – Storm does have a native component, ZeroMQ. So this and its Java binding need to be built on your box for it to work. You also have to run a Zookeeper installation. There is a Java equivalent for ZeroMQ called JeroMQ, but Nathan, the Storm lead, does not want to use it and would rather build one specifically for Storm. This is a good example of a situation where an engineering manager has to make a call between using an Open Source/third-party library and building his own (build-vs-buy): independence over investment in effort.

    So Storm is a good framework for scalable distributed processing, but then, just like in the MR world, you may not have the time/inclination/resources to program at this level. Sure, Storm comes with an API for doing SQL-like processing/aggregation. It is called Trident! Didn’t we talk about tridents earlier? 🙂

  • 273
    A Rose by any other name ..

    ..would smell as sweet, wrote Shakespeare. An Integer in any other form is not the same, is the programmer’s dictum. Well, it is clear an int takes up four bytes in Java, but the same number represented as a string takes up a varying number of bytes depending on how big the number is. So when dealing with data structures, whether on disk or in memory, we consciously use the right datatypes. It is also a requirement if you are computing with the values, since you avoid the conversions from string to the datatype and vice versa. But I ran into an interesting behavior with integers and HashMaps in Java. If you use an Integer as a key, the performance of the Map is better than when using the same integer represented as a String!

    I also compared the performance with Trove, the High Performance Collections for Java library. Surprisingly, the HashMap<Integer, Integer> beats even Trove on Put time. You will surely think twice before using a String key any more! 🙂 (The numbers below are for a million operations, so don’t bother if you are dealing with smaller collections. I used JDK16 for these tests.)

    String hashing Put time:358
    String hashing Get time:1404
    Integer hashing Put time:171
    Integer hashing Get time:57
    THashMap Integer hashing Put time:424
    THashMap Integer hashing Get time:72
    TIntIntHashMap Integer hashing Put time:262
    TIntIntHashMap Integer hashing Get time:55

    But when you are dealing with large collections and performance matters, you really need to check out the Trove library. Over and above the performance of the essential methods, it has some convenience methods which will save CPU time and improve performance. Two such methods that I would like to point out are –

    Integer putIfAbsent(Integer, Integer)
    adjustValue(int key, int amount) // available only in the TIntIntHashMap implementation

    If you have worked with Maps you would immediately recognize the utility value of these two methods. The first saves a containsKey() call and the second would help you with frequency maps, saving a get() call in the process.
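
    Trove is a third-party dependency, so here is a sketch of the same two idioms using only java.util.HashMap. It assumes Java 8 or later, where Map.putIfAbsent() and Map.merge() are available; merge() is used here as a stand-in for the frequency-map use case, and the class and method names are illustrative only.

    ```java
    import java.util.HashMap;
    import java.util.Map;

    public class MapIdioms {

        /** Builds a frequency map with one map call per element instead of get() plus put(). */
        public static Map<Integer, Integer> countFrequencies(int[] values) {
            Map<Integer, Integer> freq = new HashMap<>();
            for (int v : values) {
                // Stores 1 for a new key, otherwise adds 1 to the existing count.
                freq.merge(v, 1, Integer::sum);
            }
            return freq;
        }

        /** Registers a default only if the key is missing, without a containsKey() check. */
        public static int getOrRegister(Map<Integer, Integer> map, int key, int defaultValue) {
            Integer previous = map.putIfAbsent(key, defaultValue);
            return previous != null ? previous : defaultValue;
        }

        public static void main(String[] args) {
            Map<Integer, Integer> freq = countFrequencies(new int[] {7, 3, 7, 7, 3});
            System.out.println(freq.get(7) + " " + freq.get(3)); // prints "3 2"
        }
    }
    ```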

    If you are like me, dealing with integer keys that originated in a database (either an Oracle sequence or a SQL Server identity), HashMap<Integer, *> would be an ideal choice. As always, do not forget to initialize your Maps and Lists with an appropriate initial capacity.

    ArrayList(int initialCapacity)
    Constructs an empty list with the specified initial capacity.

    HashMap(int initialCapacity)
    Constructs an empty HashMap with the specified initial capacity and the default load factor (0.75).

    HashMap(int initialCapacity, float loadFactor)
    Constructs an empty HashMap with the specified initial capacity and load factor.
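
    One detail worth spelling out: a HashMap is rehashed once the number of entries exceeds capacity × load factor, so passing the expected entry count directly as the initial capacity still triggers a resize at 75% full. A small sketch of sizing for the expected count follows; the helper capacityFor() and the class name are hypothetical, not part of the JDK.

    ```java
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class Presizing {

        /** Smallest initial capacity that can hold expectedEntries without a rehash. */
        public static int capacityFor(int expectedEntries, float loadFactor) {
            return (int) Math.ceil(expectedEntries / (double) loadFactor);
        }

        public static void main(String[] args) {
            int expected = 1_000_000;
            // HashMap rounds the requested capacity up to a power of two internally.
            Map<Integer, Integer> map = new HashMap<>(capacityFor(expected, 0.75f));
            // An ArrayList has no load factor, so the expected size is used as-is.
            List<Integer> list = new ArrayList<>(expected);
            for (int i = 0; i < expected; i++) {
                map.put(i, i);
                list.add(i);
            }
            System.out.println(map.size() + " " + list.size());
        }
    }
    ```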

    The test program I used for this is available for download here.

  • 211
    Understanding Big Data workloads (and more)

    Hadoop/MapReduce has jump-started a revolution in large-scale data processing, which earlier was either infeasible or uneconomical. Now it is possible to use the power of commodity hardware to load data onto the disks of a cluster and process it in parallel. MapReduce makes it possible to take the computation to where the data resides. In Hadoop, with HDFS, the data is on disks. A Hadoop job works on HDFS input directories/files and writes its output to HDFS files as well. While developing a Hadoop cluster monitoring/management product for Splunk, I needed a way to simulate workloads on the cluster in our Dev/QA environment. To divide and conquer the problem, I decided to classify the workload as follows –


    1. Category I – Large Input Size/Small Output Size
    2. Category II – Large Input Size/Large Output Size

    I also added a processing-time dimension to each category, long and short durations. So I had to write four simple MR jobs to simulate this. But before that, I was able to simulate Category I loads with a Hive query and Category II with Terasort. Also, outside of these categories, the cluster can be used just as storage, and I used the stock TestDFSIO test program to simulate this case. We were running quite peacefully with our testing and simulation for the product. But one thing that had me concerned was the Hive queries. Even simple queries would take a long time (vis-a-vis MySQL or Oracle, when the dataset size was small). Hive runs a sequence of more than one job for one query. Each job outputs to HDFS, and this becomes the input for the next job in the chain. This repeated round trip to disk takes its toll on performance. This led me to search for a framework that would let me do some processing in memory in a distributed manner – Spark is the answer.

    I thought I would give Spark a try. I cloned the master branch but ran into issues compiling because of a repository issue (which has since been resolved). I then downloaded spark-0.6.2. On an Ubuntu 12.04 box, I had to install typesafe-stack first and then install Scala (instructions on installing the typesafe stack are here). Installing using the Scala deb package caused the libjline-java incompatibility issue. I set SCALA_HOME to /home/ubuntu/.sbt/boot/scala-2.9.2 on my Amazon Ubuntu instance.

    I chose to run the JavaHdfsLR example that comes with Spark to see how it scales. I did not do a major benchmarking exercise, but enough to convince myself that I had a good idea of Spark. I ran Spark in standalone mode – that is Spark-speak for running its own master and worker programs directly on the OS, instead of using something like Mesos or YARN. I made a simple Python script to generate a million rows of data and used this with the example program. I decided to run a Spark master on an Amazon medium-size Ubuntu VM and two small worker VMs (created from my own AMI). The example program completed in 32 seconds with two workers versus 39 seconds with a single worker. Though the speed-up is not linear, I think the benefits will be greater for larger amounts of data.

    Now let’s take a look at the example to see the power of Spark when it comes to the iterative computing common in machine-learning computations.

    public static void main(String[] args) {
      if (args.length < 3) {
        System.err.println("Usage: JavaHdfsLR <master> <file> <iters>");
        System.exit(1);
      }
      JavaSparkContext sc = new JavaSparkContext(args[0], "JavaHdfsLR");
      JavaRDD<String> lines = sc.textFile(args[1], 4);
      // Parse once and cache, so every iteration below reuses the in-memory points
      JavaRDD<DataPoint> points = lines.map(new ParsePoint()).cache();
      int ITERATIONS = Integer.parseInt(args[2]);
      // Initialize w to a random value
      double[] w = new double[D];
      for (int i = 0; i < D; i++) {
        w[i] = 2 * rand.nextDouble() - 1;
      }
      System.out.print("Initial w: ");
      printWeights(w);
      for (int i = 1; i <= ITERATIONS; i++) {
        System.out.println("On iteration " + i);
        // Map each point to its gradient contribution, then sum the vectors
        double[] gradient = points.map(
          new ComputeGradient(w)
        ).reduce(new VectorSum());
        for (int j = 0; j < D; j++) {
          w[j] -= gradient[j];
        }
      }
      System.out.print("Final w: ");
      printWeights(w);
    }

    Note that the data is loaded and cached outside the for-loop. This would not be possible with pure MR programming. If the load were done inside the for-loop, the run would take about 1m34s (for a single worker). If you are using Hive, which translates a query into more than one job, each job writes its output to HDFS and the next job reads it back from HDFS. Imagine if each job instead wrote to an in-memory RDD and the next one read that as its input. Yes, that is possible with Shark, a Hive-like implementation running on Spark.
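
    To make the per-iteration work concrete, here is a single-machine, plain-Java sketch of what the ComputeGradient and VectorSum steps add up to. It is not Spark's code: it uses the standard logistic-regression gradient, assumes labels of +1/-1, and the class and method names are illustrative. The point to notice is that xs and ys are built once and reused by every iteration, which is what cache() buys you on a cluster.

    ```java
    public class LocalLogisticRegression {

        /** Sum over all points of (1 / (1 + exp(-y * w.x)) - 1) * y * x. */
        public static double[] gradient(double[][] xs, double[] ys, double[] w) {
            double[] sum = new double[w.length];
            for (int p = 0; p < xs.length; p++) {
                double dot = 0.0;
                for (int j = 0; j < w.length; j++) {
                    dot += w[j] * xs[p][j];
                }
                double scale = (1.0 / (1.0 + Math.exp(-ys[p] * dot)) - 1.0) * ys[p];
                for (int j = 0; j < w.length; j++) {
                    sum[j] += scale * xs[p][j];
                }
            }
            return sum;
        }

        /** Runs the update w -= gradient for the given number of iterations over the same data. */
        public static double[] train(double[][] xs, double[] ys, double[] w0, int iterations) {
            double[] w = w0.clone();
            for (int i = 0; i < iterations; i++) {
                double[] g = gradient(xs, ys, w);
                for (int j = 0; j < w.length; j++) {
                    w[j] -= g[j];
                }
            }
            return w;
        }

        public static void main(String[] args) {
            double[][] xs = {{1.0, 2.0}, {-1.0, -2.0}};
            double[] ys = {1.0, -1.0};
            double[] w = train(xs, ys, new double[] {0.0, 0.0}, 5);
            System.out.println(java.util.Arrays.toString(w));
        }
    }
    ```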

    I think the power of Big Data will come from mixing and matching HDFS with other frameworks like Spark, enabling machine learning on scales of data that were not possible earlier. Remember, simple algorithms on large-scale data are better than complex algorithms on a small set of data.

  • 214
    Using the Cloudera Manager API with Splunk

    One of the most popular distributions of Hadoop is from Cloudera. Cloudera provides a management/monitoring tool when you buy their Enterprise Edition/support contract. With this tool, you can create Hadoop clusters and monitor/manage them. One of the popular monitoring tools in the commercial world is Splunk. Splunk provides an app, HadoopOps (which I helped develop), for monitoring a Hadoop cluster. This is a fantastic tool and provides very nice cluster visualization. Using HadoopOps requires Splunk Forwarders to be set up on each node in your Hadoop cluster(s). If for any reason you do not want to do this, but still want to see the status of your Cloudera Manager-managed clusters in Splunk, there is a way to do it: the Cloudera Manager API comes to your rescue. Splunk continues to be your Single Pane of Glass (SPOG) to your IT world while Cloudera happily manages the Hadoop cluster.

    Inspired by this blog by a Cloudera Manager development manager and by my familiarity with HadoopOps, I thought I would write a small Splunk app to accomplish this. See below a screenshot of a panel from my app.

    [Screenshot: a panel from the Splunk app]

    It was a fun little project. See below my simple Python code to get the cluster status from Cloudera Manager. Please note that the CM userid/password is hardcoded in the script; you will want to do this in a better way. Also, it should be possible to use the CM Python egg file from a directory under the Splunk app itself. Splunk comes with its own version of Python, and extending sys.path (sys.path.append) is much cleaner than having every Splunk app install its own Python packages into Splunk’s Python installation.

    import os
    import sys
    from cm_api.api_client import ApiResource

    api = ApiResource('localhost', username='admin', password='admin')
    myclusters = api.get_all_clusters()
    for cluster in myclusters:
        thiscluster = api.get_cluster(
        hosts = api.get_all_hosts()
        services = thiscluster.get_all_services()
        for s in services:
            roles = s.get_all_roles()
            for r in roles:
                print,, r.type, r.hostRef.hostId, r.healthSummary
                # simulation code – adding fake datanodes/tasktrackers
                if (r.type == 'DATANODE' or r.type == 'TASKTRACKER'):
                    for i in 1,2,3,4:
                        print,, r.type, r.hostRef.hostId+str(i), r.healthSummary
                    for i in 5,6:
                        print,, r.type, r.hostRef.hostId+str(i), 'BAD'

    I think the Cloudera folks have done a great job in providing the API in Python (my bias for Java aside), as it is good for scripting and happens to be the natural language for integration with Splunk (which runs its UI in a Python app server). This app is pretty lightweight and can be run from any host that has network access to the Cloudera Manager machine. You can also tune the frequency at which it runs, and you can enable alerts in Splunk whenever a host goes bad.

    I will upload my whole app later. Meanwhile you can check out the macros.conf and savedsearches.conf here.

    The main challenge I faced in development was getting Cloudera Manager to work properly on my Ubuntu 12.04 laptop, specifically getting DNS configured properly. See the advice from Cloudera on this – run host -v -t A `hostname` and make sure that the hostname matches the output of the hostname command, and has the same IP address as reported by ifconfig for eth0.
    In my case, I had to disable the dnsmasq that comes as part of the NetworkManager package and install dnsmasq separately, as mentioned in this helpful blog post. (You also need to have password-less sudo and ssh configured correctly. See Cloudera requirements here.)

    I think using Splunk along with any bundled management/monitoring capabilities of any product provides a more comprehensive IT monitoring capability and provides a better ROI on your Splunk investment – leveraging the product expertise of the specific product vendor and the specialized IT/monitoring functionality/framework in Splunk.

    Go Splunking.


  • 217
    My Take on Cloud Computing – Part II

    No brainer: this is what any SAAS industry person would say about why a customer should buy a SAAS product ..actually a service ..over on-premise software. Still, the adoption of SAAS is in its early stages. Why then is the uptake not that aggressive? A variety of reasons is the answer ..

    IT establishments have long known the perils and pitfalls of buying software and the requisite hardware and maintaining these 24×7. Appliances eased this to a great degree. Still, there is a need for some sort of server/data center facility, with power, A/C and networking. But you could put the box in, get your admin console and get started using the product. Upgrades were automatic or helped along by the vendor. But hardware does go obsolete, and that has to be factored in. Virtual appliances that run on commodity hardware address this to a great extent. For example, Java appliance vendor Azul now sells a virtual appliance/Java VM that can run on bare metal or any hypervisor.

    Be it appliances or on-premise software, commercial establishments still like to have control over their application and, more specifically, their data. The fact that the application and data are hosted somewhere else makes many executives nervous. It is quite possible that this fear is heightened by IT managers who are used to being the custodians of the company’s software machinery. Once SAAS companies explain multi-tenancy and data center security standards, this fear recedes to the background. But then application customization comes to the fore. Many companies do allow a great deal of customization, but probably not to the same extent as on-premise software. Some, like Salesforce, even allow uploading data from other sources and creating reports that join the application data with the uploaded data.

    SAAS companies need to navigate these sensitive issues before making the deal. A quick proof-of-concept, with the application customized to some extent for the customer, goes a long way in convincing the customer. The customer’s IT can continue to be a partner in the process, uploading user or other raw data and downloading application data to their data warehouse or for integration with other systems. There is still security and application administration to be done by the customer’s IT – i.e. IT can now focus on IT, instead of cutting software purchase/maintenance deals. No more scheduling downtime to install or upgrade software/hardware/appliances. There is no need to have engineers on call over the weekend or at odd times to keep the system up and running.

    (If you are thinking about soft appliances on the Cloud/IAAS instead of a full-fledged SAAS ..hold that thought, I will share my views on it after this.)

    Surely, we can see the tide turning in favor of SAAS. Still, the major reason SAAS would be an easy sell is that there is no “Capital Expenditure”. You just pay for the service as you go. Maybe you can get a deal by paying upfront for a few years. This is the bottom line, folks..

    So if customers are convinced, or can be convinced, to buy a SAAS product/service, why aren’t all software companies building or converting to SAAS? I will explore that in my next blog post.

