Sunday, May 31, 2015

Hadoop Installation Modes

In my previous blog post we read about Big Data, and we also discussed some of the vendors who provide Hadoop distributions for Big Data analysis and computation. The one that is ideal for evaluating Hadoop is the Apache Hadoop distribution itself.

Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the Hadoop Distributed File System are designed so that node failures are automatically handled by the framework.
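To make the Map/Reduce idea concrete, below is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API. It follows the word-count example from the Apache Hadoop tutorial and assumes a Hadoop 2.x client library; the input and output paths are placeholders supplied on the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: each fragment of the input is processed independently;
      // every word found is emitted with a count of 1.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce phase: all counts emitted for the same word arrive together and are summed.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // optional local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory (placeholder)
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory, must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The framework splits the input into fragments, schedules the map and reduce tasks across the cluster, and re-executes a task on another node if the one running it fails, exactly as described above. Whether the input and output paths refer to the local file system or to HDFS depends on the installation mode, which is what we look at next.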

Before we install Hadoop for evaluation, we also need to understand that Hadoop can be installed in 3 different modes, depending on the requirement:
  1. Standalone Mode (Single Node Cluster): None of the Hadoop daemons run as separate processes; the entire job runs inside a single JVM process. It uses local storage instead of a DFS (Distributed File System). This mode is generally used for testing and debugging purposes.
  2. Pseudo-Distributed Mode (Single Node Cluster): Each Hadoop daemon runs in a separate JVM process, so there will be multiple JVM processes running on the same machine or node. This is used to mimic a fully distributed Hadoop setup. It is good for testing and prototyping, but it should not be used in production as it will lead to performance issues.
  3. Fully Distributed Mode (Multi Node Cluster): In this mode the daemons are spread across several commodity machines and the DFS is leveraged, so that multiple machines contribute processing power and storage to the cluster. This is the mode used in production (a small sketch after this list shows how the choice of mode appears on the client side).
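Here is that sketch, using the Hadoop Configuration API. It is a minimal illustration assuming a Hadoop 2.x client library on the classpath; the hdfs://localhost:9000 address in the comments is only an example value, and the real address is whatever core-site.xml sets.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class ModeCheck {
      public static void main(String[] args) throws Exception {
        // Loads core-default.xml plus any core-site.xml found on the classpath.
        Configuration conf = new Configuration();

        // In standalone mode fs.defaultFS keeps its default value, file:///,
        // so Hadoop reads and writes the local file system directly.
        // In pseudo-distributed mode core-site.xml points it at an HDFS instance
        // on the same machine (for example hdfs://localhost:9000), and in fully
        // distributed mode it points at the NameNode of the multi-node cluster.
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS", "file:///"));

        FileSystem fs = FileSystem.get(conf);
        System.out.println("File system in use: " + fs.getUri());
      }
    }

Which daemons are actually running behind that file system, and on how many machines, is exactly what distinguishes the three modes.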
We will see the detailed installation and configuration steps in my next post.

Wednesday, May 20, 2015

What is Big Data ?

This has been a buzzword (along with cloud computing) in the IT ecosystem.

If you google "Big Data", you end up with several definitions. But interestingly, you will find that most of the definitions are derived from a whitepaper published in 2001 by analyst Doug Laney (then at META Group, which was later acquired by Gartner) defining the 3-D data model. As per the paper, traditional data management techniques were no longer enough to meet the needs and a change in approach was called for, the reason being the changing nature of data volume, data velocity and data variety.

Now, putting that into a definition, Big Data is simply data that is too large and complex to be handled by the standard / traditional database technologies used by organisations. There are certain attributes which are assessed to decide whether data qualifies as "big". The attributes are below:

  1. Volume : As per the definition, this is the huge volume of data generated in today's world. Some examples are social networking sites like Facebook and Twitter, and websites like YouTube, where millions of users around the globe are active on a daily basis. "From 2005 to 2020, the measure of all data created, replicated and consumed in a single year will grow by a factor of 300, from 130 exabytes to 40,000 exabytes, or 40 trillion gigabytes (more than 5,200 gigabytes for every man, woman, and child in 2020). From now until 2020, the digital universe will about double every two years" - from The Digital Universe in 2020 whitepaper by EMC. Big Data deals with volumes in the order of terabytes, petabytes, exabytes and zettabytes.
  2. Velocity : The velocity of data is the speed at which it is being created, stored and analysed. We read above about the examples of sites which are creating high volumes of data. With the increased usage of these platforms, and the creation of similar platforms in the future, I think it is a fair call to say that the speed at which data is being created is almost unimaginable and hard to measure. Going by the official statistics of YouTube alone, 300 hours of video are uploaded to YouTube every minute, and the official statistics of Facebook report 936 million daily active users on average for March 2015. Big Data also deals with performing analytics on the huge data being generated at this pace, often in real time, which requires an immediate response.
  3. Variety : Earlier, the data to be managed was mostly structured, but with changing requirements we needed a system which can accommodate data of all kinds, i.e. structured and unstructured data in different formats. Big Data addresses that requirement.
We discussed the 3 key V's which define "Big Data". But with further implementation and usage of Big Data, additional V's were added by different organisations, and these are worth considering if the result of your Big Data processes is critical to you and your business. However, I feel these additional V's do not define Big Data; they are more about what you desire to have when you plan a Big Data implementation.

The 4 additional V's are as below:
  1. Veracity : We discussed "Big Data" above, meaning a variety of data in large volumes which is complex and is generated and accessed in real time. This additional V talks about the truthfulness and accuracy of the data. On structured, traditional databases we could assess the integrity of the data by applying normalisation techniques. With Big Data we use different tools and algorithms which can assess whether the data is meaningful and true for the business, which is also important for reducing the cost of storing and processing data that is of no use.
  2. Validity : This is quite close to what I just mentioned above, Veracity. However, this is more about the quality and accuracy of the data with regard to its intended usage. Data may have no veracity issues, yet it might not be valid if it is not properly understood. Conversely, the same set of data might be valid for one application but not for another; hence it is important to verify, to some extent, the relationships between the elements of data we are dealing with. It is more about drawing logical inferences from the matching data.
  3. Value : This refers to the value added by the variety of data in large volumes which is complex and is generated and accessed in real time. The value of the data must exceed the cost of owning and managing it. Data alone is not valuable; the value is actually in how we turn the data into meaningful information and knowledge which assists organisations that rely heavily on data analysis for decision making. Big Data adds value by providing complex, advanced, predictive business analysis and insights.
  4. Visibility : Some definitions replace this V with Visualisation, which describes more or less the same aspect: the variety of data in large volumes which is complex and is generated and accessed in real time should be visible, or should be able to be visualised using appropriate tools; otherwise, data from disparate sources that sits in a Big Data store but is not visible is not of any use.
You can find an open-source distribution from Apache to implement and test Big Data in your environment. There are some other vendors as well for Big Data distributions; to name a few, Cloudera and MapR. Some key data management vendors include DataStax and IBM.

I hope the above gives you some idea of what "Big Data" is all about. This being one of my biggest posts, I will coin it my "Big Post" ;-).

I plan to write more about the architecture and how to set up your own environment using the Apache Hadoop distribution in the coming weeks; your feedback / ideas are appreciated.

So, is your data "Big" enough?... Thanks for reading the post.
