What is Big Data?

Along with cloud computing, this has been a buzzword in the IT ecosystem.

If you google "Big Data", you end up with several definitions. Interestingly, you will find that most of them derive from a Gartner whitepaper published in 2001 by Doug Laney, defining the 3-D data model. As per the paper, traditional data management techniques were no longer enough to meet the needs of the time, and a change in approach was called for. The reason was the changing nature of data volume, data velocity and data variety.

Now, putting that into a definition: Big Data is simply data that is too large and complex to be handled by the standard database technologies used by organisations. Certain attributes are assessed to define data as "big". The attributes are below:

  1. Volume: This is the huge volume of data generated in today's world. Examples are social networking sites such as Facebook and Twitter, and websites such as YouTube, where millions of users around the globe are active on a daily basis. "From 2005 to 2020, the measure of all data created, replicated and consumed in a single year will grow by a factor of 300, from 130 exabytes to 40,000 exabytes, or 40 trillion gigabytes (more than 5,200 gigabytes for every man, woman, and child in 2020). From now until 2020, the digital universe will about double every two years" - from The Digital Universe in 2020 whitepaper by EMC. Big Data deals in volumes in the range of terabytes, petabytes, exabytes and zettabytes.
  2. Velocity: Velocity is the speed at which data is created, stored and analysed. We have already seen examples of sites that create high volumes of data. With increasing usage of these platforms and the creation of similar platforms in the future, it is fair to say that the rate at which data is being created is hard to even measure. YouTube's official statistics alone report that 300 hours of video are uploaded every minute, and Facebook's official statistics report 936 million daily active users on average for March 2015. Big Data also deals with performing analytics on data generated at this pace, often in real time, where an immediate response is required.
  3. Variety: Earlier, the data to be managed was mostly structured, but with changing requirements we needed a system that can fit data of all kinds, i.e. structured and unstructured, in different formats. Big Data deals with that requirement.
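As a quick sanity check on the EMC figures quoted above, the implied doubling period can be computed directly. This is a minimal Python sketch; the numbers come straight from the quote, nothing else is assumed:

```python
import math

# Figures quoted from EMC's "The Digital Universe in 2020" whitepaper:
# 130 exabytes in 2005 growing to 40,000 exabytes (40 zettabytes) by 2020.
start_eb, end_eb = 130, 40_000
years = 2020 - 2005

growth_factor = end_eb / start_eb      # ~308x, in line with the "factor of 300" claim
doublings = math.log2(growth_factor)   # number of times the data volume doubled
doubling_period = years / doublings    # ~1.8 years, i.e. "about every two years"

print(f"growth factor:   {growth_factor:.0f}x")
print(f"doubling period: {doubling_period:.1f} years")
```

The arithmetic confirms the quote is internally consistent: a roughly 300-fold increase over 15 years is the same thing as doubling about every two years.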

We discussed the three key V's that define "Big Data". With further implementations and wider usage of Big Data, additional V's have been added by different organisations, and they are worth considering if the result of your Big Data processes is critical to you and your business. However, I feel these additional V's do not define Big Data; they are more about what you desire to have when you plan a Big Data implementation.

The 4 additional V's are:
  1. Veracity: We discussed "Big Data" above as a variety of data in large volumes, complex, generated and accessed in real time. This additional V is about the truthfulness and accuracy of that data. On structured, traditional databases we could assess the integrity of the data by applying normalisation techniques. With Big Data we use different tools and algorithms to assess whether the data is meaningful and true for the business, which also helps reduce the cost of storing and processing data that is of no use.
  2. Validity: This is quite close to veracity, mentioned above. However, it relates more to the quality and accuracy of the data with regard to its intended usage. Data may have no veracity issues but still not be valid if it is not properly understood. Conversely, the same set of data might be valid for one application but not for another, so it is important to verify, to some extent, the relationships between the elements of data we are dealing with. It is about drawing logical inferences from matching data.
  3. Value: This refers to the value added by the data. The value of the data must exceed the cost of owning and managing it. Data alone is not valuable; the value lies in how we turn data into meaningful information and knowledge that assists organisations which rely heavily on data analysis for decision making. Big Data adds value by providing complex, advanced, predictive business analysis and insights.
  4. Visibility: Some definitions replace this V with Visualisation, which covers more or less the same aspect: the data should be visible, or able to be visualised using tools; otherwise data pulled from disparate sources into a Big Data platform but never made visible is of no use.
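To make the distinction between veracity and validity concrete, here is a toy Python sketch. The records, field names and rules are invented purely for illustration; the point is that a record can pass a truthfulness check yet still be invalid for a particular use:

```python
# Illustrative only: a toy veracity/validity filter over a batch of records.
records = [
    {"user": "alice", "age": 34, "country": "IN"},
    {"user": "bob", "age": -5, "country": "IN"},  # fails veracity: impossible age
    {"user": "carol", "age": 28},                 # fails validity: missing a field the use case needs
]

def is_veracious(rec):
    # Truthfulness: the values must be plausible on their own.
    return isinstance(rec.get("age"), int) and 0 <= rec["age"] <= 120

def is_valid_for(rec, required_fields):
    # Validity is relative to the intended use: the same record can be
    # valid for one application and invalid for another.
    return all(f in rec for f in required_fields)

usable = [r for r in records
          if is_veracious(r) and is_valid_for(r, {"user", "age", "country"})]
print(usable)  # only alice survives both checks
```

Real Big Data pipelines apply the same idea at scale, with far richer rules, but the two questions stay separate: is the data true, and is it fit for this particular purpose?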

You can find an open-source distribution from Apache (Apache Hadoop) to implement and test Big Data in your environment. There are other vendors of Big Data distributions as well, such as Cloudera and MapR. Some key data management vendors include DataStax and IBM.

I hope the above gives you some idea of what "Big Data" is all about. This being one of the biggest posts of mine, I will coin it my "Big Post" ;-).

I plan to write more about the architecture and how to set up your own environment using the Apache Hadoop distribution in the coming weeks; your feedback and ideas are appreciated.

So, is your data "Big" enough?... Thanks for reading the post.