In my previous blog post we read about Big Data and discussed some of the vendors that provide Hadoop distributions for Big Data analysis and computation. The one that is ideal for us to evaluate Hadoop with is Apache Hadoop.
Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the Hadoop Distributed File System are designed so that node failures are automatically handled by the framework.
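To make the "many small fragments of work" idea a little more concrete, here is a minimal sketch of the map, shuffle and reduce steps for a word count, written in plain Python. It does not use Hadoop at all: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and everything runs in one process, whereas on a real cluster the framework would spread these calls across nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, the step that sits between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values collected for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    # Each line is an independent fragment of work, so the map calls
    # do not depend on each other and could run on different nodes.
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```

Because each map call only looks at its own line and each reduce call only looks at its own key, any single call can be re-run on another node after a failure without affecting the rest of the job.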
Before we install Hadoop for evaluation, we also need to understand that Hadoop can be installed in three different modes, depending on the requirement:
- Standalone Mode (Single Node Cluster): Hadoop runs as a single JVM process, with no separate daemons started. It uses the local file system instead of a DFS (Distributed File System). This mode is generally used for testing and debugging purposes.
- Pseudo-Distributed Mode (Single Node Cluster): Each Hadoop daemon runs in a separate JVM process, so there are multiple JVM processes running on the same machine or node. This is used to mimic a fully distributed Hadoop setup. It is good for testing and prototyping, but it should not be used in production, as running everything on one machine will lead to performance issues.
- Fully Distributed Mode (Multi Node Cluster): In this mode, the daemons are spread across several commodity machines and the DFS is leveraged. This allows multiple machines to contribute processing power and storage to the cluster.
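In practice, much of the difference between the three modes shows up in a handful of configuration properties. The sketch below is only an illustration, assuming the Hadoop 2.x (or later) property names `fs.defaultFS` and `dfs.replication`. The `SETTINGS` table and the `render_site_file` helper are made up for this post, `namenode.example.com` is a placeholder host, and port 9000 is just a commonly used example value.

```python
from typing import Dict
from xml.sax.saxutils import escape

# Illustrative values only, keyed by mode and then by config file.
# Standalone mode needs no overrides: the defaults use the local file system.
SETTINGS: Dict[str, Dict[str, Dict[str, str]]] = {
    "standalone": {},
    "pseudo-distributed": {
        "core-site.xml": {"fs.defaultFS": "hdfs://localhost:9000"},
        "hdfs-site.xml": {"dfs.replication": "1"},  # only one DataNode exists
    },
    "fully-distributed": {
        # Placeholder host name, replace with the real NameNode address.
        "core-site.xml": {"fs.defaultFS": "hdfs://namenode.example.com:9000"},
        "hdfs-site.xml": {"dfs.replication": "3"},
    },
}


def render_site_file(properties: Dict[str, str]) -> str:
    """Render a dict of properties in the <configuration> XML layout."""
    lines = ["<configuration>"]
    for name, value in properties.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(value)}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


if __name__ == "__main__":
    for file_name, props in SETTINGS["pseudo-distributed"].items():
        print(f"--- {file_name} ---")
        print(render_site_file(props))
```

The point to take away is that pseudo-distributed and fully distributed modes both point the default file system at HDFS, while standalone mode leaves it on the local disk. The exact property names and values should be checked against the documentation for the Hadoop version being installed.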