
Hadoop 1 vs Hadoop 2




HDFS (Hadoop Distributed File System) allows storing far more data, on the order of petabytes, than is possible on a single machine. Petabytes of raw data are of little use on their own, because it is difficult to act on such huge volumes directly. The raw data needs to be converted (using data mining, machine learning, analytics) into a much smaller body of information on which action can be taken.

Such processing might take days or months on a single machine, depending on the size of the data and the processing logic involved. This is one of the main reasons for processing subsets of the data on multiple machines in parallel.
While processing data in parallel, there are a lot of challenges: a machine or a process going down, network failures and so on. In addition, the input task has to be split into smaller tasks and assigned to multiple machines in the cluster, and the partial output from those machines has to be consolidated in some fashion. All these challenges are addressed by distributed computing models like MapReduce (MR), Bulk Synchronous Parallel (BSP) and MPI (Message Passing Interface). Hadoop 1 provides an environment for executing MR programs, as in the sketch below.
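To make the model concrete, here is a minimal sketch of an MR program, the classic word count, written against Hadoop's Java MapReduce API; the class names and the input/output paths passed on the command line are illustrative only. Each mapper processes one split of the input in parallel and emits partial results, which the framework groups by key and hands to the reducers for consolidation.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: each mapper works on one split (subset) of the input in
  // parallel and emits (word, 1) pairs as partial output.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: the framework groups the partial outputs by key, and the
  // reducer consolidates them into the final counts.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```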




Every model has some deficiencies, which is why new models and different flavours of existing models keep emerging. Also, the appropriate model has to be picked based on the processing requirement. For example, for iterative processing, as in the case of Machine Learning, MR is not a good pick because the intermediate data is persisted to disk between iterations. Spark, which is based on the concept of RDDs (Resilient Distributed Datasets), is much more suitable for implementing Machine Learning algorithms.
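As a rough illustration of the difference, the sketch below uses Spark's Java API to cache a small dataset in memory as an RDD and run several passes over it; the dataset, the damped update rule and the iteration count are made up purely for the example. In MR, every such pass would read its input from and write its output to disk.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterativeSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("iterative-sketch").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Load the data once and keep it cached in memory as an RDD;
    // every iteration below reuses it without touching the disk.
    List<Double> sample = Arrays.asList(1.0, 2.0, 3.0, 4.0, 5.0);
    JavaRDD<Double> points = sc.parallelize(sample).cache();

    double estimate = 0.0;
    for (int i = 0; i < 10; i++) {
      final double current = estimate;
      // Each pass is a distributed computation over the cached RDD:
      // a damped update that moves the estimate toward the mean.
      double correction = points.map(p -> p - current).reduce(Double::sum) / points.count();
      estimate = current + 0.5 * correction;
    }
    System.out.println("Estimate after 10 iterations: " + estimate);

    sc.stop();
  }
}
```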

To summarize, multiple distributed computing models coexist, both to solve different problems and to address the shortcomings of existing models. Hadoop 1 allows us to write programs only in MR, while Hadoop 2, with its resource management framework YARN (Yet Another Resource Negotiator), allows programs to be written against multiple distributed computing models.