Apache MapReduce and HDFS
Apache Hadoop originated from ideas published in Google’s white papers:
1. Apache HDFS => derived from => Google File System (GFS).
2. Apache MapReduce => derived from => Google MapReduce.
3. Apache HBase => derived from => Google BigTable.
HDFS and MapReduce are the two main components of Hadoop: HDFS provides the storage infrastructure and MapReduce provides the ‘Programming’ aspect. Though HDFS is at present a subproject of Apache Hadoop, it was originally developed as infrastructure for the Apache Nutch web search engine project.
To understand how Hadoop scales from a one-node cluster to a thousand-node cluster, we first need to understand Hadoop’s file system, that is, HDFS (Hadoop Distributed File System).
HDFS (Hadoop Distributed File System):
HDFS is a distributed and scalable file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
Even though HDFS has many similarities with traditional distributed file systems, it differs from them in some notable ways. The assumptions and goals behind its design are listed below.
Assumptions behind HDFS:
1. Large Data Sets:
HDFS is best suited to large data sets. It is underused if it is deployed to process many small data sets of a few megabytes, or even a few gigabytes, each. The architecture of HDFS is designed to store and retrieve huge amounts of data.
2. Streaming Data Access:
Streaming data access is extremely important in HDFS, because HDFS is designed more for batch processing than for interactive use. The emphasis is on high throughput of data access rather than low latency. HDFS focuses less on how the data is stored and more on how to retrieve it at the highest possible speed, especially when analyzing logs. In HDFS, the time taken to read the complete data set matters more than the time taken to fetch a single record.
3. Write Once, Read Many (WORM) Model:
HDFS follows the write-once, read-many approach for its files and applications. It assumes that a file, once written to HDFS, will not be modified, though it can be read any number of times. At present, HDFS strictly allows only one writer per file at any time. This assumption enables high-throughput data access and also simplifies data coherency issues.
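The write-once, single-writer rule can be illustrated with a small Python sketch. This is a toy model rather than HDFS code: the class `WriteOnceFile`, its methods, and the client names are hypothetical and exist only to show the behaviour described above.

```python
class WriteOnceFile:
    """Toy file: one writer at a time, no changes once it has been closed."""

    def __init__(self):
        self._chunks = []      # data written so far
        self._writer = None    # client currently allowed to write
        self._sealed = False   # True once the writer has closed the file

    def open_for_write(self, client):
        if self._sealed:
            raise PermissionError("file is sealed and cannot be modified")
        if self._writer is not None:
            raise PermissionError("another client is already writing")
        self._writer = client

    def write(self, client, data):
        if self._writer is None or client != self._writer:
            raise PermissionError(f"{client} is not the current writer")
        self._chunks.append(data)

    def close(self, client):
        if self._writer is None or client != self._writer:
            raise PermissionError(f"{client} is not the current writer")
        self._writer = None
        self._sealed = True    # from now on the file is read-only

    def read(self):
        # Any number of readers may call this, any number of times.
        return b"".join(self._chunks)
```

Because the contents never change after `close`, every reader sees the same bytes, which is why the model simplifies data coherency.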
4. Data Replication and Fault Tolerance:
Redundancy is a key feature of HDFS. It is assumed that hardware is bound to fail at some point, and such failures would otherwise disrupt the smooth and quick processing of large volumes of data. To overcome this obstacle, HDFS divides files into large blocks and, by default, stores each block on three nodes: two on the same rack and one on a different rack, for fault tolerance.
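The block-and-replica idea can be sketched in a few lines of Python. This is a simplified illustration, not the actual HDFS placement code: the 128 MB block size, the rack and node names, and the functions `split_into_blocks` and `place_replicas` are assumptions made for the example. The placement rule simply mirrors the "two on one rack, one on another" description.

```python
BLOCK_SIZE_MB = 128  # assumed block size, for illustration only


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file would be split into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])


def place_replicas(block_id, racks):
    """Pick three nodes for a block: two on one rack, one on a different rack.

    racks maps a rack name to the list of node names on that rack.
    """
    rack_names = sorted(racks)
    if len(rack_names) < 2:
        raise ValueError("need at least two racks")
    primary = racks[rack_names[block_id % len(rack_names)]]
    other = racks[rack_names[(block_id + 1) % len(rack_names)]]
    if len(primary) < 2 or not other:
        raise ValueError("need two nodes on one rack and one on another")
    i = block_id % len(primary)
    # Two distinct nodes on the primary rack, one node on the other rack.
    return [primary[i], primary[(i + 1) % len(primary)], other[block_id % len(other)]]


racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
for block_id, size in enumerate(split_into_blocks(300)):
    print(block_id, size, place_replicas(block_id, racks))
```

With this layout, losing any single node, or even a whole rack, still leaves at least one copy of every block.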
5. Commodity Hardware:
HDFS clusters run on commodity hardware: inexpensive, ordinary machines rather than high-availability systems. A great feature of Hadoop is that it can be installed on any average commodity machine. We don’t need supercomputers or high-end hardware to work with Hadoop, which reduces the overall cost considerably.
6. Moving Computation is better than Moving Data:
In HDFS, the code that needs to run on a data set is moved to the machine that holds the data, rather than the data being moved to the code. The major advantages of this model are reduced network congestion and increased overall throughput of the system. The assumption is that it is often better to run a computation close to where its data is located than to move the data to where the application is running. To facilitate this, HDFS provides interfaces for applications to move themselves closer to where the data is located.
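A minimal sketch of this idea, assuming a scheduler that knows which nodes hold a replica of each block: the function `pick_node` and the node names below are hypothetical, and real Hadoop schedulers take more factors into account.

```python
def pick_node(block_locations, free_nodes):
    """Choose where to run the task that processes one block.

    block_locations: nodes that hold a replica of the block.
    free_nodes: nodes that currently have spare capacity.
    Returns (node, is_local).
    """
    for node in block_locations:
        if node in free_nodes:
            return node, True       # the computation moves to the data
    if not free_nodes:
        raise RuntimeError("no free node available")
    return free_nodes[0], False     # fallback: the block must cross the network


print(pick_node(["n1", "n2", "n3"], ["n4", "n2"]))
print(pick_node(["n1"], ["n4"]))
```

In the first call the task runs on a node that already stores the block, so no block data has to be transferred. In the second call no replica holder is free, so the data has to be read remotely.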
7. High Throughput:
Throughput is the amount of work done per unit of time. It describes how fast data can be accessed from the system and is commonly used to measure system performance. In HDFS, when we want to perform a task, the work is divided and shared among different machines. All the machines execute their assigned tasks independently and in parallel, so the work completes much sooner than it would on a single machine. In this way, HDFS delivers good throughput: by reading data in parallel, we greatly reduce the actual time needed to read it.
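A back-of-the-envelope calculation shows why parallel reads matter. The figures used here (a 1 TB data set, 100 MB/s per disk, 100 nodes) are assumed round numbers, not measurements, and the model ignores network and coordination overhead.

```python
def read_time_seconds(data_mb, disk_mb_per_s, nodes=1):
    """Idealised time to scan data_mb spread evenly over `nodes` disks."""
    return data_mb / (disk_mb_per_s * nodes)


one_tb_mb = 1_000_000                                 # 1 TB in MB (decimal units)
single = read_time_seconds(one_tb_mb, 100)            # one disk
parallel = read_time_seconds(one_tb_mb, 100, 100)     # 100 disks in parallel
print(f"single node: {single / 3600:.1f} h, 100 nodes: {parallel:.0f} s")
```

Under these assumptions, a scan that takes close to three hours on one machine takes about 100 seconds when the same data is read from 100 machines at once.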
8. File System Namespace:
HDFS follows a traditional hierarchical file organization, in which a user or an application can create directories and store files inside them. The HDFS namespace hierarchy is thus similar to that of most other file systems: one can create and delete files, move a file from one directory to another, or rename a file. In general, HDFS does not support hard links or soft links, though these could be implemented if the need arises.
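The namespace operations mentioned above can be modelled with a small in-memory Python sketch. The `Namespace` class and the paths are hypothetical and only illustrate the hierarchy; they are not part of any HDFS API.

```python
class Namespace:
    """Toy hierarchical namespace: maps absolute paths to 'dir' or 'file'."""

    def __init__(self):
        self.entries = {"/": "dir"}

    @staticmethod
    def _parent(path):
        return path.rsplit("/", 1)[0] or "/"

    def _require_parent_dir(self, path):
        parent = self._parent(path)
        if self.entries.get(parent) != "dir":
            raise FileNotFoundError(parent)

    def _subtree(self, path):
        # The path itself plus everything stored beneath it.
        return [p for p in self.entries if p == path or p.startswith(path + "/")]

    def mkdir(self, path):
        self._require_parent_dir(path)
        self.entries[path] = "dir"

    def create(self, path):
        self._require_parent_dir(path)
        self.entries[path] = "file"

    def rename(self, src, dst):
        """Rename a file or directory, or move it to another directory."""
        if src not in self.entries:
            raise FileNotFoundError(src)
        self._require_parent_dir(dst)
        for old in self._subtree(src):
            self.entries[dst + old[len(src):]] = self.entries.pop(old)

    def delete(self, path):
        for old in self._subtree(path):
            del self.entries[old]


ns = Namespace()
ns.mkdir("/logs")
ns.mkdir("/archive")
ns.create("/logs/app.log")
ns.rename("/logs/app.log", "/archive/app-old.log")  # move and rename in one step
ns.delete("/logs")
print(sorted(ns.entries))
```

Note that there is no operation for creating links: every path in the sketch refers to exactly one file or directory, in line with the description above.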
Thus, HDFS works on these assumptions and goals to help the user access and process large data sets in a very short time.
Having covered ‘What is HDFS’ in this write-up, we will next discuss the components of HDFS that form a significant part of a Hadoop cluster.