
Everything you need to know about Apache Hadoop


1. What Is Apache Hadoop?

Hadoop is an open source framework used to store and process very large datasets. It provides facilities for data storage, data access, data processing, and security. Many organisations, including Facebook, Yahoo!, Twitter, and LinkedIn, use Hadoop because it can store and process huge amounts of data quickly and reliably. Hadoop is written in Java and is not an OLAP (online analytical processing) system; it is used for batch/offline processing. Moreover, it can be scaled up simply by adding nodes to the cluster.
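To make the batch-processing model concrete, below is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API. The input and output paths are placeholders passed on the command line (the output directory must not already exist), and this is only one common way to structure such a job.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every word in its input split
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer: sums the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // placeholder: input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // placeholder: output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Because the input is split across the cluster's nodes, the same jar runs unchanged whether the cluster has one node or thousands, which is what makes scaling out by adding nodes so straightforward.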


2. History of Hadoop

Hadoop was created by Doug Cutting, the author of Apache Lucene (a widely used text search library), and has its origins in Apache Nutch, an open source web search engine that was itself part of the Lucene project. Nutch was started in 2002 and soon had a working crawler and search system, but its architecture would not scale to the billions of pages on the web.

In 2003 Google published a paper describing the architecture of the Google File System (GFS), which solved the storage needs for the very large files generated as part of the web crawl and indexing process.

In 2004, based on the GFS architecture, the Nutch team started an open source implementation called the Nutch Distributed Filesystem (NDFS). The same year Google published its MapReduce paper, and by 2005 the Nutch developers had a working MapReduce implementation in the Nutch project. Most of the Nutch algorithms were soon ported to run using MapReduce and NDFS.

In February 2006, NDFS and the MapReduce implementation moved out of Nutch to form an independent subproject of Lucene called Hadoop. At around the same time, Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale. This was demonstrated in February 2008 when Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster.

In January 2008, Hadoop became its own top-level project at Apache, confirming its success and its diverse, active community. By this time, Hadoop was being used by many other companies besides Yahoo!, such as Last.fm, Facebook, and the New York Times.

In April 2008, Hadoop broke a world record to become the fastest system to sort a terabyte of data. Running on a 910-node cluster, Hadoop sorted one terabyte in 209 seconds (just under 3½ minutes), beating the previous year’s winner of 297 seconds.

3. Modules of Hadoop:

Hadoop's Resource Manager has two main components:

  1. Scheduler
  2. Application Manager

The Resource Manager's Scheduler is responsible for allocating the required resources to applications (that is, to each per-application Application Master), as the sketch below illustrates.
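As a small illustration of talking to the Resource Manager, this minimal sketch uses YARN's Java client API to list the applications the Resource Manager is tracking, including the queue the Scheduler assigned each one to. It assumes a reachable cluster whose yarn-site.xml is on the classpath.

import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListApplications {
    public static void main(String[] args) throws Exception {
        // Connects to the Resource Manager configured in yarn-site.xml
        YarnClient client = YarnClient.createYarnClient();
        client.init(new YarnConfiguration());
        client.start();

        // Each report reflects an application submitted to the Resource Manager;
        // the queue field shows where the Scheduler placed it
        List<ApplicationReport> apps = client.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId()
                    + "  queue=" + app.getQueue()
                    + "  state=" + app.getYarnApplicationState());
        }
        client.stop();
    }
}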

Stay tuned with Bleedbytes!