
Everything you need to know about Apache Hadoop


1. What Is Apache Hadoop?

Hadoop is an open source framework used to store and process very large datasets. It provides facilities for data storage, data access, data processing, and security. Many organisations, including Facebook, Yahoo!, Twitter, and LinkedIn, use Hadoop because it can store and process huge amounts of data quickly and reliably. Hadoop is written in Java and is not an OLAP (online analytical processing) system; it is used for batch/offline processing. Moreover, it can be scaled up simply by adding nodes to the cluster.
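To make the batch-processing model concrete, below is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API. The input and output paths are placeholders passed on the command line (the output directory must not already exist), and this is only one common way to structure such a job.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every word in its input split
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer: sums the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // placeholder: input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // placeholder: output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Because the input is split across the cluster's nodes, the same jar runs unchanged whether the cluster has one node or thousands, which is what makes scaling out by adding nodes so straightforward.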


2. History of Hadoop

Hadoop was created by Doug Cutting, the author of Apache Lucene (a widely used text search library), and has its origins in Apache Nutch, an open source web search engine that was itself part of the Lucene project. Nutch was started in 2002 and soon had a working crawler and search system, but its architecture would not scale to the billions of pages on the web.

In 2003 Google published a paper describing the architecture of the Google File System (GFS), which solved the storage needs for the very large files generated as part of the web crawl and indexing process.

In 2004, based on the GFS architecture, the Nutch team started an open source implementation called the Nutch Distributed Filesystem (NDFS). The same year Google published its MapReduce paper, and by 2005 the Nutch developers had a working MapReduce implementation in the Nutch project. Most of the Nutch algorithms were soon ported to run using MapReduce and NDFS.

In February 2006, NDFS and the MapReduce implementation moved out of Nutch to form an independent subproject of Lucene called Hadoop. At around the same time, Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale. This was demonstrated in February 2008 when Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster.

In January 2008, Hadoop became its own top-level project at Apache, confirming its success and its diverse, active community. By this time, Hadoop was being used by many other companies besides Yahoo!, such as Last.fm, Facebook, and the New York Times.

In April 2008, Hadoop broke a world record to become the fastest system to sort a terabyte of data. Running on a 910-node cluster, Hadoop sorted one terabyte in 209 seconds (just under 3½ minutes), beating the previous year’s winner of 297 seconds.

3. Modules of Hadoop:

Hadoop's Resource Manager has two main components:

  1. Scheduler
  2. Application Manager

The Resource Manager's Scheduler is responsible for allocating the required resources to applications (that is, to each per-application Application Master), as the sketch below illustrates.
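As a small illustration of talking to the Resource Manager, this minimal sketch uses YARN's Java client API to list the applications the Resource Manager is tracking, including the queue the Scheduler assigned each one to. It assumes a reachable cluster whose yarn-site.xml is on the classpath.

import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListApplications {
    public static void main(String[] args) throws Exception {
        // Connects to the Resource Manager configured in yarn-site.xml
        YarnClient client = YarnClient.createYarnClient();
        client.init(new YarnConfiguration());
        client.start();

        // Each report reflects an application submitted to the Resource Manager;
        // the queue field shows where the Scheduler placed it
        List<ApplicationReport> apps = client.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId()
                    + "  queue=" + app.getQueue()
                    + "  state=" + app.getYarnApplicationState());
        }
        client.stop();
    }
}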

Stay tuned with Bleedbytes!