What is Apache Spark?
Apache Spark is an open-source, general-purpose distributed cluster-computing framework. Open-source means the code can be freely used, inspected, and modified by anyone. Spark was originally developed at UC Berkeley in 2009 and later donated to the Apache Software Foundation.
Spark has also been one of the most active open-source Big Data projects, with more than 500 contributors from over 150 organizations. It provides high-level APIs in Java, Scala, Python, and R.
Why move to a new technology when Hadoop MapReduce already exists?
MapReduce requires intermediate data to be written to disk between each step, so the I/O cost of a MapReduce job is high, which makes interactive analysis and iterative algorithms very expensive. By keeping data in memory, Spark can run workloads up to 100 times faster than Hadoop MapReduce, and up to 10 times faster even when the data must be read from disk.
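To see why iterative algorithms suffer under the disk-per-step model, here is a minimal plain-Python sketch (not Spark code; the names `load_from_disk` and `run_iterations` are illustrative). It counts how many times the input must be re-read when each iteration goes back to disk versus when the dataset is cached in memory once:

```python
# Conceptual sketch: cost of iterative algorithms when every step
# re-reads its input from disk (MapReduce-style) versus caching it
# in memory once (Spark-style). Not a real Spark API.

disk_reads = 0

def load_from_disk():
    """Simulate reading the input dataset from disk."""
    global disk_reads
    disk_reads += 1
    return [1, 2, 3, 4, 5]

def run_iterations(n, cache=False):
    """Run n passes over the data, optionally caching it in memory."""
    global disk_reads
    disk_reads = 0
    cached = load_from_disk() if cache else None
    total = 0
    for _ in range(n):
        data = cached if cache else load_from_disk()
        total += sum(data)
    return total, disk_reads

# MapReduce-style: 10 iterations cause 10 disk reads.
_, reads_mr = run_iterations(10, cache=False)
# Spark-style: load once, then iterate entirely in memory.
_, reads_spark = run_iterations(10, cache=True)
```

The I/O gap grows linearly with the number of iterations, which is exactly the pattern of machine learning and graph algorithms.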
Why is Apache Spark needed?
In the industry, there is a need for a general-purpose cluster-computing tool, because existing engines are specialized:
- Hadoop MapReduce can only perform batch processing.
- Apache Storm / S4 can only perform stream processing.
- Apache Impala / Apache Tez can only perform interactive processing.
- Neo4j / Apache Giraph can only perform graph processing.
Hence there is a big demand in the industry for a powerful engine that can process data in real time (streaming) as well as in batch mode, respond in sub-second time, and perform in-memory processing.
Spark provides several built-in components for these workloads:
- Spark Core: the foundation of the overall project. It delivers speed through in-memory computation and is built around a special distributed collection called the RDD (Resilient Distributed Dataset).
- SparkR: the key component of SparkR is the SparkR DataFrame. DataFrames are a fundamental data structure for data processing in R.
- Spark SQL: Spark's own SQL engine. It covers features of both SQL and Hive.
- MLlib: Spark's machine learning library. It can perform machine learning without the help of Apache Mahout.
- GraphX: Spark performs graph processing through the GraphX component.
- Spark Streaming: uses Spark Core's fast scheduling capability to perform streaming analysis.
All of these components are built into Spark, and they can run on cluster managers such as Hadoop YARN and Apache Mesos.
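The RDD model in Spark Core is easiest to see through the classic word count. The sketch below is plain Python mimicking the shape of the RDD operations (in PySpark itself this would be roughly `sc.textFile(...).flatMap(...).map(...).reduceByKey(...)`; the sample lines here are made up for illustration):

```python
# Plain-Python sketch of RDD-style word count. Each step mirrors an
# RDD transformation: flatMap -> map -> reduceByKey.
from collections import Counter

lines = ["spark is fast", "spark is general purpose"]

# flatMap: split each line into individual words
words = [w for line in lines for w in line.split()]

# map: pair each word with an initial count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts for each distinct word
counts = Counter()
for w, n in pairs:
    counts[w] += n
```

In real Spark, each of these steps is a lazy transformation over a dataset partitioned across the cluster, so the same three-step pipeline scales from a laptop to thousands of nodes.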
Will Hadoop be replaced by Spark?
First of all, you should understand what makes Spark different from Hadoop.
Hadoop will not simply be replaced by Spark; the two complement each other. But there is something you need to know.
The Hadoop ecosystem consists of three layers:
- HDFS – Most reliable storage layer
- YARN – Resource Management Layer
- MapReduce – Processing Layer.
Spark does not have a storage layer or a resource management system of its own. So Apache Spark can replace Hadoop MapReduce, but not Hadoop as a whole. This is because Spark can handle many types of workloads, such as batch, interactive, iterative, and streaming processing.
Spark's feature set covers almost all of Hadoop's processing capabilities. For example, Spark can perform batch processing and real-time data processing alike: it has its own streaming engine, Spark Streaming, that can process streaming data.
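Spark Streaming achieves this by dividing the live stream into small micro-batches and running ordinary batch logic on each one. Here is a minimal plain-Python sketch of that micro-batch idea (not the Spark Streaming API; `micro_batches` and `process_batch` are illustrative names):

```python
# Conceptual sketch of the micro-batch model: chop an incoming stream
# into fixed-size batches and process each batch with the same logic
# you would use for batch data.

def micro_batches(stream, batch_size):
    """Group an event stream into fixed-size batches."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:               # flush the final partial batch
        yield batch

def process_batch(batch):
    """The same batch logic runs on every micro-batch."""
    return sum(batch)

events = range(10)          # stands in for a live event stream
results = [process_batch(b) for b in micro_batches(events, 3)]
# batches are [0,1,2], [3,4,5], [6,7,8], [9]
```

Because batch and streaming share the same processing logic, one engine can serve both workloads, which is the central point of the comparison with MapReduce.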
Hadoop MapReduce and Apache Spark can also be compared on other factors, such as speed, in-memory computation, reusability, fault tolerance, ease of management, real-time analysis, latency, streaming support, and multi-language support.
Unlike Hadoop, Spark does not come with its own file system; instead, it integrates with many storage systems, including Hadoop's HDFS, MongoDB, and Amazon S3.
This flexibility is one reason data scientists still prefer the Spark framework.
Share if you liked reading the article. Stay tuned with bleedbytes.