Is Apache Spark the end of MapReduce?
The previous articles discussed the Hadoop framework and the way it operates on data through MapReduce programs. However, the MapReduce framework has some drawbacks. Let us analyze the key differences between MapReduce and Apache Spark in detail.
MapReduce runs in three phases.
1) The mapper program reads input data (and the corresponding metadata) from HDFS and applies the map operation to it.
2) The intermediate data is written to the local file system, not HDFS, so that the reducer can operate on the key-value pairs emitted by the mapper.
3) The reducer picks up the mapper output from the local file system, applies the reduce operation, and writes the final output back to HDFS. A minimal word-count example of these three phases is sketched below.
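As a rough illustration of these phases, here is a minimal word-count job written against the Hadoop mapreduce API from Scala (the JVM language this article recommends). It is a sketch only: the class names, input/output paths, and the Scala 2.13 collection converters are assumptions, not part of the original article.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import scala.jdk.CollectionConverters._   // assumes Scala 2.13

// Phase 1: the mapper reads lines from HDFS and emits (word, 1) pairs
class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
      word.set(w)
      ctx.write(word, one)   // Phase 2: spilled to local disk, then shuffled
    }
}

// Phase 3: the reducer sums the shuffled counts and writes the result to HDFS
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit =
    ctx.write(key, new IntWritable(values.asScala.map(_.get).sum))
}

object WordCount {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(getClass)
    job.setMapperClass(classOf[TokenMapper])
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))    // HDFS input path
    FileOutputFormat.setOutputPath(job, new Path(args(1)))  // HDFS output path
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```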
However, there are inherent problems with this design. Imagine a case where the MapReduce program fails due to a network, read, or write error and the output is never produced. This may not be a big issue when the input data is small or there is no urgent need for the output. But when complex MapReduce programs run over huge datasets and a job fails, there is no option except to rerun the job from the start. In the case of bigger failures, such as a NameNode failure in the Hadoop cluster or a network outage, the delay and interruption to the data pipeline can be significant.
The second major problem with the MapReduce framework is the complexity of joins. Any join over structured datasets is far from trivial and requires careful programming and a solid understanding of the framework to implement.
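By contrast, the same kind of structured join is a one-liner in Spark SQL. The sketch below is purely illustrative: the claims and providers datasets, the shared provider_id column, and the HDFS paths are assumptions, not from the original article.

```scala
import org.apache.spark.sql.SparkSession

object JoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("join-sketch").getOrCreate()

    // Hypothetical inputs: two structured datasets sharing a provider_id column
    val claims    = spark.read.parquet("hdfs:///data/claims")     // assumed path
    val providers = spark.read.parquet("hdfs:///data/providers")  // assumed path

    // One line replaces what would be a hand-written map-side or
    // reduce-side join in plain MapReduce
    val joined = claims.join(providers, Seq("provider_id"), "inner")

    joined.write.parquet("hdfs:///data/claims_enriched")          // assumed path
    spark.stop()
  }
}
```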
Similar to Hadoop, Spark also provides streaming, and it exposes APIs in Java, Python, and Scala. However, since Spark is written natively in Scala, which runs on the JVM, it is generally better to develop Spark programs in Scala.
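Here is a minimal sketch of a Spark Streaming word count in Scala, assuming a hypothetical text source on localhost:9999 and a 10-second batch interval (it uses the classic DStream API; the source and interval are illustrative choices, not requirements).

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    // Micro-batch context with a 10-second batch interval (assumed)
    val conf = new SparkConf().setAppName("streaming-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Hypothetical source: a text stream served on localhost:9999
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    counts.print()          // print each batch's word counts
    ssc.start()
    ssc.awaitTermination()
  }
}
```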
Another great advantage of Spark is the ability to cache data transformations. When applying transformations, one often needs to keep the current state of the data available for later steps and for debugging. In a complex pipeline, there are several points where different teams consume the data after a given set of transformations. To understand this better, imagine a healthcare company processing medical claims data that is used by the data warehousing, software/app, and analytics/business teams. They have different use cases and require the data at different transformation levels. A sketch of such a pipeline follows.
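The following is a minimal, hypothetical sketch of such a pipeline in Scala; the dataset, column names, and HDFS paths are all assumptions. The key point is the single cache() call on the shared, cleansed dataset, which lets each downstream consumer reuse the in-memory result instead of recomputing the transformations from HDFS.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ClaimsPipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("claims-pipeline-sketch").getOrCreate()

    // Raw medical claims ingested once (path and schema are assumptions)
    val rawClaims = spark.read.parquet("hdfs:///claims/raw")

    // Shared cleansing step: cached so every downstream team reuses the
    // in-memory result instead of re-reading and re-transforming the raw data
    val cleanClaims = rawClaims
      .filter(col("claim_amount") > 0)
      .dropDuplicates("claim_id")
      .cache()

    // Data-warehouse team: full cleansed extract
    cleanClaims.write.mode("overwrite").parquet("hdfs:///claims/warehouse")

    // Analytics/business teams: aggregated view built from the same cached data
    cleanClaims.groupBy("provider_id")
      .agg(sum("claim_amount").as("total_claims"))
      .write.mode("overwrite").parquet("hdfs:///claims/by_provider")

    spark.stop()
  }
}
```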