Posts

How Does Spark Run on a Cluster?

 As mentioned in the last article ( https://dataengineeringfromscratch.blogspot.com/2021/02/what-is-apache-spark.html ), the Spark framework primarily comprises drivers, executors, and a cluster manager. Now it's time to look in more detail at how exactly Spark runs on a cluster, both internally and externally.

What is a Cluster?
Spark is a parallel processing framework in which multiple machines/servers take part in executing a task. A cluster is simply a formal term for a group of such machines/servers. This also explains why the Cluster Manager is called that: it manages the cluster, which comprises many machines.

What is a Worker Node?
Worker nodes are the machines on which the cluster manager places drivers and executors when a Spark application is submitted to it.

What is Execution Mode?
An execution mode determines how resources are physically allocated when you run a Spark application. There are three main modes: 1) Cluster Mode...
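The execution mode is usually chosen at submission time. As a rough sketch, assuming a hypothetical application jar, main class, and a YARN cluster manager, a `spark-submit` invocation might look like this (the jar name, class name, and resource sizes below are illustrative, not from the post):

```
# Cluster mode: the driver itself is launched on a worker node by the cluster manager
spark-submit \
  --class com.example.MyApp \        # hypothetical main class
  --master yarn \                    # cluster manager (YARN here)
  --deploy-mode cluster \            # driver runs inside the cluster
  --num-executors 4 \
  --executor-memory 2g \
  my-app.jar                         # hypothetical application jar

# Client mode differs only in where the driver runs: on the submitting machine
# (--deploy-mode client), while executors still run on worker nodes.
```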

What is Apache Spark?

 Data has become huge, and in the foreseeable future we expect it to grow exponentially. With this much data (Big Data), computing, in addition to storage, is a major bottleneck faced by organizations, which opens up opportunities for data engineers. Apache Spark is arguably by far the best way to tackle this in a UNIFIED and PARALLEL way.

UNIFIED
Apache Spark is not limited to transforming data. It is a one-stop solution for ingesting data (simple data loading), transforming data, querying data with SQL, machine learning, and streaming computation. All of these can be achieved with Spark. Users can work in the programming language of their choice - Scala, Java, Python, or R - each of which has libraries for the diverse tasks mentioned above.

PARALLEL
What makes Spark so special? Why is there such huge demand for Spark skills these days? Say the task is to clean a room. Assigning one cleaner to complete the task would definitely take more time ...
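The room-cleaning analogy is just divide-and-conquer: split the work into pieces and hand each piece to a separate worker. This is not Spark itself, but a minimal plain-Python sketch of the same idea using the standard library's thread pool (the `clean_section` function and the sample "room" are invented for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def clean_section(section):
    # Pretend "cleaning" a section means processing each of its items;
    # here we just count them so the example stays self-contained.
    return len(section)

def clean_room(sections, workers=4):
    # Split the room into sections and let several "cleaners" (threads)
    # work on them at once, much as Spark splits a job across executors
    # and combines their partial results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(clean_section, sections))

if __name__ == "__main__":
    room = [["dust", "mop"], ["vacuum"], ["wipe", "polish", "tidy"]]
    print(clean_room(room))
```

Spark applies the same pattern at cluster scale: instead of threads on one machine, the units of work are tasks scheduled onto executors running on many machines.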