0.0 Resources

Before I begin, I want to list the resources I used to write these notes. I wrote them over several months and did not record every source, so some may be missing; my apologies for that. The main ones are:

Simplilearn YouTube channel: https://www.youtube.com/watch?v=aReuLtY0YMI&ab_channel=Simplilearn
LearningJournal YouTube channel: https://www.youtube.com/watch?v=fyTiJLKEzME&ab_channel=LearningJournal (3 main videos, Apache Spark series)
Learning Spark: Lightning-Fast Data Analytics - Jules S. Damji, Brooke Wenig, Tathagata Das, and Denny Lee (Second Edition 2020, O'Reilly Media)
Spark: The Definitive Guide - Bill Chambers and Matei Zaharia (2018, O'Reilly Media)

1. Spark History

To understand the origins of Spark, this brief section is broken down into four parts. Together they describe its genesis, its inspiration, and its adoption within the community.

1.1 Big Data Problem

For most of their history, computers became faster every year through processor speed increases. As a result, applications also automatically became faster every year, without any changes needed to their code. This trend in hardware stopped around 2005: due to hard limits in heat dissipation, hardware developers stopped making individual processors faster and switched toward adding more parallel CPU cores, all running at the same speed. This change meant that applications suddenly needed to be modified to add parallelism in order to run faster, which set the stage for new programming models such as Apache Spark.

On top of that, the technologies for storing and collecting data did not slow down appreciably in 2005, when processor speeds did. The cost to store 1 TB of data continues to halve roughly every 14 months, meaning that it is very inexpensive for organisations to store large amounts of data. Likewise, technologies for collecting data (sensors, cameras, etc.) continue to drop in cost and improve in resolution.
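
As a rough illustration of the storage-cost trend, a price that halves every 14 months follows cost(t) = cost(0) * 0.5^(t / 14), with t in months. The short Python sketch below applies that rule. The function name `projected_cost` and the starting cost of 100 (arbitrary units) are assumptions made for illustration only, not figures from the sources listed above.

```python
def projected_cost(initial_cost: float, months: float,
                   halving_period: float = 14.0) -> float:
    """Return the cost after `months`, assuming it halves every `halving_period` months."""
    return initial_cost * 0.5 ** (months / halving_period)


if __name__ == "__main__":
    start = 100.0  # hypothetical starting cost per TB, in arbitrary units
    for years in (0, 1, 5, 10):
        cost = projected_cost(start, 12 * years)
        print(f"after {years:>2} years: {cost:.2f}")
```

Because the decline is exponential, even a modest halving period compounds quickly: after 28 months the cost is a quarter of the starting value, and after 42 months an eighth.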

The end result is a world in which collecting data is extremely inexpensive, but processing it requires large, parallel computations, often on clusters of machines.

1.2. Distributed Computing at Google

When we think of scale, we immediately think of the ability of Google’s search engine to index and search data on the internet at lightning speed.

Traditional tools, such as RDBMSs, were not able to handle the scale at which Google wanted to build and search the internet’s indexed documents. The need for new tools suitable for this purpose led to the development of the Google File System (GFS), MapReduce (MR), and Bigtable.