Apache Spark skills open up a lucrative career path in today’s IT world. If you are preparing for an interview, it means you have completed your training and are now looking for a job. Large firms like Amazon, JPMorgan and eBay have adopted Apache Spark for their Big Data deployments.
To prepare you well for your interview, here are some of the top Apache Spark interview questions both for beginners and experts.
Apache Spark is a data processing framework with an advanced execution engine that supports cyclic data flow and in-memory computing. It is capable of accessing diverse data sources such as HDFS, HBase and Cassandra.
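As a minimal sketch of that data-source access, run in the spark-shell (where a SparkContext named sc is already defined) against a purely hypothetical HDFS path:
scala> val logs = sc.textFile("hdfs://namenode:9000/data/logs.txt")   // placeholder HDFS path
scala> logs.count()   // number of lines in the file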
Resilient Distributed Datasets, abbreviated as RDDs, make up the fundamental data structure of Apache Spark and are part of Spark Core. The data held in an RDD is immutable and distributed, and RDDs are fault-tolerant collections of elements that can be operated on in parallel. They can be classified as parallelized collections (built from an existing collection in the driver program) and Hadoop datasets (built from files in HDFS or other Hadoop-supported storage), as the sketch below illustrates.
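A minimal spark-shell sketch of both flavours (the HDFS path is a placeholder):
scala> val fromCollection = sc.parallelize(Seq(1, 2, 3, 4))                  // parallelized collection
scala> val fromHadoop = sc.textFile("hdfs://namenode:9000/data/input.txt")   // Hadoop dataset (placeholder path)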
Apache Mesos abstracts CPU, memory, storage and other computing resources away from physical or virtual machines, making it easy to build and run fault-tolerant, elastic distributed systems. Spark can connect to Mesos in four simple steps: configure the Spark driver program to connect to the Mesos master, put a Spark binary package in a location accessible by Mesos, install Spark in the same location as Mesos, and configure the spark.mesos.executor.home property to point to where Spark is installed.
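As an illustrative sketch of the first step only, a standalone application (not the spark-shell, where a context already exists) points its master URL at Mesos; the host and port below are placeholders:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MesosExample")
  .setMaster("mesos://mesos-master:5050")   // placeholder Mesos master URL
val sc = new SparkContext(conf)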
Spark Datasets are strongly typed data structures that give JVM objects the same benefits as RDDs (such as compile-time type safety) while also taking advantage of Spark SQL’s optimised execution engine.
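A minimal spark-shell sketch (the case class and values are made up for illustration; spark is the SparkSession the shell predefines):
scala> import spark.implicits._
scala> case class Person(name: String, age: Int)
scala> val people = Seq(Person("Asha", 29), Person("Ravi", 35)).toDS()   // strongly typed Dataset[Person]
scala> people.filter(_.age > 30).show()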
Spark SQL, which grew out of the earlier Shark project, is the Spark module for structured data processing. It lets Spark run relational SQL queries on data. The core of this component supports SchemaRDD (now known as DataFrame), a special kind of RDD made up of row objects together with a schema that defines the data type of each column in a row; it is similar to a table in a relational database.
Spark SQL can load data from a variety of structured sources (such as JSON, Hive and Parquet), query it with SQL from inside a Spark program or from external tools that connect through JDBC/ODBC, and mix SQL queries with programmatic data manipulation, as in the sketch below.
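A minimal spark-shell sketch using a made-up in-memory dataset:
scala> import spark.implicits._
scala> val df = Seq(("Asha", 29), ("Ravi", 35)).toDF("name", "age")
scala> df.createOrReplaceTempView("people")                         // register a temporary SQL view
scala> spark.sql("SELECT name FROM people WHERE age > 30").show()   // relational query over the DataFrame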
There are three major types of cluster managers supported by the Spark framework: Standalone (Spark’s own built-in manager), Apache Mesos and Hadoop YARN.
Parquet is a columnar file format supported by many data processing systems. Spark can perform both read and write operations on Parquet files.
Storing data in Parquet brings several advantages: being columnar, it lets queries fetch only the columns they need, it consumes less space, and it supports efficient, type-aware encoding and compression. The sketch below shows Spark reading and writing Parquet.
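A minimal spark-shell sketch, reusing the df DataFrame from the Spark SQL example above and a hypothetical output path:
scala> df.write.parquet("/tmp/people.parquet")                    // write the DataFrame as Parquet (placeholder path)
scala> val parquetDF = spark.read.parquet("/tmp/people.parquet")
scala> parquetDF.printSchema()                                    // the schema is preserved in the Parquet metadata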
In Spark, data is redistributed across partitions, which may lead to the movement of data across the executors. This redistribution process is known as shuffling, an operation that is implemented differently in Spark when compared to Hadoop.
There are two important compression parameters for shuffling: spark.shuffle.compress, which controls whether the engine compresses shuffle outputs, and spark.shuffle.spill.compress, which controls whether intermediate shuffle spill files are compressed.
Shuffling occurs when two tables are joined or when byKey operations such as groupByKey or reduceByKey are performed, as in the sketch below.
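A minimal spark-shell sketch with made-up key/value pairs:
scala> val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
scala> val sums = pairs.reduceByKey(_ + _)   // aggregating by key forces a shuffle across partitions
scala> sums.collect()                        // e.g. Array((a,4), (b,2))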
In Spark, Actions make it possible to transfer data from an RDD back to the local machine (the driver). reduce() and take() are examples of Actions: reduce() repeatedly applies a function to pairs of elements until only a single value is left, while take(n) brings the first n values of the RDD back to the local machine.
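A minimal spark-shell sketch of both actions:
scala> val nums = sc.parallelize(1 to 10)
scala> nums.reduce(_ + _)   // combines elements pairwise until one value remains: 55
scala> nums.take(3)         // returns the first three elements to the driver: Array(1, 2, 3)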
Spark supports stream processing through Spark Streaming, an extension of the Spark API for processing live data streams. In this process, data from sources such as Kafka, Flume and Kinesis is processed and then pushed to file systems, live dashboards and databases. Stream processing is similar to batch processing: just as input data is divided into batches for batch processing, the live stream is divided into small batches (micro-batches) for stream processing.
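A minimal sketch of a streaming application, assuming an existing SparkContext sc and a placeholder socket source on localhost:9999:
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))       // 10-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)   // placeholder live source
lines.map(_.toUpperCase).print()                      // process each micro-batch and print a sample
ssc.start()                                           // start receiving and processing data
ssc.awaitTermination()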
Also known as Persistence, caching is a technique used to optimise Spark computations. Like RDDs, DStreams allow developers to persist the stream’s data in memory: calling persist() on a DStream keeps every RDD generated from that DStream in memory, so partial results can be saved and reused in later stages. For input streams that receive data over the network (for example from Kafka or Flume), the default persistence level replicates the data to two nodes for fault-tolerance.
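A minimal sketch, reusing the lines DStream from the streaming example above plus an RDD read from a placeholder path:
lines.persist()                                                          // keep each generated RDD of the DStream in memory
val words = sc.textFile("hdfs://namenode:9000/data/words.txt").cache()   // placeholder path; cache() is persist(MEMORY_ONLY) for RDDs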
GraphX is Spark’s API for graph processing. It lets programmers build and transform graphs and run graph-parallel computations on data even at a very large scale.
PageRank is one of the graph algorithms that ship with GraphX; it measures the importance of each vertex in a graph. The idea is used widely on social media platforms, where a user with a huge following will tend to be ranked highly on the respective platform, as in the sketch below.
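A minimal spark-shell sketch with a made-up "follows" graph (vertex IDs and names are purely illustrative):
import org.apache.spark.graphx.{Edge, Graph}

val users = sc.parallelize(Seq((1L, "Asha"), (2L, "Ravi"), (3L, "Meera")))
val follows = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 2L, "follows")))
val graph = Graph(users, follows)             // build the property graph
val ranks = graph.pageRank(0.0001).vertices   // the most-followed vertex gets the highest rank
ranks.collect().foreach(println)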
An RDD can be converted into a DataFrame in either of two ways: by calling the toDF() helper (available after importing spark.implicits._) or by passing the RDD to SparkSession.createDataFrame(). Both appear in the sketch below.
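A minimal spark-shell sketch of both conversions, using a made-up RDD of tuples:
scala> import spark.implicits._
scala> val rdd = sc.parallelize(Seq(("Asha", 29), ("Ravi", 35)))
scala> val df1 = rdd.toDF("name", "age")                          // way 1: toDF()
scala> val df2 = spark.createDataFrame(rdd).toDF("name", "age")   // way 2: SparkSession.createDataFrame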
There are two types of operations supported by RDDs: transformations, which lazily define a new RDD from an existing one (for example map or filter), and actions, which trigger the computation and return a result to the driver (for example collect or reduce). The sketch below shows one of each.
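A minimal spark-shell sketch:
scala> val rdd = sc.parallelize(1 to 5)
scala> val doubled = rdd.map(_ * 2)   // transformation: nothing is computed yet
scala> doubled.collect()              // action: runs the job and returns Array(2, 4, 6, 8, 10)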
Broadcast variables enable programmers to hold on to a read-only variable that gets cached on every machine instead of shipping a copy of it with relevant tasks. These variables can be used to share a copy of large input datasets with each node very efficiently. Using efficient broadcast algorithms, Spark distributes broadcast variables to reduce communication costs.
scala> val broadcastVar = sc.broadcast(Array(1,2,3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)
scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)
Apache Spark offers an API to add and manage checkpoints. Checkpointing is a process used to make streaming applications resilient to failures: it saves the data and metadata into a checkpointing directory, and if the application fails for any reason, Spark can recover this data and resume from where it left off.
Checkpointing is used for two types of data in Spark: metadata checkpointing, which saves the information defining the streaming computation (configuration, DStream operations and incomplete batches), and data checkpointing, which saves the generated RDDs themselves. A sketch follows below.
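A minimal spark-shell sketch of RDD checkpointing with a placeholder directory (streaming applications call the analogous StreamingContext.checkpoint(directory)):
scala> sc.setCheckpointDir("hdfs://namenode:9000/checkpoints")   // placeholder checkpoint directory
scala> val rdd = sc.parallelize(1 to 100)
scala> rdd.checkpoint()   // mark the RDD for checkpointing
scala> rdd.count()        // the checkpoint is materialised when an action runs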
Aarav Goel has four years of experience in the education industry and is a passionate blogger who writes about technology.