Top Apache Spark Interview Questions and Answers 2023

By Aarav Goel 06-Feb-2023

Apache Spark is a lucrative career path in today’s IT world. If you are preparing for an interview, you have most likely completed your training and are now looking for a job. Large firms like Amazon, JPMorgan and eBay have adopted Apache Spark for their Big Data deployments.

To prepare you well for your interview, here are some of the top Apache Spark interview questions both for beginners and experts.

1. What is Apache Spark? How does it compare to MapReduce?

Apache Spark is a data processing framework with an advanced execution engine that supports a cyclic data flow and in-memory computing. It is capable of accessing diverse data sources like HDFS, HBase and Cassandra.

  • Apache Spark can be up to 100 times faster than MapReduce and offers far better support for Machine Learning applications.
  • It performs computations in memory, unlike MapReduce, which relies on hard disk storage.
  • Spark can integrate with Hadoop but is not dependent on it, whereas MapReduce cannot run without Hadoop.

2. What are the key features of Apache Spark?

  • Spark integrates with Hadoop and can also run in the Cloud
  • It has an interactive language shell based on Scala, the language Spark is written in
  • Spark is built around Resilient Distributed Datasets (RDDs), which can be cached across the computing nodes of a cluster
  • Apache Spark supports multiple analytics tools for interactive query analysis, real-time analysis and graph processing

3. What are Resilient Distributed Datasets?

Resilient Distributed Datasets, abbreviated as RDDs, are the fundamental data structure of Apache Spark. They are part of Spark Core. The data held in an RDD is immutable and distributed. RDDs are a fault-tolerant collection of elements that can be operated on in parallel. They can be classified as (see the sketch after this list):

  • Parallelized collections: RDDs created by distributing an existing collection from the driver program across the cluster
  • Hadoop datasets: RDDs created from files in HDFS or any other Hadoop-supported storage system
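
As a brief illustration of both creation paths, here is a minimal sketch using the standard RDD API, assuming the spark-shell’s built-in SparkContext sc (the HDFS path is hypothetical):

// Parallelized collection: distribute an in-memory Scala collection
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Hadoop dataset: build an RDD from a file in HDFS or another storage system
val lines = sc.textFile("hdfs:///data/input.txt")

// Both RDDs can now be operated on in parallel
numbers.map(_ * 2).collect()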

4. What is Apache Mesos and how can you connect Spark to it?

Apache Mesos abstracts CPU, memory, storage and other computing resources away from physical or virtual machines, which makes it easy to build and run fault-tolerant and elastic distributed systems. Spark can connect to Mesos in four simple steps (a minimal configuration sketch follows the list):

  • Configure the Spark driver program to connect to Mesos
  • Put the Spark binary package in a location that Mesos can access
  • Install Spark in the same location as Mesos
  • Configure the spark.mesos.executor.home property to point to the location where Spark is installed
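
A minimal configuration sketch, assuming a Mesos master at mesos-master:5050, a Spark binary package uploaded to HDFS and Spark installed under /opt/spark on the agent nodes (all three values are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("SparkOnMesos")
  .setMaster("mesos://mesos-master:5050")                       // point the driver at the Mesos master
  .set("spark.executor.uri", "hdfs:///spark/spark-binary.tgz")  // binary package Mesos can fetch
  .set("spark.mesos.executor.home", "/opt/spark")               // where Spark is installed on the agents

val sc = new SparkContext(conf)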

5. What are Spark Datasets?

Spark Datasets are data structures within Spark that give JVM objects the same benefits as RDDs, combined with the optimised execution of the Spark SQL engine.
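
A brief sketch of the typed Dataset API, assuming the spark-shell’s SparkSession spark with its implicits in scope (the Person class and data are made up for illustration):

import spark.implicits._

case class Person(name: String, age: Int)

// A strongly typed, distributed collection of JVM objects, executed by the Spark SQL engine
val people = Seq(Person("Alice", 29), Person("Bob", 35)).toDS()

people.filter(_.age > 30).show()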


6. Which are the most frequently used Spark ecosystems?

  • Spark SQL (Shark), the most commonly used by developers
  • Spark Streaming for processing live data streams
  • GraphX for graph generation and computation
  • MLlib for Machine Learning algorithms
  • SparkR for using R programming within the Spark engine

7. What is Spark SQL?

Spark SQL, which grew out of the earlier Shark project, is the Spark module for structured data processing. Through Spark SQL, Spark can run relational SQL queries on data. The core of this component supports the SchemaRDD, a different type of RDD from regular ones. A SchemaRDD is made up of row objects and a schema object that defines the data type of each column within a row, making it similar to a table in a relational database.
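
For example, a DataFrame can be registered as a temporary view and queried with SQL (the file name and columns are hypothetical):

val people = spark.read.json("people.json")
people.createOrReplaceTempView("people")

val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()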

8. What are the functions of Spark SQL?

Spark SQL can

  • Load data from multiple structured sources
  • Query data with SQL statements, both within a Spark program and from external tools such as Tableau that connect to Spark SQL through standard database connectors like JDBC and ODBC
  • Integrate SQL with regular code written in Python, Java or Scala, including the ability to join RDDs with SQL tables and to expose custom functions in SQL (see the sketch after this list)
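
As a sketch of the last point, a Scala function can be registered once and then called from SQL (this reuses the hypothetical people view from the previous example):

// Register a Scala function as a SQL UDF
spark.udf.register("initials", (name: String) => name.split(" ").map(_.head).mkString)

spark.sql("SELECT name, initials(name) AS initials FROM people").show()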

9. What are the different types of Cluster Managers in Spark?

There are three major types of Cluster Managers supported by the Spark framework; the master URL used to select each one is sketched after the list.

  • Standalone - a basic Cluster Manager used to set up a cluster
  • Apache Mesos - a commonly used Cluster Manager that can also run Hadoop MapReduce and other applications
  • YARN - the Cluster Manager responsible for resource management in Hadoop
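
A hedged sketch of the three master URL forms (host names and ports are placeholders):

import org.apache.spark.SparkConf

// Standalone cluster manager
val standaloneConf = new SparkConf().setMaster("spark://master-host:7077")

// Apache Mesos
val mesosConf = new SparkConf().setMaster("mesos://mesos-master:5050")

// YARN: the resource manager address is taken from the Hadoop configuration on the classpath
val yarnConf = new SparkConf().setMaster("yarn")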

10. What is a Parquet file? List its advantages.

Parquet is a columnar format supported by many data processing systems. Spark can perform both read and write operations on Parquet files.

There are many advantages to having a Parquet file.

  • You can fetch specific columns and access them easily
  • It takes up less space
  • It follows type-specific encoding
  • It limits I/O operations (a short read/write sketch follows this list)
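
A short read/write sketch, assuming a DataFrame named people and a hypothetical HDFS path:

// Write a DataFrame out in Parquet format
people.write.parquet("hdfs:///data/people.parquet")

// Read it back; only the requested column is actually scanned
val names = spark.read.parquet("hdfs:///data/people.parquet").select("name")
names.show()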

11. What is shuffling? When does it occur?

In Spark, data is redistributed across partitions, which may lead to the movement of data across the executors. This redistribution process is known as shuffling, an operation that is implemented differently in Spark when compared to Hadoop.

There are two important compression parameters for shuffling:

  1. spark.shuffle.compress - determines whether the engine compresses the shuffle output
  2. spark.shuffle.spill.compress - determines whether intermediate shuffle spill files are compressed

Shuffling occurs when two tables are being joined or when you’re performing byKey operations such as groupByKey or reduceByKey.
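
For instance, reduceByKey forces a shuffle so that all values for a key end up in the same partition (a minimal sketch):

val words = sc.parallelize(Seq("spark", "hadoop", "spark", "hive"))
val pairs = words.map(word => (word, 1))

// reduceByKey is a byKey operation, so data is shuffled across partitions
val counts = pairs.reduceByKey(_ + _)
counts.collect()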

12. What is an Action in Spark?

In Spark, Actions make it possible to transfer data from an RDD back to the local machine. reduce() and take() are examples of Actions. The reduce() function repeatedly combines the RDD’s elements until only one value is left, while take(n) fetches the first n elements of the RDD to the local machine where the information is needed.
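
A quick illustration of both Actions, assuming the spark-shell’s SparkContext sc:

val numbers = sc.parallelize(1 to 10)

// reduce() repeatedly combines elements until a single value is returned to the driver
val sum = numbers.reduce(_ + _)    // 55

// take(n) brings the first n elements back to the local machine
val firstThree = numbers.take(3)   // Array(1, 2, 3)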

13. What is Spark Streaming?

Spark Streaming is an extension of the Spark API that supports stream processing of live data streams. During streaming, data from sources such as Kafka, Flume and Kinesis is processed and then pushed to file systems, live dashboards and databases. Stream processing is similar to batch processing: just as input data is divided into batches for batch processing, it is divided into small batches of streams for stream processing.
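
A minimal DStream sketch that counts words arriving on a socket, assuming a hypothetical text source on localhost port 9999:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Create a streaming context with a 1-second batch interval
val ssc = new StreamingContext(sc, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()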

14. What is Caching in Spark Streaming?

Also known as Persistence, Caching is a technique used to optimise Spark computations. DStreams, which work much like RDDs, allow developers to persist the stream’s data in memory. In other words, calling persist() on a DStream keeps every RDD of that DStream in memory, so partial results can be saved and reused in later stages. For input streams that receive data over the network, the default persistence level replicates the data to two nodes for fault tolerance.
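
For example, a DStream that is reused by several operations can be persisted explicitly (continuing the socket-stream sketch above):

import org.apache.spark.storage.StorageLevel

// Keep each RDD generated by this DStream in memory, in serialised form
val cached = lines.persist(StorageLevel.MEMORY_ONLY_SER)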

15. What is GraphX?

GraphX is used for graph processing. It can build and transform interactive graphs and enables programmers to structure and visualise data even at a very large scale.
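
A small property graph built with GraphX (the vertex and edge data are made up for illustration):

import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 1L, "follows")))

val graph = Graph(vertices, edges)
println(graph.numVertices + " vertices, " + graph.numEdges + " edges")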


16. What is PageRank?

PageRank is an algorithm provided by GraphX that measures the importance of each vertex in a graph. The idea is widely used on social media platforms, where a user with a large following will be ranked highly on the respective platform.
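
Running PageRank on the graph from the previous sketch, with a convergence tolerance of 0.0001:

// Each vertex receives a rank; vertices with more incoming links rank higher
val ranks = graph.pageRank(0.0001).vertices
ranks.collect().foreach { case (id, rank) => println(s"$id -> $rank") }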

17. How do you convert a Spark RDD into a Dataframe?

An RDD can be converted into a Dataframe in either one of the following two ways:

  • Using the helper function toDF:

    import com.mapr.db.spark.sql._
    val df = sc.loadFromMapRDB(<tablename>)
      .where(field("first_name") === "Peter")
      .select("_id", "first_name")
      .toDF()

  • Using SparkSession.createDataFrame: call createDataFrame on a SparkSession object, passing the RDD and a schema (a generic sketch follows this list):

    def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame
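
As a generic sketch of the second approach, using made-up column names and data:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val rowRDD = sc.parallelize(Seq(Row("Alice", 29), Row("Bob", 35)))
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

val df = spark.createDataFrame(rowRDD, schema)
df.show()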

18. Which types of operations are supported by RDDs?

There are two types of operations supported by RDDs:

  • Transformations: These are operations performed on an RDD to create a new RDD that contains the results.
  • Actions: These operations give back a value after running a computation on an RDD (see the sketch after this list).
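
A compact illustration of the difference, assuming the spark-shell’s SparkContext sc:

val numbers = sc.parallelize(1 to 5)

// Transformations are lazy and return a new RDD
val doubled = numbers.map(_ * 2)
val evens = doubled.filter(_ % 4 == 0)

// Actions trigger the computation and return a value to the driver
val total = evens.count()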

19. Explain the importance of broadcast variables.

Broadcast variables enable programmers to hold on to a read-only variable that gets cached on every machine instead of shipping a copy of it with relevant tasks. These variables can be used to share a copy of large input datasets with each node very efficiently. Using efficient broadcast algorithms, Spark distributes broadcast variables to reduce communication costs.

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)

20. Does Apache Spark have checkpoints?

Apache Spark offers an API to add and manage checkpoints. Checkpointing is a process used to make streaming applications resilient to failures. It lets you save data and metadata to a checkpointing directory. If there is a failure of any sort, Spark can recover this data and restart from where it left off.

Checkpointing is used for two types of data in Spark:

  1. Metadata Checkpointing: Metadata is data about data. Metadata checkpointing means saving the metadata to fault-tolerant storage such as HDFS. Metadata includes configurations, DStream operations and incomplete batches.
  2. Data Checkpointing: Generated RDDs are saved to reliable storage because some stateful transformations need it; in those cases, each new RDD depends on RDDs from previous batches (a one-line setup sketch follows this list).
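
Enabling checkpointing only requires pointing the streaming context at a fault-tolerant directory (the HDFS path is hypothetical):

// Both metadata and generated RDDs can be checkpointed to this directory
ssc.checkpoint("hdfs:///checkpoints/streaming-app")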

21. List the different levels of persistence in Apache Spark.

  • DISK_ONLY - The RDD partitions are stored only on disk
  • MEMORY_ONLY_SER - The RDD is stored as serialised Java objects, with one byte array per partition
  • MEMORY_ONLY - The RDD is stored as deserialised Java objects in the JVM
  • OFF_HEAP - Works like MEMORY_ONLY_SER, except that the data is stored in off-heap memory
  • MEMORY_AND_DISK - The RDD is stored as deserialised Java objects in the JVM; additional partitions are stored on disk if the entire RDD does not fit in memory (usage is sketched below)
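
A persistence level is passed to persist(); a minimal sketch with a hypothetical input path:

import org.apache.spark.storage.StorageLevel

val data = sc.textFile("hdfs:///data/input.txt")

// Keep the RDD in memory and spill partitions that do not fit to disk
data.persist(StorageLevel.MEMORY_AND_DISK)
data.count()   // the first action materialises and caches the RDD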

Aarav Goel

Aarav Goel has top education industry knowledge with 4 years of experience. He is also a passionate blogger who writes about technology.