AWS Certified Data Analytics - Specialty (Big Data on AWS) Quiz Questions and Answers

Answer :
  • Load the data into Spark DataFrames.
  • Use Amazon S3 Select to retrieve the data necessary for the dashboards from the S3 objects.

Explanation :

One of the speed advantages of Apache Spark comes from loading data into immutable DataFrames, which can be accessed repeatedly in memory. Spark DataFrames organize distributed data into columns, which makes summaries and aggregates much quicker to calculate. Also, instead of loading an entire large Amazon S3 object, load only what is needed using Amazon S3 Select. Keeping the data in S3 avoids loading the large dataset into HDFS. A minimal sketch of this pattern is shown below.
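The following sketch shows the mechanics under stated assumptions: the bucket, key, and column names are hypothetical, and a single CSV object is filtered with S3 Select before being cached as a Spark DataFrame (on EMR, the built-in S3 Select Spark connector would push this down for a full dataset).

```python
import csv
import io

import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dashboard-aggregates").getOrCreate()
s3 = boto3.client("s3")

# Amazon S3 Select retrieves only the columns and rows the dashboards need,
# instead of transferring the whole object out of S3.
resp = s3.select_object_content(
    Bucket="example-analytics-bucket",        # hypothetical bucket
    Key="clickstream/2023/part-0000.csv",     # hypothetical key
    ExpressionType="SQL",
    Expression="SELECT s.page, s.duration_ms FROM s3object s",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# Collect the streamed result and load it into an immutable, cached
# Spark DataFrame for repeated in-memory aggregation.
payload = b"".join(
    event["Records"]["Payload"] for event in resp["Payload"] if "Records" in event
)
rows = [r for r in csv.reader(io.StringIO(payload.decode("utf-8"))) if r]
events = spark.createDataFrame(rows, schema=["page", "duration_ms"]).cache()

events.groupBy("page").count().show()
```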
Answer :
  • Include a session identifier in the clickstream data from the Publisher website and use it as the partition key for the stream. Use the Kinesis Client Library (KCL) in the consumer application to retrieve the data from the stream and perform the processing. Deploy the consumer application on Amazon EC2 instances in an EC2 Auto Scaling group. Use an AWS Lambda function to reshard the stream based upon Amazon CloudWatch alarms.

Explanation :

Partitioning by the session ID allows a single processor to process all the actions for a user session in order. An AWS Lambda function can call the UpdateShardCount API action to change the number of shards in the stream, as in the sketch below. The KCL automatically manages the number of record processors to match the number of shards, and Amazon EC2 Auto Scaling ensures that the correct number of instances is running to meet the processing load.
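A minimal sketch of the resharding Lambda handler, assuming the stream name, the doubling policy, and the CloudWatch alarm wiring (for example via SNS) are all hypothetical choices:

```python
import boto3

kinesis = boto3.client("kinesis")

STREAM_NAME = "clickstream"  # hypothetical stream name


def handler(event, context):
    # Look up the current number of open shards.
    summary = kinesis.describe_stream_summary(StreamName=STREAM_NAME)
    open_shards = summary["StreamDescriptionSummary"]["OpenShardCount"]

    # When the CloudWatch alarm signals high IncomingBytes/IncomingRecords,
    # UpdateShardCount uniformly scales the stream to the target shard count.
    kinesis.update_shard_count(
        StreamName=STREAM_NAME,
        TargetShardCount=open_shards * 2,
        ScalingType="UNIFORM_SCALING",
    )
```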
Answer :
  • Publish the raw social media data to an Amazon Kinesis Data Firehose delivery stream. Use Kinesis Data Analytics for SQL Applications to perform a sliding window analysis to compute the metrics and output the results to a Kinesis Data Streams data stream. Configure an AWS Lambda function to save the stream data to an Amazon DynamoDB table. Deploy a real-time dashboard hosted in an Amazon S3 bucket to read and display the metrics data stored in the DynamoDB table.

Explanation :

Amazon Kinesis Data Analytics can query data in a Kinesis Data Firehose delivery stream in near-real time using SQL, and a sliding window analysis is appropriate for determining trends in the stream. Amazon S3 can host a static webpage whose JavaScript reads the data in Amazon DynamoDB and refreshes the dashboard. A sketch of the Lambda function that lands the metrics in DynamoDB follows.
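A minimal sketch of that Lambda function, assuming a Kinesis Data Streams event source mapping; the table name and record fields are hypothetical:

```python
import base64
import json

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("SocialMediaMetrics")  # hypothetical table


def handler(event, context):
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded in the Lambda event.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))

        # Persist each windowed metric so the S3-hosted dashboard can read it.
        table.put_item(
            Item={
                "metric_name": payload["metric_name"],
                "window_end": payload["window_end"],
                "value": str(payload["value"]),
            }
        )
```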
Answer :
  • Use the Relationalize class in an AWS Glue ETL job to transform the data and write the data back to Amazon S3. Use Amazon Redshift Spectrum to create external tables and join with the internal tables.

Explanation :

The Relationalize PySpark transform can be used to flatten the nested data into a structured format. Amazon Redshift Spectrum can then join the external tables with the cluster's internal tables and query the transformed clickstream data in place, rather than needing to scale the cluster to accommodate the large dataset. See the sketch below.
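A minimal sketch of the Glue job, assuming the Data Catalog database, table, staging path, and output bucket names are hypothetical:

```python
from awsglue.context import GlueContext
from awsglue.transforms import Relationalize
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

clickstream = glue_context.create_dynamic_frame.from_catalog(
    database="clickstream_db", table_name="raw_events"
)

# Relationalize flattens the nested JSON into a collection of flat
# DynamicFrames: the root frame plus one frame per nested array.
flattened = Relationalize.apply(
    frame=clickstream,
    staging_path="s3://example-glue-temp/relationalize/",
    name="root",
)

# Write each resulting frame back to S3 so Redshift Spectrum external
# tables can be defined over the flattened data.
for frame_name in flattened.keys():
    glue_context.write_dynamic_frame.from_options(
        frame=flattened.select(frame_name),
        connection_type="s3",
        connection_options={"path": f"s3://example-curated-bucket/{frame_name}/"},
        format="parquet",
    )
```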
Answer :
  • Store the file in Amazon S3 and store the object key as an attribute in the DynamoDB table.

Explanation :

Use Amazon S3 to store large attribute values that cannot fit in an Amazon DynamoDB item (items are limited to 400 KB). Store each file as an object in Amazon S3 and then store the object key in the DynamoDB item, as sketched below.
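A minimal sketch of the pattern; the bucket, table, and attribute names are hypothetical:

```python
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("Documents")  # hypothetical table

bucket = "example-document-bucket"
key = "reports/2023/report-123.pdf"

# Upload the large payload to S3, where the 400 KB item-size limit does not apply.
s3.upload_file("report-123.pdf", bucket, key)

# Store only the pointer to the object in the DynamoDB item.
table.put_item(
    Item={
        "document_id": "report-123",
        "s3_bucket": bucket,
        "s3_key": key,
    }
)

# Readers fetch the item, then retrieve the file from S3 using the stored key.
item = table.get_item(Key={"document_id": "report-123"})["Item"]
obj = s3.get_object(Bucket=item["s3_bucket"], Key=item["s3_key"])
```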
Answer :
  • Reduce the propagation delay by overriding the KCL default settings.

Explanation :

The KCL defaults follow the best practice of polling every 1 second, which results in average propagation delays that are typically below 1 second. Overriding the KCL default settings to poll more frequently reduces the propagation delay further, at the cost of additional GetRecords calls and a higher risk of throttling.
Answer :
  • Create an Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster and ingest the data for each order into a topic. Use a Kafka consumer running on Amazon EC2 instances to read these messages and invoke the Amazon SageMaker endpoint.

Explanation :

An Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster can deliver the messages with very low latency. Kafka's maximum message size is configurable, so the default limit of roughly 1 MB can be raised to handle the 1.5 MB payloads, as in the sketch below.
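A minimal sketch of a producer and consumer sized for the larger payloads, using the kafka-python client; the broker addresses, topic name, and SageMaker endpoint name are hypothetical, and the topic's max.message.bytes would also need to be raised above the default:

```python
import json

import boto3
from kafka import KafkaConsumer, KafkaProducer

BROKERS = ["b-1.example-msk:9092"]  # hypothetical MSK bootstrap brokers

# Producer side: allow requests larger than the ~1 MB client default.
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    max_request_size=2 * 1024 * 1024,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("orders", {"order_id": "123", "payload": "..."})
producer.flush()

# Consumer on EC2: read each order and invoke the SageMaker endpoint.
runtime = boto3.client("sagemaker-runtime")
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers=BROKERS,
    fetch_max_bytes=2 * 1024 * 1024,
    max_partition_fetch_bytes=2 * 1024 * 1024,
)
for message in consumer:
    runtime.invoke_endpoint(
        EndpointName="fraud-detection",  # hypothetical endpoint
        ContentType="application/json",
        Body=message.value,
    )
```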
Answer :
  • Schedule the dataset to refresh daily.

Explanation :

Datasets created using Amazon S3 as the data source are automatically imported into SPICE. The Amazon QuickSight console allows for the refresh of SPICE data on a schedule.
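As a hedged illustration of what the scheduled refresh does, a SPICE refresh can also be triggered programmatically with the CreateIngestion API; the account ID and dataset ID below are hypothetical:

```python
import uuid

import boto3

quicksight = boto3.client("quicksight")

# Start a SPICE ingestion (refresh) for the dataset; the console's
# scheduled refresh performs the equivalent on a timer.
quicksight.create_ingestion(
    AwsAccountId="111122223333",
    DataSetId="sales-dataset-id",
    IngestionId=str(uuid.uuid4()),  # unique ID for this refresh run
)
```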
Answer :
  • Specify local disk encryption in a security configuration. Re-create the cluster using the newly created security configuration.

Explanation :

Local disk encryption can be enabled as part of a security configuration to encrypt root and storage volumes. Because a security configuration can only be applied when a cluster is created, the existing cluster must be re-created with the new security configuration.
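A minimal sketch of creating such a security configuration and launching the replacement cluster with it; the KMS key ARN, cluster sizing, and names are hypothetical:

```python
import json

import boto3

emr = boto3.client("emr")

# Security configuration enabling at-rest encryption of local disks with KMS.
security_config = {
    "EncryptionConfiguration": {
        "EnableAtRestEncryption": True,
        "EnableInTransitEncryption": False,
        "AtRestEncryptionConfiguration": {
            "LocalDiskEncryptionConfiguration": {
                "EncryptionKeyProviderType": "AwsKms",
                "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/example",
            }
        },
    }
}

emr.create_security_configuration(
    Name="local-disk-encryption",
    SecurityConfiguration=json.dumps(security_config),
)

# Security configurations only apply at launch, so the cluster is re-created.
emr.run_job_flow(
    Name="analytics-cluster",
    ReleaseLabel="emr-6.10.0",
    SecurityConfiguration="local-disk-encryption",
    Instances={
        "InstanceCount": 3,
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```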
Answer :
  • Geospatial chart