The Machine Learning Pipeline on AWS Quiz Questions and Answers

Answer :
  • Evaluate the model using the Receiver Operating Characteristic Curve (ROC)

Explanation :

Amazon ML provides an industry-standard accuracy metric for binary classification models called Area Under the (Receiver Operating Characteristic) Curve (AUC). AUC measures the ability of the model to predict a higher score for positive examples as compared to negative examples. Because it is independent of the score cut-off, you can get a sense of the prediction accuracy of your model from the AUC metric without picking a threshold. The Receiver Operating Characteristic (ROC) curve is a graphical plot that shows the diagnostic ability of a binary classifier system as its discrimination threshold is varied.
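As a concrete illustration, the short sketch below computes AUC and the ROC curve points with scikit-learn for a handful of hypothetical labels and scores; the data values are made up for demonstration only.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Illustrative only: hypothetical ground-truth classes and model scores.
y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3]

# AUC is threshold-independent: it summarizes in a single number how well
# positive examples are ranked above negative examples.
print("AUC =", round(roc_auc_score(y_true, y_score), 3))

# The ROC curve traces true-positive rate vs. false-positive rate as the
# decision threshold is varied.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
```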
Answer :
  • Positive correlation

Explanation :

A correlation coefficient tells you how strong, or how weak, the relationship is between two sets of data. In mathematics, a coefficient is usually the number used to multiply a variable; for example, in the expression 9x, the number 9 is the coefficient. A correlation between two variables or data sets indicates that as one variable changes in value, the other variable tends to change in a specific direction. The correlation coefficient is also called the cross-correlation coefficient, the Pearson correlation coefficient (PCC), or the Pearson product-moment correlation coefficient (PPMCC).
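For illustration, the sketch below computes the Pearson correlation coefficient for two small, hypothetical data sets with NumPy; a value near +1 indicates a positive correlation.

```python
import numpy as np

# Hypothetical data: both variables tend to increase together.
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([52, 55, 61, 68, 74, 80])

r = np.corrcoef(x, y)[0, 1]       # Pearson correlation coefficient
print(f"Pearson r = {r:.3f}")     # close to +1 -> strong positive correlation
```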
Answer :
  • Use Sequence-to-sequence (seq2seq) algorithm with an encoder-decoder architecture

Explanation :

The Sequence-to-Sequence algorithm is a supervised learning algorithm in which the input is a sequence of tokens (for example, text or audio) and the output generated is another sequence of tokens. Example applications include machine translation (input a sentence in one language and predict that sentence in another language), text summarization (input a longer string of words and predict a shorter string of words that summarizes it), and speech-to-text (audio clips converted into output sentences of tokens).
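A hedged sketch of launching the built-in Seq2Seq algorithm with the SageMaker Python SDK follows; the S3 paths, IAM role, and instance type are placeholders, and the training channels assume data already converted to the algorithm's expected format.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerRole"   # placeholder role ARN

# Built-in Seq2Seq container image for the current region
image_uri = image_uris.retrieve("seq2seq", session.boto_region_name)

seq2seq = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    output_path="s3://my-bucket/seq2seq/output",         # placeholder bucket
    sagemaker_session=session,
)
seq2seq.set_hyperparameters(max_seq_len_source=60, max_seq_len_target=60)

# Seq2Seq expects train, validation, and vocab channels (placeholder URIs).
seq2seq.fit({
    "train": "s3://my-bucket/seq2seq/train",
    "validation": "s3://my-bucket/seq2seq/validation",
    "vocab": "s3://my-bucket/seq2seq/vocab",
})
```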
Answer :
  • Model Pruning

Explanation :

Model pruning aims to remove weights that don’t contribute much to the training process. Weights are learnable parameters: they are randomly initialized and optimized during the training process. During the forward pass, data passes through the model. The loss function evaluates model output given the labels; during the backward pass, weights are updated to minimize the loss. To do so, the gradients of the loss with respect to the weights are computed, and each weight receives a different update.
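As an illustration of the idea, the sketch below uses PyTorch's pruning utility to zero out the lowest-magnitude weights of a toy layer; the layer size and pruning fraction are arbitrary choices for demonstration.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(100, 10)      # toy layer with randomly initialized weights

# Zero out the 30% of weights with the smallest absolute value (L1 magnitude),
# i.e. the weights that contribute least to the layer's output.
prune.l1_unstructured(layer, name="weight", amount=0.3)

sparsity = float((layer.weight == 0).sum()) / layer.weight.numel()
print(f"Fraction of weights pruned: {sparsity:.2f}")
```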
Answer :
  • Reuse the Scikit-Learn code in preprocessing data via Inference Pipeline

Explanation :

An inference pipeline is an Amazon SageMaker model that is composed of a linear sequence of two to five containers that process requests for inferences on data. You use an inference pipeline to define and deploy any combination of pre-trained SageMaker built-in algorithms and your own custom algorithms packaged in Docker containers. You can use an inference pipeline to combine preprocessing, predictions, and post-processing data science tasks. Inference pipelines are fully managed.
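The sketch below shows one way this could look with the SageMaker Python SDK: a Scikit-Learn preprocessing container chained in front of an XGBoost model inside a PipelineModel. The model artifacts, entry-point script, role, and endpoint name are placeholders.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.model import Model
from sagemaker.pipeline import PipelineModel
from sagemaker.sklearn.model import SKLearnModel

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerRole"    # placeholder role ARN

# Container 1: reuses the existing Scikit-Learn preprocessing code.
preprocessor = SKLearnModel(
    model_data="s3://my-bucket/sklearn/model.tar.gz",    # placeholder artifact
    role=role,
    entry_point="preprocess.py",                         # placeholder script
    framework_version="1.2-1",
)

# Container 2: the trained model that serves the actual predictions.
xgb = Model(
    image_uri=image_uris.retrieve("xgboost", session.boto_region_name,
                                  version="1.7-1"),
    model_data="s3://my-bucket/xgboost/model.tar.gz",    # placeholder artifact
    role=role,
)

# Requests sent to the endpoint flow through the containers in order.
pipeline = PipelineModel(models=[preprocessor, xgb], role=role)
pipeline.deploy(initial_instance_count=1, instance_type="ml.m5.large",
                endpoint_name="inference-pipeline-demo") # placeholder name
```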
Answer :
  • IP Insights

Explanation :

The Amazon SageMaker IP Insights algorithm uses statistical modeling and neural networks to capture associations between online resources (for example, online bank accounts) and IPv4 addresses. Under the hood, the algorithm learns vector representations for the online resources and the IP addresses such that the vectors are close together when a resource and an IP address have been used together. The algorithm can learn and incorporate many of the latent factors without requiring you to model them explicitly.
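A hedged sketch of training the built-in IP Insights algorithm with the SageMaker Python SDK follows; the bucket, role, and hyperparameter values are placeholders, and the training data is assumed to be a headerless CSV of entity,IPv4 pairs.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerRole"    # placeholder role ARN

ip_insights = Estimator(
    image_uri=image_uris.retrieve("ipinsights", session.boto_region_name),
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    output_path="s3://my-bucket/ipinsights/output",      # placeholder bucket
    sagemaker_session=session,
)
ip_insights.set_hyperparameters(num_entity_vectors=10000, vector_dim=128)

# Headerless CSV rows of "entity_id,ipv4_address" (placeholder URI).
ip_insights.fit({
    "train": TrainingInput("s3://my-bucket/ipinsights/train.csv",
                           content_type="text/csv"),
})
```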
Answer :
  • Train the model using Managed Spot Training and apply a checkpoint configuration.

Explanation :

Managed Spot Training uses Amazon EC2 Spot Instances to run training jobs instead of On-Demand Instances. You can specify which training jobs use Spot Instances and a stopping condition that specifies how long SageMaker waits for a job to run on Spot capacity. Managed Spot Training can reduce the cost of training models by up to 90% compared to On-Demand Instances. SageMaker manages the Spot interruptions on your behalf.
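The sketch below shows how these options could be wired together in the SageMaker Python SDK: Spot capacity is enabled on the estimator and a checkpoint location lets an interrupted job resume. The container image, role, paths, and time limits are placeholders.

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerRole"    # placeholder role ARN

estimator = Estimator(
    image_uri="111122223333.dkr.ecr.us-east-1.amazonaws.com/my-training:latest",
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    use_spot_instances=True,            # run the job on EC2 Spot capacity
    max_run=3600,                       # max training time, in seconds
    max_wait=7200,                      # max wait for Spot capacity (>= max_run)
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",     # resume point after interruption
    output_path="s3://my-bucket/output/",                # placeholder bucket
    sagemaker_session=session,
)
estimator.fit({"train": "s3://my-bucket/train/"})        # placeholder channel
```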
Answer :
  • Use supervised learning methods to estimate the missing values for each feature

Explanation :

A common feature-processing step is imputing missing values, for example replacing them with the mean or median of the feature. It is important to understand your data before choosing a strategy for replacing missing values. Using a supervised learning method to approximate the missing values will most likely provide better results. Supervised learning applied to the imputation of missing values is an active field of research.
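As one concrete way to do this, the sketch below uses scikit-learn's IterativeImputer, which fits a regression model for each feature with missing values using the other features as predictors; the toy matrix is made up for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy feature matrix with missing entries (illustrative values only).
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 6.0],
    [5.0, 6.0, 9.0],
    [7.0, 8.0, 12.0],
])

# Each feature with missing values is estimated from the other features.
imputer = IterativeImputer(random_state=0)
print(imputer.fit_transform(X))
```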
Answer :
  • Principal Component Analysis (PCA) algorithm
  • K-means algorithm

Explanation :

Principal Component Analysis (PCA) is an unsupervised machine learning algorithm that attempts to reduce the dimensionality (number of features) of a dataset while retaining as much information as possible. The k-means algorithm attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups.
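The sketch below illustrates the two ideas with scikit-learn on synthetic data: PCA reduces the feature count, and k-means then groups the reduced points; the component and cluster counts are arbitrary choices for demonstration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))      # 200 synthetic samples with 20 features

# Keep the two directions of highest variance.
X_reduced = PCA(n_components=2).fit_transform(X)

# Group the reduced points into three clusters.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)
print(labels[:10])
```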
Answer :
  • Students perform considerably worse the longer they are exposed to social media.

Explanation :

A correlation coefficient tells you how strong, or how weak, the relationship is between two sets of data. In mathematics, a coefficient is usually the number used to multiply a variable; for example, in the expression 9x, the number 9 is the coefficient. A correlation between two variables or data sets indicates that as one variable changes in value, the other variable tends to change in a specific direction.
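For completeness, the sketch below shows what a strong negative correlation looks like numerically, using hypothetical values in which one variable falls as the other rises.

```python
import numpy as np

# Hypothetical data: the second variable decreases as the first increases.
hours_on_social_media = np.array([1, 2, 3, 5, 6, 8])
test_score            = np.array([88, 84, 79, 70, 66, 58])

r = np.corrcoef(hours_on_social_media, test_score)[0, 1]
print(f"Pearson r = {r:.3f}")   # close to -1 -> strong negative correlation
```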