DP-100T01: Designing and Implementing a Data Science Solution on Azure Quiz Questions and Answers

A set of CSV files contains sales records. All the CSV files have the same data schema. Each CSV file contains the sales record for a particular month and has the filename sales.csv. Each file is stored in a folder that indicates the month and year when the data was recorded. The folders are in an Azure blob container for which a datastore has been defined in an Azure Machine Learning workspace. The folders are organized in a parent folder named sales to create the following hierarchical structure:

/sales
    /01-2019
        /sales.csv
    /02-2019
        /sales.csv
    /03-2019
        /sales.csv

At the end of each month, a new folder with that month's sales file is added to the sales folder. You plan to use the sales data to train a machine learning model based on the following requirements:

• You must define a dataset that loads all sales data to date into a structure that can be easily converted into a data frame.
• You must be able to create experiments that use only data that was created before a specific previous month, ignoring any data that was added after that month.
• You must register the minimum number of datasets possible.

You need to register the sales data as a dataset in the Azure Machine Learning service workspace. What should you do?

Answer :
  • Create a tabular dataset that references the datastore and explicitly specifies each 'sales/mm-yyyy/sales.csv' file. Register the dataset with the name sales_dataset each month as a new version and with a tag named month indicating the month and year it was registered. Use this dataset for all experiments, identifying the version to be used based on the month tag as necessary.

Explanation :

Specify the path. Example: The following code gets the existing workspace and the desired datastore by name, then passes the datastore and file locations to the path parameter to create a new TabularDataset, weather_ds.

    from azureml.core import Workspace, Datastore, Dataset

    datastore_name = 'your datastore name'

    # get existing workspace
    workspace = Workspace.from_config()

    # retrieve an existing datastore in the workspace by name
    datastore = Datastore.get(workspace, datastore_name)

    # create a TabularDataset from 3 file paths in datastore
    datastore_paths = [(datastore, 'weather/2018/11.csv'),
                       (datastore, 'weather/2018/12.csv'),
                       (datastore, 'weather/2019/*.csv')]
    weather_ds = Dataset.Tabular.from_delimited_files(path=datastore_paths)
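The requirement to use only data recorded before a given month comes down to selecting a subset of the mm-yyyy folders. The following is a minimal sketch in plain Python, assuming the folder names follow the mm-yyyy pattern from the question. The helpers parse_folder and paths_before and the cutoff format are hypothetical and are not part of the Azure Machine Learning SDK.

```python
from datetime import date


def parse_folder(name):
    """Turn a folder name such as '02-2019' into the first day of that month."""
    month, year = name.split("-")
    return date(int(year), int(month), 1)


def paths_before(folders, cutoff):
    """Return sales.csv paths for folders dated strictly before `cutoff` (mm-yyyy)."""
    limit = parse_folder(cutoff)
    selected = sorted((f for f in folders if parse_folder(f) < limit), key=parse_folder)
    return [f"sales/{f}/sales.csv" for f in selected]


# Hypothetical folder listing, deliberately out of order
folders = ["03-2019", "01-2019", "02-2019"]
print(paths_before(folders, "03-2019"))
# → ['sales/01-2019/sales.csv', 'sales/02-2019/sales.csv']
```

Each returned path could then be paired with the datastore object, in the same way as the datastore_paths list in the example above.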

You are creating a new experiment in Azure Machine Learning Studio. One class has a much smaller number of observations than the other classes in the training set. You need to select an appropriate data sampling strategy to compensate for the class imbalance. Solution: You use the Scale and Reduce sampling mode. Does the solution meet the goal?

Answer :
  • No

Explanation :

Instead, use the Synthetic Minority Oversampling Technique (SMOTE) sampling mode.

Note: SMOTE is used to increase the number of underrepresented cases in a dataset used for machine learning. SMOTE is a better way of increasing the number of rare cases than simply duplicating existing cases.

Incorrect Answers: Common data tasks for the Scale and Reduce sampling mode include clipping, binning, and normalizing numerical values.

Reference:
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/smote
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/data-transformation-scale-and-reduce
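To illustrate why SMOTE differs from duplicating existing cases, here is a rough sketch of its core idea: each synthetic sample is placed on the line segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration in plain Python, not the Studio module's implementation. The function smote_sketch, its parameters, and the sample points are all hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbours of `base` among the other minority samples
        others = minority[:i] + minority[i + 1:]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Made-up minority-class feature vectors
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (3.0, 3.0)]
new_points = smote_sketch(minority, n_new=6)
print(len(minority) + len(new_points))  # → 10
```

Because every new point is interpolated rather than copied, the minority class grows without adding exact duplicates of the original rows.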

You create an Azure Machine Learning workspace. You are preparing a local Python environment on a laptop computer. You want to use the laptop to connect to the workspace and run experiments. You create the following config.json file:

{ "workspace_name": "ml-workspace" }

You must use the Azure Machine Learning SDK to interact with data and experiments in the workspace. You need to configure the config.json file to connect to the workspace from the Python environment. Which two additional parameters must you add to the config.json file in order to connect to the workspace?

Answer :
  • resource_group
  • subscription_id

Explanation :

To use the same workspace in multiple environments, create a JSON configuration file. The configuration file saves your subscription (subscription_id), resource group (resource_group), and workspace name so that it can be easily loaded. The following sample shows how to create a workspace.

    from azureml.core import Workspace

    ws = Workspace.create(name='myworkspace',
                          subscription_id='<azure-subscription-id>',
                          resource_group='myresourcegroup',
                          create_resource_group=True,
                          location='eastus2')

Reference: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace.workspace
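As a sketch of what the completed configuration file could look like, the snippet below writes a config.json containing all three keys and reads it back. The subscription ID and resource group values are placeholders, and the file is written to a temporary directory purely for illustration.

```python
import json
import os
import tempfile

# Placeholder values; substitute the real subscription and resource group.
config = {
    "subscription_id": "<azure-subscription-id>",
    "resource_group": "myresourcegroup",
    "workspace_name": "ml-workspace",
}

# Write the file to a temporary directory (illustration only)
config_path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(config_path, "w") as f:
    json.dump(config, f, indent=4)

# Read it back to confirm the three expected keys are present
with open(config_path) as f:
    loaded = json.load(f)

print(sorted(loaded))
# → ['resource_group', 'subscription_id', 'workspace_name']
```

With real values in place, Workspace.from_config() can load the file; its path parameter can point to the file's location if it is not in the default search location.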

You plan to build a team data science environment. Data for training models in machine learning pipelines will be over 20 GB. You have the following requirements:

• Models must be built using Caffe2 or Chainer frameworks.
• Data scientists must be able to use a data science environment to build the machine learning pipelines and train models on their personal devices in both connected and disconnected network environments. Personal devices must support updating machine learning pipelines when connected to a network.

You need to select a data science environment. Which environment should you use?

Answer :
  • Azure Machine Learning Service

Explanation :

The Data Science Virtual Machine (DSVM) is a customized VM image on Microsoft’s Azure cloud built specifically for doing data science. Caffe2 and Chainer are supported by DSVM. DSVM integrates with Azure Machine Learning.

Incorrect Answers:
B: Use Machine Learning Studio when you want to experiment with machine learning models quickly and easily, and the built-in machine learning algorithms are sufficient for your solutions.

References: https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/overview

You are working with a time series dataset in Azure Machine Learning Studio. You need to split your dataset into training and testing subsets by using the Split Data module. Which splitting mode should you use?

Answer :
  • Split Rows with the Randomized split parameter set to true

Explanation :

Split Rows: Use this option if you just want to divide the data into two parts. You can specify the percentage of data to put in each split, but by default, the data is divided 50-50.

Incorrect Answers:
B: Regular Expression Split: Choose this option when you want to divide your dataset by testing a single column for a value.
C: Relative Expression Split: Use this option whenever you want to apply a condition to a number column.

Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/split-data
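The effect of the randomized setting on a row split can be sketched in a few lines of plain Python. This is an illustration of the general idea only, not the Split Data module itself. The function split_rows, its parameters, and the sample data are hypothetical.

```python
import random


def split_rows(rows, fraction=0.5, randomized=True, seed=0):
    """Divide rows into two parts; `fraction` of them go to the first part."""
    rows = list(rows)
    if randomized:
        # Shuffle before cutting so both parts draw from the whole range
        random.Random(seed).shuffle(rows)
    cut = round(len(rows) * fraction)
    return rows[:cut], rows[cut:]


observations = list(range(10))  # stand-in for rows ordered by time
first, second = split_rows(observations, fraction=0.7, randomized=False)
print(first, second)
# → [0, 1, 2, 3, 4, 5, 6] [7, 8, 9]
```

In this sketch, randomized=False keeps the rows in their original order, so the first part holds the earliest observations. With randomized=True, the rows are shuffled before the cut, so each part mixes rows from across the whole time range.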