Machine Learning with Big Data Coursera Quiz Answer
Want to make sense of the volumes of data you have collected? Need to incorporate data-driven decisions into your process? This course provides an overview of machine learning techniques to explore, analyze, and leverage data. You will be introduced to tools and algorithms you can use to create machine learning models that learn from data, and to scale those models up to big data problems.
Enroll Now: Machine Learning with Big Data
Quiz 1 – Machine Learning Overview
Q1) What is NOT machine learning?
- Learning from data
- Data-driven decisions
- Discover hidden patterns
- Explicit, step-by-step programming
Q2) Which of the following is NOT a category of machine learning?
- Regression
- Classification
- Cluster Analysis
- Association Analysis
- Algorithm Prediction
Q3) Which categories of machine learning techniques are supervised?
- classification and regression
- regression and association analysis
- classification and cluster analysis
- cluster analysis and association analysis
Q4) In unsupervised approaches,
- the target is unlabeled.
- the target is provided.
- the target is unknown or unavailable.
- the target is what is being predicted.
Q5) What is the sequence of the steps in the machine learning process?
- Acquire -> Prepare -> Analyze -> Report -> Act
- Acquire -> Prepare -> Analyze -> Act -> Report
- Prepare -> Acquire -> Analyze -> Report -> Act
- Prepare -> Acquire -> Analyze -> Act -> Report
Q6) Are the steps in the machine learning process apply-once or iterative?
- Apply-once
- Iterative
- The first two steps, Acquire and Prepare, are apply-once, and the other steps are iterative.
Q7) Phase 2 of CRISP-DM is Data Understanding. In this phase,
- we prepare the data for analysis.
- we define the problem or opportunity to be addressed.
- we acquire as well as explore the data that is related to the problem.
Q8) What is the main difference between KNIME and Spark MLlib?
- KNIME requires programming, while Spark MLlib does not.
- KNIME requires programming in Java, while Spark MLlib requires programming in Python.
- KNIME originated in Germany, while Spark MLlib was created in California, USA.
- KNIME is a graphical user interface-based machine learning tool, while Spark MLlib provides a programming-based distributed platform for scalable machine learning algorithms.
Quiz 2 – Data Exploration
Q1) Which of these statements is true about samples and variables?
- All of these statements are true.
- A sample can have many variables to describe it.
- A sample is an instance or example of an entity in your data.
- A variable describes a specific characteristic of an entity in your data.
Q2) Other names for ‘variable’ are
- categorical, nominal
- feature, column, attribute
- sample, row, observation
- numerical, quantitative
Q3) What is the purpose of exploring data?
- To digitize your data.
- To generate labels for your data.
- To gather your data into one repository.
- To gain a better understanding of your data.
Q4) What are the two main categories of techniques for exploring data? Choose two.
- Histogram
- Outliers
- Trends
- Correlations
- Visualization
- Summary statistics
Q5) Which of the following are NOT examples of summary statistics?
- skewness, kurtosis
- mean, median, mode
- data sources, data locations
- standard deviation, range, variation
Q6) What are the two measures for measuring shape as mentioned in the lecture? Choose two.
- Range
- Mode
- Kurtosis
- Skewness
- Contingency Table
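To make the shape measures above concrete, here is a small stdlib-only Python sketch; the sample data and helper names are invented for illustration:

```python
import statistics

def skewness(xs):
    """Population skewness: the average cubed z-score. Zero for symmetric data."""
    m = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum(((x - m) / sd) ** 3 for x in xs) / len(xs)

def kurtosis(xs):
    """Population kurtosis: the average fourth-power z-score
    (a normal distribution scores about 3; subtract 3 for 'excess' kurtosis)."""
    m = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum(((x - m) / sd) ** 4 for x in xs) / len(xs)

data = [1, 2, 2, 3, 3, 3, 4, 10]  # toy sample with a long right tail
print(statistics.fmean(data), statistics.median(data), statistics.mode(data))  # -> 3.5 3.0 3
print(skewness(data))  # positive -> right-skewed
```

A symmetric sample has skewness 0; the single large value (10) pulls it positive here.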
Q7) Which of the following would NOT be a good reason to use a box plot?
- To show and compare distribution values
- To show correlations between two variables.
- To show data distribution shapes such as asymmetry and skewness.
Q8) All of the following are true about data visualization EXCEPT
- Is useful for communicating results.
- Provides an intuitive way to look at data.
- Is more important than summary statistics for data exploration.
- Should be used with summary statistics for data exploration.
Quiz 3 – Data Exploration in KNIME and Spark
Q1) What is the maximum of the average wind speed measurements at 9am (to 2 decimal places)?
- 5.50
- 4.55
- 23.55
- 29.84
Q2) How many rows containing rain accumulation at 9am measurements have missing values?
- 6
- 4
- 3
- 2
Q3) What is the correlation between the relative humidity at 9am and at 3pm (to 2 decimal places, and without removing or imputing missing values)?
- 0.88
- 1.00
- -0.45
- 0.19
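The quiz value comes from the course's weather dataset, which is not reproduced here. As a generic sketch, the Pearson correlation between two columns can be computed with nothing but the standard library (the data below is made up):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A perfectly linear relationship correlates at 1.0
print(round(pearson([1, 2, 3, 4], [2, 4, 6, 8]), 2))  # -> 1.0
```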
Q4) If the histogram for air temperature at 9am has 50 bins, what is the number of elements in the bin with the most elements (without removing or imputing missing values)?
- 57
- 224
- 49
- 166
Q5) What is the approximate maximum max_wind_direction_9am when the maximum max_wind_speed_9am occurs?
- 70
- 30
- 312
Quiz 4 – Data Preparation
Q1) Which of the following is NOT a data quality issue?
- Scaled data
- Missing values
- Duplicate data
- Inconsistent data
Q2) Imputing missing data means to
- drop samples with missing values.
- replace missing values with outliers.
- merge samples with missing values.
- replace missing values with something reasonable.
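As a minimal sketch of imputation, here is a hypothetical `impute_mean` helper that replaces missing entries (represented as `None`) with the mean of the observed values:

```python
def impute_mean(values):
    """Replace None entries with the mean of the non-missing values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

print(impute_mean([1.0, None, 3.0, None, 5.0]))  # -> [1.0, 3.0, 3.0, 3.0, 5.0]
```

Mean imputation is only one "reasonable" choice; the median, the mode, or a domain-informed value are common alternatives.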
Q3) A data sample with values that are considerably different than the rest of the other data samples in the dataset is called an/a _____________.
- Noise
- Outlier
- Invalid data
- Inconsistent data
Q4) Which one of the following examples illustrates the use of domain knowledge to address a data quality issue?
- Drop samples with missing values
- Merge duplicate records while retaining relevant data
- Simply discard the samples that lie significantly outside the distribution of your data
- None of these
Q5) Which of the following is NOT an example of feature selection?
- Removing a feature with a lot of missing values.
- Replacing a missing value with the variable mean.
- Adding an in-state feature based on an applicant’s home state.
- Re-formatting an address field into separate street address, city, state, and zip code fields.
Q6) Which one of the following is the best feature set for your analysis?
- Feature set with the smallest number of features
- Feature set with the largest number of features
- Feature set that contains exclusively re-coded features
- Feature set with the smallest set of features that best capture the characteristics of the data for the intended application
Q7) The mean value and the standard deviation of a zero-normalized feature are
- mean = 0 and standard deviation = 0
- mean = 1 and standard deviation = 0
- mean = 0 and standard deviation = 1
- mean = 1 and standard deviation = 1
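A quick stdlib check of that fact: after zero-normalization (subtract the mean, divide by the standard deviation), the transformed values have mean 0 and standard deviation 1. The helper name and sample values are invented:

```python
import statistics

def zero_normalize(xs):
    """Zero-normalize (standardize): subtract the mean, divide by the std dev."""
    m = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return [(x - m) / sd for x in xs]

z = zero_normalize([10, 20, 30, 40])
print(statistics.fmean(z), statistics.pstdev(z))  # mean ~0, std dev ~1
```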
Q8) Which of the following is NOT true about PCA?
- PCA stands for principal component analysis
- PC1 and PC2, the first and second principal components, respectively, are always orthogonal to each other.
- PC1, the first principal component, captures the largest amount of variance in the data along a single dimension.
- PCA is a dimensionality reduction technique that removes a feature that is very correlated with another feature.
Quiz 5 – Handling Missing Values in KNIME and Spark
Q1) If we remove all missing values from the data, how many air pressure at 9am measurements have values between 911.736 and 914.67?
- 77
- 80
- 287
Q2) If we impute the missing values with the minimum value, how many air temperature at 9am measurements are less than 42.292?
- 28
- 23
- 1
- 5
Q3) How many samples have missing values for air_pressure_9am?
- 3
- 5
- 1092
- 0
Q4) Which column in the weather dataset has the largest number of missing values?
- number
- rain_accumulation_9am
- They are all the same
- air_temp_9am
Q5) When we remove all the missing values from the dataset, the number of rows is 1064, yet the variable with the most missing values has 1089 rows. Why did the number of rows decrease so much?
- Because rows with missing values as well as rows with 0s are removed
- Because the missing values in each column are not necessarily in the same row
- Because rows with missing values as well as rows with duplicate values are removed
Quiz 6 – Classification
Q1) Which of the following is a TRUE statement about classification?
- Classification is a supervised task.
- Classification is an unsupervised task.
- In a classification problem, the target variable has only two possible outcomes.
Q2) In which phase are model parameters adjusted?
- Testing phase
- Training phase
- Data preparation phase
- Model parameters are constant throughout the modeling process.
Q3) Which classification algorithm uses a probabilistic approach?
- naive bayes
- decision tree
- k-nearest-neighbors
- none of the above
Q4) What does the ‘k’ stand for in k-nearest-neighbors?
- the number of samples in the dataset
- the number of training datasets
- the number of nearest neighbors to consider in classifying a sample
- the distance between neighbors: All neighboring samples that are ‘k’ distance apart from the sample are considered in classifying that sample.
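To illustrate the role of 'k', here is a tiny pure-Python k-nearest-neighbors classifier (toy data invented for illustration; the course itself uses KNIME and Spark MLlib for this):

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples.
    `train` is a list of (features, label) pairs."""
    neighbors = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1, 1), "low"), ((1, 2), "low"), ((2, 1), "low"),
         ((8, 8), "high"), ((8, 9), "high"), ((9, 8), "high")]
print(knn_classify(train, (2, 2), k=3))  # -> low
```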
Q5) During construction of a decision tree, there are several criteria that can be used to determine when a node should no longer be split into subsets. Which one of the following is NOT applicable?
- The tree depth reaches a maximum threshold.
- All (or X% of) samples have the same class label.
- The value of the Gini index reaches a maximum threshold.
- The number of samples in the node reaches a minimum threshold.
Q6) Which statement is true of tree induction?
- All of these statements are true of tree induction.
- An impurity measure is used to determine the best split for a node.
- You want to split the data in a node into subsets that are as homogeneous as possible
- For each node, splits on all variables are tested to determine the best split for the node.
Q7) What does ‘naive’ mean in Naive Bayes?
- The full Bayes’ Theorem is not used. The ‘naive’ in naive bayes specifies that a simplified version of Bayes’ Theorem is used.
- The Bayes’ Theorem makes estimating the probabilities easier. The ‘naïve’ in the name of classifier comes from this ease of probability calculation.
- The model assumes that the input features are statistically independent of one another. The ‘naïve’ in the name of classifier comes from this naïve assumption.
Q8) The feature independence assumption in Naive Bayes simplifies the classification problem by
- ignoring the prior probabilities altogether.
- assuming that classes are independent of the input features.
- assuming that the prior probabilities of all classes are independent of one another.
- allowing the probability of each feature given the class to be estimated individually.
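The independence assumption can be sketched in a few lines: each per-feature probability is estimated separately from simple counts, and the results are multiplied together with the class prior. The toy weather data and function name below are invented:

```python
from collections import Counter

def nb_scores(data, query):
    """Score each class as P(class) * prod P(feature_i = value | class),
    estimating every per-feature probability independently from counts."""
    classes = Counter(label for _, label in data)
    n = len(data)
    scores = {}
    for c, count in classes.items():
        rows = [f for f, label in data if label == c]
        p = count / n  # class prior
        for i, v in enumerate(query):
            p *= sum(1 for f in rows if f[i] == v) / len(rows)  # per-feature likelihood
        scores[c] = p
    return scores

data = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
        (("rain", "mild"), "yes"), (("rain", "cool"), "yes")]
scores = nb_scores(data, ("rain", "mild"))
print(max(scores, key=scores.get))  # -> yes
```

A production Naive Bayes implementation would also smooth the counts to avoid zero probabilities; this sketch skips that for brevity.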
Quiz 7 – Classification in KNIME and Spark
Q1) KNIME: In configuring the Numeric Binner node, what would happen if the definition for the humidity_low bin is changed from ] -infinity ... 25.0 [ to ] -infinity ... 25.0 ] (i.e., the last bracket is changed from [ to ] )?
- The definition for the humidity_low bin would change from excluding 25.0 to including 25.0
- The definition for the humidity_low bin would change from having 25.0 as the endpoint to having 25.1 as the endpoint
- Nothing would change
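KNIME's bracket notation can be mimicked in plain Python: `] -infinity ... 25.0 [` has an open (exclusive) upper bound, while `] -infinity ... 25.0 ]` closes (includes) it. A hypothetical sketch of the humidity_low bin:

```python
def bin_humidity(value, include_endpoint):
    """Mimic KNIME's Numeric Binner bounds for a hypothetical humidity_low bin:
    '] -infinity ... 25.0 [' excludes 25.0; '] -infinity ... 25.0 ]' includes it."""
    if include_endpoint:
        return "humidity_low" if value <= 25.0 else "humidity_not_low"
    return "humidity_low" if value < 25.0 else "humidity_not_low"

print(bin_humidity(25.0, include_endpoint=False))  # -> humidity_not_low
print(bin_humidity(25.0, include_endpoint=True))   # -> humidity_low
```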
Q2) KNIME: Considering the Numeric Binner node again, what would happen if the “Append new column” box is not checked?
- The relative_humidity_3pm variable will become a categorical variable
- The relative_humidity_3pm variable will become undefined, and an error will occur
- The relative_humidity_3pm variable will remain unchanged, and a new unnamed categorical variable will be created
Q3) KNIME: How many samples had a missing value for air_temp_9am before missing values were addressed?
- 5
- 3
- 0
Q4) KNIME: How many samples were placed in the test set after the dataset was partitioned into training and test sets?
- 213
- 851
- 20
Q5) KNIME: What are the target and predicted class labels for the first sample in the test set?
- Both are humidity_not_low
- Target class label is humidity_not_low, and predicted class label is humidity_low
- Target class label is humidity_low, and predicted class label is humidity_not_low
Q6) Spark: What values are in the number column?
- Integer values starting at 0
- Time and date values
- Random integer values
Q7) Spark: With the original dataset split into 80% for training and 20% for test, how many of the first 20 samples from the test set were correctly classified?
- 19
- 10
- 1
Q8) Spark: If we split the data using 70% for training data and 30% for test data, how many samples would the training set have (using seed 13234)?
- 730
- 334
- 70
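Spark's `randomSplit` samples each row probabilistically, so the exact counts depend on the seed. As a rough stdlib sketch of the same idea, here is a seeded, reproducible 70/30 split over a hypothetical 1064-row dataset (the helper name is invented; a shuffle-and-cut split gives exact fractions, unlike Spark's):

```python
import random

def train_test_split(samples, train_frac=0.7, seed=13234):
    """Seeded shuffle-then-cut split; the fixed seed makes it reproducible."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(list(range(1064)))
print(len(train), len(test))  # -> 744 320
```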
Quiz 8 – Model Evaluation
Q1) A model that generalizes well means that
- The model is overfitting.
- The model performs well on data not used in training.
- The model does a good job of fitting to the noise in the data.
- The model performs well on data used to adjust its parameters.
Q2) What indicates that the model is overfitting?
- High training error and low generalization error
- Low training error and high generalization error
- High training error and high generalization error
- Low training error and low generalization error
Q3) Which method is used to avoid overfitting in decision trees?
- Post-pruning
- None of these
- Pre-pruning
- Pre-pruning and post-pruning
Q4) Which of the following best describes a way to create and use a validation set to avoid overfitting?
- random sub-sampling
- k-fold cross-validation
- leave-one-out cross-validation
- All of these
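The validation-set idea can be sketched with a small index generator for k-fold cross-validation (stdlib only; the function name is invented):

```python
def kfold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation:
    each fold serves as the validation set exactly once."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

for train, val in kfold_indices(6, 3):
    print(val)  # each sample appears in exactly one validation fold
```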
Q5) Which of the following statements is NOT correct?
- The test set is used for model selection to avoid overfitting.
- The training set is used to adjust the parameters of the model.
- The test set is used to evaluate model performance on new data.
- The validation set is used to determine when to stop training the model.
Q6) How is the accuracy rate calculated?
- Add the number of true positives and the number of false negatives.
- Divide the number of true positives by the number of true negatives.
- Divide the number of correct predictions by the total number of predictions
- Subtract the number of correct predictions from the total number of predictions.
Q7) Which evaluation metrics are commonly used for evaluating the performance of a classification model when there is a class imbalance problem?
- precision and recall
- accuracy and error
- precision and error
- precision and accuracy
Q8) How do you determine the classifier accuracy from the confusion matrix?
- Divide the sum of the diagonal values in the confusion matrix by the sum of the off-diagonal values.
- Divide the sum of all the values in the confusion matrix by the total number of samples.
- Divide the sum of the diagonal values in the confusion matrix by the total number of samples.
- Divide the sum of the off-diagonal values in the confusion matrix by the total number of samples.
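Using the numbers from the confusion matrix in Quiz 9's Spark exercise as sample values, the diagonal-over-total rule looks like this in Python:

```python
def accuracy_from_confusion(matrix):
    """Accuracy = sum of the diagonal (correct predictions) / total samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

cm = [[87, 14],
      [26, 83]]  # rows: actual class, columns: predicted class
print(accuracy_from_confusion(cm))  # (87 + 83) / 210
```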
Quiz 9 – Model Evaluation in KNIME and Spark
Q1) KNIME: In the confusion matrix as viewed in the Scorer node, low_humidity_day is:
- the target class label
- the predicted class label
- the only input variable that is categorical
Q2) KNIME: In the confusion matrix, what is the difference between low_humidity_day and Prediction(low_humidity_day)?
- There is no difference. The two are the same
- low_humidity_day is the target class label, and Prediction(low_humidity_day) is the predicted class label
- low_humidity_day is the predicted class label, and Prediction(low_humidity_day) is the target class label
Q3) KNIME: In the Table View of the Interactive Table, each row is color-coded. Blue specifies:
- that the target class label for the sample is humidity_not_low
- that the target class label for the sample is humidity_low
- that the predicted class label for the sample is humidity_not_low
- that the predicted class label for the sample is humidity_low
Q4) KNIME: To change the colors used to color-code each sample in the Table View of the Interactive Table node:
- It is not possible to change these colors
- change the color settings in the Color Manager node
- change the color settings in the Interactive Table dialog
Q5) KNIME: In the Table View of the Interactive Table, the values in RowID are not consecutive because:
- the samples are randomly ordered in the table
- only a few samples from the test set are randomly selected and displayed here
- the RowID values are from the original dataset, and only the test samples are displayed here
Q6) Spark: To get the error rate for the decision tree model, use the following code:
- print ("Error = %g " % (1.0 - accuracy)) [X]
- evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="error")
- error = evaluator.evaluate(1 - predictions)
Q7) Spark: To print out the accuracy as a percentage, use the following code:
- print ("Accuracy = %.2g" % (accuracy * 100)) [X]
- print ("Accuracy = %100g" % (accuracy))
- print ("Accuracy = %100.2g" % (accuracy))
Q8) Spark: In the last line of code in Step 4, the confusion matrix is printed out. If the “transpose()” is removed, the confusion matrix will be displayed as:
- array([[87., 14.], [26., 83.]]) [X]
- array([[83., 26.], [14., 87.]])
- array([[83., 87.], [14., 26.]])
Quiz 10 – Regression, Cluster Analysis, & Association Analysis
Q1) What is the main difference between classification and regression?
- In classification, you’re predicting a number, and in regression, you’re predicting a category.
- In classification, you’re predicting a category, and in regression, you’re predicting a number.
- There is no difference since you’re predicting a numeric value from the input variables in both tasks.
- In classification, you’re predicting a categorical variable, and in regression, you’re predicting a nominal variable.
Q2) Which of the following is NOT an example of regression?
- Predicting the price of a stock
- Estimating the amount of rain
- Predicting the demand for a product
- Determining whether power usage will rise or fall
Q3) In linear regression, the least squares method is used to
- Determine the distance between two pairs of samples.
- Determine whether the target is categorical or numerical.
- Determine the regression line that best fits the samples.
- Determine how to partition the data into training and test sets.
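The least squares fit has a closed form when there is a single input variable. A stdlib sketch with invented points that lie exactly on the line y = 2x + 1:

```python
def least_squares(xs, ys):
    """Closed-form simple linear regression: the slope and intercept that
    minimize the sum of squared residuals."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

slope, intercept = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # -> 2.0 1.0
```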
Q4) How does simple linear regression differ from multiple linear regression?
- They are just different terms for linear regression with one input variable.
- In simple linear regression, the input has only one variable. In multiple linear regression, the input has more than one variable.
- In simple linear regression, the input has only categorical variables. In multiple linear regression, the input has only numerical variables.
- In simple linear regression, the input has only categorical variables. In multiple linear regression, the input can be a mix of categorical and numerical variables.
Q5) The goal of cluster analysis is
- To segment data so that all samples are evenly divided among the clusters.
- To segment data so that all categorical variables are in one cluster, and all numerical variables are in another cluster.
- To segment data so that differences between samples in the same cluster are maximized and differences between samples of different clusters are minimized.
- To segment data so that differences between samples in the same cluster are minimized and differences between samples of different clusters are maximized.
Q6) Cluster results can be used to
- Determine anomalous samples
- Classify new samples
- Create labeled samples for a classification task
- All of these choices are valid uses of the resulting clusters.
- Segment the data into groups so that each group can be analyzed further
Q7) A cluster centroid is
- The mean of all the samples in the cluster
- The mean of all the samples in all clusters
- The mean of all the samples in the two closest clusters.
- The mean of all the samples in the two farthest clusters.
Q8) The main steps in the k-means clustering algorithm are
- Assign each sample to the closest centroid, then calculate the new centroid.
- Calculate the centroids, then determine the appropriate stopping criterion depending on the number of centroids.
- Calculate the distances between the cluster centroids, then find the two closest centroids.
- Count the number of samples, then determine the initial centroids.
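Those two steps can be sketched directly (stdlib only; the toy 2-D samples are invented, and empty clusters are not handled):

```python
import math

def kmeans_step(samples, centroids):
    """One k-means iteration: assign each sample to its closest centroid,
    then recompute each centroid as the mean of its assigned samples."""
    clusters = [[] for _ in centroids]
    for s in samples:
        nearest = min(range(len(centroids)), key=lambda i: math.dist(s, centroids[i]))
        clusters[nearest].append(s)
    return [tuple(sum(dim) / len(c) for dim in zip(*c)) for c in clusters]

samples = [(1, 1), (2, 1), (9, 9), (10, 9)]
print(kmeans_step(samples, [(0, 0), (10, 10)]))  # -> [(1.5, 1.0), (9.5, 9.0)]
```

Real k-means repeats this step until the centroids stop moving (or a maximum iteration count is reached).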
Q9) The goal of association analysis is
- To find the number of outliers in the data
- To find rules to capture associations between items or events
- To find the number of clusters for cluster analysis
- To find the most complex rules to explain associations between as many items as possible in the data.
Q10) In association analysis, an item set is
- A transaction or set of items that occur together
- A set of items that two rules have in common
- A set of items that infrequently occur together
- A set of transactions that occur a certain number of times in the data
Q11) The support of an item set
- Captures the frequency of that item set
- Captures the number of items in that item set
- Captures how many times that item set is used in a rule
- Captures the correlation between the items in that item set
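Support (and the rule confidence covered later in the course) can be computed from raw transactions in a few lines; the basket data below is invented:

```python
def support(transactions, itemset):
    """Support = fraction of transactions containing every item in the set."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Confidence of antecedent -> consequent = support(both) / support(antecedent)."""
    return (support(transactions, set(antecedent) | set(consequent))
            / support(transactions, antecedent))

baskets = [{"bread", "milk"}, {"bread", "diapers", "beer"},
           {"milk", "diapers", "beer"}, {"bread", "milk", "diapers"}]
print(support(baskets, {"bread", "milk"}))          # -> 0.5
print(confidence(baskets, {"diapers"}, {"beer"}))   # 2/3
```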
Quiz 11 – Cluster Analysis in Spark
Q1) What percentage of samples have 0 for rain_accumulation?
- 157812 / 158726 = 99.4%
- 157237 / 158726 = 99.1%
- There is not enough information to determine this
Q2) Why is it necessary to scale the data (Step 4)?
- Since the values of the features are on different scales, all features need to be scaled so that all values will be positive.
- Since the values of the features are on different scales, all features need to be scaled so that no one feature dominates the clustering results.
- Since the values of the features are on different scales, all features need to be scaled so that the cluster centers can be displayed on the same plot for easier analysis.
Q3) If we wanted to create a data subset by taking every 5th sample instead of every 10th sample, how many samples would be in that subset?
- 317,452
- 1,587,257
- 158,726
Q4) This line of code creates a k-means model with 12 clusters:
kmeans = KMeans(k=12, seed=1)
What is the significance of “seed=1”?
- This sets the seed to a specific value, which is necessary to reproduce the k-means results
- This specifies that the first cluster centroid is set to sample #1
- This means that this is the first iteration of k-means. The seed value is incremented by 1 every time k-means is executed
Q5) Just by looking at the values for the cluster centers, which cluster contains samples with the lowest relative humidity?
- Cluster 4
- Cluster 3
- Cluster 9
Q6) What do clusters 7, 8, and 11 have in common?
- They capture weather patterns associated with warm and dry days
- They capture weather patterns associated with high air pressure
- They capture weather patterns associated with very strong winds
Q7) If we perform clustering with 20 clusters (and seed = 1), which cluster appears to identify Santa Ana conditions (lowest humidity and highest wind speeds)?
- Cluster 12
- Cluster 1
- Cluster 16
Q8) We did not include the minimum wind measurements in the analysis since they are highly correlated with the average wind measurements. What is the correlation between min_wind_speed and avg_wind_speed (to two decimals)? (Compute this using one-tenth of the original dataset, and dropping all rows with missing values.)
- 0.97
- -0.12
- 0.62
Q12) Rule confidence is used to
- Identify frequent item sets
- Measure the intuitiveness of a rule
- Determine the rule with the most items
- Prune rules by eliminating rules with low confidence
Conclusion:
We hope this article helps you do well in your Machine Learning with Big Data Coursera quizzes. If you found it helpful, please share it with your friends, and stay with queryfor.com for any kind of exam or quiz answers. We also provide Coursera quiz answers and Course Hero free unlocks.