# Machine Learning All Week Coursera Quiz Answer

Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress towards human-level AI. In this class, you will learn about the most effective machine learning techniques, and gain practice implementing them and getting them to work for yourself. More importantly, you’ll learn about not only the theoretical underpinnings of learning, but also gain the practical know-how needed to quickly and powerfully apply these techniques to new problems. Finally, you’ll learn about some of Silicon Valley’s best practices in innovation as it pertains to machine learning and AI.

This course provides a broad introduction to machine learning, data mining, and statistical pattern recognition. Topics include: (i) Supervised learning (parametric/non-parametric algorithms, support vector machines, kernels, neural networks). (ii) Unsupervised learning (clustering, dimensionality reduction, recommender systems, deep learning). (iii) Best practices in machine learning (bias/variance theory; innovation process in machine learning and AI). The course will also draw from numerous case studies and applications, so that you’ll also learn how to apply learning algorithms to building smart robots (perception, control), text understanding (web search, anti-spam), computer vision, medical informatics, audio, database mining, and other areas.


## Machine Learning All Week Coursera Quiz Answer Week-1

### Introduction

1. A computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E. Suppose we feed a learning algorithm a lot of historical weather data, and have it learn to predict weather. What would be a reasonable choice for P?
•  The probability of it correctly predicting a future date’s weather.
•  The process of the algorithm examining a large amount of historical weather data.
•  None of these.

2. A computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E. Suppose we feed a learning algorithm a lot of historical weather data, and have it learn to predict weather. In this setting, what is T?

•  None of these.
•  The probability of it correctly predicting a future date’s weather.
•  The process of the algorithm examining a large amount of historical weather data.

3. Suppose you are working on weather prediction, and use a learning algorithm to predict tomorrow’s temperature (in degrees Centigrade/Fahrenheit).
Would you treat this as a classification or a regression problem?

•  Regression
•  Classification

4. Suppose you are working on weather prediction, and your weather station makes one of three predictions for each day’s weather: Sunny, Cloudy or Rainy. You’d like to use a learning algorithm to predict tomorrow’s weather.
Would you treat this as a classification or a regression problem?

•  Regression
•  Classification

5. Suppose you are working on stock market prediction, and you would like to predict the price of a particular stock tomorrow (measured in dollars). You want to use a learning algorithm for this.
Would you treat this as a classification or a regression problem?

•  Regression
•  Classification

6. Suppose you are working on stock market prediction. You would like to predict whether or not a certain company will declare bankruptcy within the next 7 days (by training on data of similar companies that had previously been at risk of bankruptcy).
Would you treat this as a classification or a regression problem?

•  Regression
•  Classification

7. Suppose you are working on stock market prediction. Typically, tens of millions of shares of Microsoft stock are traded (i.e., bought/sold) each day. You would like to predict the number of Microsoft shares that will be traded tomorrow.
Would you treat this as a classification or a regression problem?

•  Regression
•  Classification

8. Some of the problems below are best addressed using a supervised learning algorithm, and the others with an unsupervised learning algorithm. Which of the following would you apply supervised learning to? (Select all that apply.) In each case, assume some appropriate dataset is available for your algorithm to learn from.

•  Given historical data of children’s ages and heights, predict children’s height as a function of their age.
•  Given 50 articles written by male authors, and 50 articles written by female authors, learn to predict the gender of a new manuscript’s author (when the identity of this author is unknown).
•  Take a collection of 1000 essays written on the US Economy, and find a way to automatically group these essays into a small number of groups of essays that are somehow “similar” or “related”.
•  Examine a large collection of emails that are known to be spam email, to discover if there are sub-types of spam mail.

9. Some of the problems below are best addressed using a supervised learning algorithm, and the others with an unsupervised learning algorithm. Which of the following would you apply supervised learning to? (Select all that apply.) In each case, assume some appropriate dataset is available for your algorithm to learn from.

•  Given data on how 1000 medical patients respond to an experimental drug (such as effectiveness of the treatment, side effects, etc.), discover whether there are different categories or “types” of patients in terms of how they respond to the drug, and if so what these categories are.
•  Given a large dataset of medical records from patients suffering from heart disease, try to learn whether there might be different clusters of such patients for which we might tailor separate treatments.
•  Have a computer examine an audio clip of a piece of music, and classify whether or not there are vocals (i.e., a human voice singing) in that audio clip, or if it is a clip of only musical instruments (and no vocals).
•  Given genetic (DNA) data from a person, predict the odds of him/her developing diabetes over the next 10 years.

10. Some of the problems below are best addressed using a supervised learning algorithm, and the others with an unsupervised learning algorithm. Which of the following would you apply supervised learning to? (Select all that apply.) In each case, assume some appropriate dataset is available for your algorithm to learn from.

•  Take a collection of 1000 essays written on the US Economy, and find a way to automatically group these essays into a small number of groups of essays that are somehow “similar” or “related”.
•  Given genetic (DNA) data from a person, predict the odds of him/her developing diabetes over the next 10 years.
•  Examine a large collection of emails that are known to be spam email, to discover if there are sub-types of spam mail.
•  Examine the statistics of two football teams, and predict which team will win tomorrow’s match (given historical data of teams’ wins/losses to learn from).

11. Which of these is a reasonable definition of machine learning?

•  Machine learning is the science of programming computers.
•  Machine learning learns from labeled data.
•  Machine learning is the field of allowing robots to act intelligently.
•  Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.

### Linear Regression with One Variable :

1. Consider the problem of predicting how well a student does in her second year of college/university, given how well she did in her first year. Specifically, let x be equal to the number of “A” grades (including A−, A and A+ grades) that a student receives in their first year of college (freshman year). We would like to predict the value of y, which we define as the number of “A” grades they get in their second year (sophomore year).
Here each row is one training example. Recall that in linear regression, our hypothesis is $h_\theta(x) = \theta_0 + \theta_1 x$, and we use $m$ to denote the number of training examples.

For the training set given above (note that this training set may also be referenced in other questions in this quiz), what is the value of $m$? In the box below, please enter your answer (which should be a number between 0 and 10).

2. Many substances that can burn (such as gasoline and alcohol) have a chemical structure based on carbon atoms; for this reason they are called hydrocarbons. A chemist wants to understand how the number of carbon atoms in a molecule affects how much energy is released when that molecule combusts (meaning that it is burned). The chemist obtains the dataset below. In the column on the right, “kJ/mol” is the unit measuring the amount of energy released.

You would like to use linear regression ($h_\theta(x) = \theta_0 + \theta_1 x$) to estimate the amount of energy released (y) as a function of the number of carbon atoms (x). Which of the following do you think will be the values you obtain for $\theta_0$ and $\theta_1$? You should be able to select the right answer without actually implementing linear regression.

•  $\theta_0 = -569.6$, $\theta_1 = 530.9$
•  $\theta_0 = -1780.0$, $\theta_1 = -530.9$
•  $\theta_0 = -569.6$, $\theta_1 = -530.9$
•  $\theta_0 = -1780.0$, $\theta_1 = 530.9$

3. For this question, assume that we are using the training set from Q1.
Recall our definition of the cost function was $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$.
What is $J(0, 1)$? In the box below, please enter your answer.

4. Suppose we set $\theta_0$ and $\theta_1$ in the linear regression hypothesis from Q1. What is the resulting $h_\theta(x)$?

5. Suppose we set $\theta_0 = -2$, $\theta_1 = 0.5$ in the linear regression hypothesis from Q1. What is the resulting $h_\theta(x)$?
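The quantities in the questions above ($m$, $h_\theta$, and $J$) can be computed directly. A minimal NumPy sketch, using made-up training data since the quiz's table is not reproduced here:

```python
import numpy as np

# Hypothetical training set (x = first-year "A" grades, y = second-year "A" grades).
x = np.array([3.0, 1.0, 0.0, 4.0])
y = np.array([2.0, 2.0, 1.0, 3.0])
m = len(x)  # number of training examples

def h(theta0, theta1, x):
    """Linear regression hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

def J(theta0, theta1):
    """Cost function J = (1/2m) * sum of squared errors."""
    return np.sum((h(theta0, theta1, x) - y) ** 2) / (2 * m)

print(m)            # 4
print(J(0.0, 1.0))  # 0.5 for this made-up data
```

With the assumed data above, the errors $h_\theta(x^{(i)}) - y^{(i)}$ are $(1, -1, -1, 1)$, so $J(0,1) = 4 / (2 \cdot 4) = 0.5$.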

6. Let $f$ be some function so that $f(\theta_0, \theta_1)$ outputs a number. For this problem, $f$ is some arbitrary/unknown smooth function (not necessarily the cost function of linear regression, so $f$ may have local optima).
Suppose we use gradient descent to try to minimize $f(\theta_0, \theta_1)$ as a function of $\theta_0$ and $\theta_1$.
Which of the following statements are true? (Check all that apply.)

•  If $\theta_0$ and $\theta_1$ are initialized at the global minimum, then one iteration will not change their values.
•  Setting the learning rate $\alpha$ to be very small is not harmful, and can only speed up the convergence of gradient descent.
•  No matter how $\theta_0$ and $\theta_1$ are initialized, so long as $\alpha$ is sufficiently small, we can safely expect gradient descent to converge to the same solution.
•  If the first few iterations of gradient descent cause $f(\theta_0, \theta_1)$ to increase rather than decrease, then the most likely cause is that we have set the learning rate $\alpha$ to too large a value.
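Two of the statements above can be checked directly on a toy function. A hypothetical sketch using $f(\theta_0, \theta_1) = \theta_0^2 + \theta_1^2$ (not from the quiz; its gradient is $2\theta$):

```python
import numpy as np

def grad_step(theta, alpha):
    """One gradient descent step on f(t0, t1) = t0^2 + t1^2 (gradient = 2*theta)."""
    return theta - alpha * 2 * theta

# Initialized at the global minimum: the gradient is zero there,
# so one iteration does not change the values.
theta = np.zeros(2)
theta_next = grad_step(theta, alpha=0.1)

# Too large a learning rate overshoots and makes f increase instead of decrease.
f = lambda t: np.sum(t ** 2)
t = np.array([1.0, 1.0])
t_big = grad_step(t, alpha=1.5)  # each coordinate becomes 1 - 3 = -2
```

Here `f(t) = 2` before the step and `f(t_big) = 8` after it, illustrating divergence under a too-large $\alpha$.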

7. In the given figure, the cost function $J(\theta_0, \theta_1)$ has been plotted against $\theta_0$ and $\theta_1$, as shown in ‘Plot 2’. The contour plot for the same cost function is given in ‘Plot 1’. Based on the figure, choose the correct options (check all that apply).

•  If we start from point B, gradient descent with a well-chosen learning rate will eventually help us reach at or near point A, as the value of the cost function $J(\theta_0, \theta_1)$ is maximum at point A.
•  If we start from point B, gradient descent with a well-chosen learning rate will eventually help us reach at or near point C, as the value of the cost function $J(\theta_0, \theta_1)$ is minimum at point C.
•  Point P (the global minimum of Plot 2) corresponds to point A of Plot 1.
•  If we start from point B, gradient descent with a well-chosen learning rate will eventually help us reach at or near point A, as the value of the cost function $J(\theta_0, \theta_1)$ is minimum at A.
•  Point P (the global minimum of Plot 2) corresponds to point C of Plot 1.

8. Suppose that for some linear regression problem (say, predicting housing prices as in the lecture), we have some training set, and for our training set we managed to find some $\theta_0$, $\theta_1$ such that $J(\theta_0, \theta_1) = 0$.
Which of the statements below must then be true? (Check all that apply.)

•  Gradient descent is likely to get stuck at a local minimum and fail to find the global minimum.
•  For this to be true, we must have $\theta_0 = 0$ and $\theta_1 = 0$, so that $h_\theta(x) = 0$.
•  For this to be true, we must have $y^{(i)} = 0$ for every value of $i = 1, 2, \ldots, m$.
•  Our training set can be fit perfectly by a straight line, i.e., all of our training examples lie perfectly on some straight line.

### Linear Algebra :

3. Let $x$ be the 4-dimensional vector
$x = \begin{bmatrix} 8\\ 2\\ 5\\ 1 \end{bmatrix}$
What is $2 * x$?

4. Let $x$ be the 4-dimensional vector
$x = \begin{bmatrix} 5\\ 5\\ 2\\ 7 \end{bmatrix}$
What is $2 * x$?

5. Let $u$ be a 3-dimensional vector, where specifically
$u = \begin{bmatrix} 2\\ 1\\ 8 \end{bmatrix}$
What is $u^{T}$?

6. Let $u$ and $v$ be 3-dimensional vectors, where specifically
$u = \begin{bmatrix} 4 \\ -4 \\ -3 \end{bmatrix}$
and
$v = \begin{bmatrix} 4 \\ 2 \\ 4 \end{bmatrix}$
what is $u^{T}v$?
(Hint: $u^{T}$ is a 1×3 dimensional matrix, and $v$ can also be seen as a 3×1 matrix. The answer you want can be obtained by taking the matrix product of $u^{T}$ and $v$.) Do not add brackets to your answer.

-4
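These vector operations can be verified numerically; the course uses Octave, but a NumPy translation for illustration is:

```python
import numpy as np

u = np.array([4, -4, -3])
v = np.array([4, 2, 4])

# u^T v: the inner (dot) product of two 3-vectors, a single number.
utv = u @ v  # 4*4 + (-4)*2 + (-3)*4 = 16 - 8 - 12 = -4

# Scalar multiplication scales every entry, e.g. 2*x for x = [8, 2, 5, 1]^T.
x = np.array([8, 2, 5, 1])
two_x = 2 * x  # [16, 4, 10, 2]
```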

7. Let A and B be 3×3 (square) matrices. Which of the following must necessarily hold true? Check all that apply.

## Machine Learning All Week Coursera Quiz Answer Week-2

### Linear Regression with Multiple Variables

1. Suppose m=4 students have taken some classes, and the class had a midterm exam and a final exam. You have collected a dataset of their scores on the two exams, which is as follows:

You’d like to use polynomial regression to predict a student’s final exam score from their midterm exam score. Concretely, suppose you want to fit a model of the form $h_{\theta}(x) = \theta_{0} + \theta_{1} x_{1} + \theta_{2} x_{2}$, where $x_1$ is the midterm score and $x_2$ is (midterm score)$^2$. Further, you plan to use both feature scaling (dividing by the “max − min”, or range, of a feature) and mean normalization.
What is the normalized feature $x_2^{(4)}$? (Hint: midterm = 69, final = 78 is training example 4.) Please round off your answer to two decimal places and enter it in the text box below.
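The normalization itself is mechanical. A sketch with assumed midterm scores (the quiz's table is not reproduced here; the hint only guarantees that example 4 has midterm = 69):

```python
import numpy as np

# Assumed midterm scores for the four students; only the last (69) comes from the hint.
midterm = np.array([89.0, 72.0, 94.0, 69.0])
x2 = midterm ** 2  # the squared-score feature

# Mean normalization, then scaling by the range (max - min):
x2_norm = (x2 - x2.mean()) / (x2.max() - x2.min())
print(round(float(x2_norm[3]), 2))  # normalized x2 for training example 4
```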

2. You run gradient descent for 15 iterations with $\alpha = 0.3$ and compute $J(\theta)$ after each iteration. You find that the value of $J(\theta)$ decreases slowly and is still decreasing after 15 iterations. Based on this, which of the following conclusions seems most plausible?
•  Rather than use the current value of α, it’d be more promising to try a larger value of α (say α = 1.0).
•  Rather than use the current value of α, it’d be more promising to try a smaller value of α (say α = 0.1).
•  α = 0.3 is an effective choice of learning rate.
3. You run gradient descent for 15 iterations with $\alpha = 0.3$ and compute $J(\theta)$ after each iteration. You find that the value of $J(\theta)$ decreases quickly then levels off. Based on this, which of the following conclusions seems most plausible?
•  Rather than use the current value of α, it’d be more promising to try a larger value of α (say α = 1.0).
•  Rather than use the current value of α, it’d be more promising to try a smaller value of α (say α = 0.1).
•  α = 0.3 is an effective choice of learning rate.
4. Suppose you have m = 23 training examples with n = 5 features (excluding the additional all-ones feature for the intercept term, which you should add). The normal equation is $\theta = (X^{T} X)^{-1}X^{T}y$. For the given values of m and n, what are the dimensions of $\theta$, X, and y in this equation?
•  X is 23 × 5, y is 23 × 1, θ is 5 × 5
•  X is 23 × 6, y is 23 × 6, θ is 6 × 6
•  X is 23 × 6, y is 23 × 1, θ is 6 × 1
X has m rows and n+1 columns (+1 because of the $x_0 = 1$ term), y is an m-vector, and $\theta$ is an (n+1)-vector.
•  X is 23 × 5, y is 23 × 1, θ is 5 × 1
5. Suppose you have a dataset with m = 1000000 examples and n = 200000 features for each example. You want to use multivariate linear regression to fit the parameters $\theta$ to your data. Should you prefer gradient descent or the normal equation?
•  Gradient descent, since $(X^TX)^{-1}$ will be very slow to compute in the normal equation.
With n = 200000 features, you would have to invert a 200001 × 200001 matrix to compute the normal equation. Inverting such a large matrix is computationally expensive, so gradient descent is a good choice.
•  The normal equation, since it provides an efficient way to directly find the solution.
•  The normal equation, since gradient descent might be unable to find the optimal θ.
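For small n, the normal equation is easy to apply directly. A NumPy sketch on a tiny made-up dataset (using `np.linalg.solve` rather than an explicit inverse, which is the standard numerically safer choice):

```python
import numpy as np

# Tiny example: m = 4 examples, n = 1 feature, plus the all-ones intercept column.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])  # lies exactly on y = 1 + 2x

# Normal equation theta = (X^T X)^{-1} X^T y, solved as a linear system:
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # [1., 2.]
```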
6. Which of the following are reasons for using feature scaling?
•  It is necessary to prevent gradient descent from getting stuck in local optima.
The cost function $J(\theta)$ for linear regression has no local optima.
•  It speeds up solving for θ using the normal equation.
The magnitude of the feature values is insignificant in terms of computational cost; feature scaling has nothing to do with matrix inversion.
•  It speeds up gradient descent by making it require fewer iterations to get to a good solution.
Feature scaling speeds up gradient descent by avoiding many extra iterations that would otherwise be required when one or more features take on much larger values than the rest.

### Octave / Matlab Tutorial :

1. Suppose I first execute the following Octave/Matlab commands:
A = [1 2; 3 4; 5 6];
B = [1 2 3; 4 5 6];
Which of the following are then valid commands? Check all that apply. (Hint: A’ denotes the transpose of A.)
•  C = A * B;
•  C = B’ + A;
•  C = A’ * B;
• C = B + A;
2. Let
$A = \begin{bmatrix} 16 & 2 & 3 & 13\\ 5 & 11 & 10 & 8\\ 9 & 7 & 6 & 12\\ 4 & 14 & 15 & 1 \end{bmatrix}$
Which of the following indexing expressions gives
$B = \begin{bmatrix} 16 & 2\\ 5 & 11\\ 9 & 7\\ 4 & 14 \end{bmatrix}?$
Check all that apply.
•  B = A(:, 1:2);
•  B = A(1:4, 1:2);
•  B = A(:, 0:2);
•  B = A(0:4, 0:2);
3. Let A be a 10×10 matrix and x be a 10-element vector. Your friend wants to compute the product Ax and writes the following code:
v = zeros(10, 1);
for i = 1:10
for j = 1:10
v(i) = v(i) + A(i, j) * x(j);
end
end
How would you vectorize this code to run without any for loops? Check all that apply.
•  v = A * x;
•  v = Ax;
•  v = x’ * A;
•  v = sum (A * x);
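The double loop in the question is exactly a matrix-vector product. A NumPy translation for illustration (the quiz itself is about Octave syntax):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 10))
x = rng.standard_normal(10)

# Double-loop version from the question...
v_loop = np.zeros(10)
for i in range(10):
    for j in range(10):
        v_loop[i] += A[i, j] * x[j]

# ...is just the matrix-vector product (v = A * x in Octave):
v_vec = A @ x
```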
4. Say you have two column vectors v and w, each with 7 elements (i.e., they have dimensions 7×1). Consider the following code:
z = 0;
for i = 1:7
z = z + v(i) * w(i)
end
Which of the following vectorizations correctly compute z? Check all that apply.
•  z = sum (v .* w);
•  z = w’ * v;
•  z = v * w’;
•  z = w * v’;
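The loop accumulates an inner product, so both `sum(v .* w)` and `w' * v` compute it; `v * w'` and `w * v'` instead produce 7×7 outer products. Sketched in NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
v = rng.standard_normal(7)
w = rng.standard_normal(7)

# Loop version: z = sum over i of v(i) * w(i)...
z_loop = 0.0
for i in range(7):
    z_loop += v[i] * w[i]

# ...equals both sum(v .* w) and w' * v in Octave:
z1 = np.sum(v * w)
z2 = w @ v
```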
5. In Octave/Matlab, many functions work on single numbers, vectors, and matrices. For example, the sin function when applied to a matrix will return a new matrix with the sin of each element. But you have to be careful, as certain functions have different behavior. Suppose you have a 7×7 matrix X. You want to compute the log of every element, the square of every element, add 1 to every element, and divide every element by 4. You will store the results in four matrices, A, B, C, D. One way to do so is the following code:
for i = 1:7
for j = 1:7
A(i, j) = log(X(i, j));
B(i, j) = X(i, j) ^ 2;
C(i, j) = X(i, j) + 1;
D(i, j) = X(i, j) / 4;
end
end
Which of the following correctly compute A, B, C or D? Check all that apply.
•  C = X + 1;
•  D = X / 4;
•  A = log (X);
•  B = X ^ 2;
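Note that in Octave `X ^ 2` is the matrix power, so only `X .^ 2` squares elementwise; that is why the `B = X ^ 2;` option is wrong. In NumPy, `**` on arrays is already elementwise, so an illustrative translation is:

```python
import numpy as np

X = np.arange(1.0, 50.0).reshape(7, 7)  # any 7x7 matrix with positive entries

A = np.log(X)  # log of every element
B = X ** 2     # elementwise square (Octave: X .^ 2, NOT X ^ 2)
C = X + 1      # add 1 to every element
D = X / 4      # divide every element by 4
```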

## Machine Learning All Week Coursera Quiz Answer Week-3

### Logistic Regression

1. Suppose that you have trained a logistic regression classifier, and it outputs on a new example a prediction $h_\theta(x) = 0.2$. This means (check all that apply):
•  Our estimate for P(y = 1|x; θ) is 0.8.
h(x) gives P(y=1|x; θ), not 1 − P(y=1|x; θ).
•  Our estimate for P(y = 0|x; θ) is 0.8.
Since we must have P(y=0|x; θ) = 1 − P(y=1|x; θ), the former is 1 − 0.2 = 0.8.
•  Our estimate for P(y = 1|x; θ) is 0.2.
h(x) is precisely P(y=1|x; θ), so it is 0.2.
•  Our estimate for P(y = 0|x; θ) is 0.2.
h(x) is P(y=1|x; θ), not P(y=0|x; θ).
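The arithmetic behind these options is just the complement rule applied to the sigmoid output. A minimal sketch:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z}), always in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# h_theta(x) = 0.2 means P(y=1 | x; theta) = 0.2,
# so P(y=0 | x; theta) = 1 - 0.2 = 0.8.
h = 0.2
p_y1 = h
p_y0 = 1.0 - h
```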
2. Suppose you have the following training set, and fit a logistic regression classifier $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$.

Which of the following are true? Check all that apply.
3. For logistic regression, the gradient is given by $\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}_j$. Which of these is a correct gradient descent update for logistic regression with a learning rate of $\alpha$? Check all that apply.
4. Which of the following statements are true? Check all that apply.
5. Suppose you train a logistic classifier $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$. Suppose $\theta_0 = 6$, $\theta_1 = -1$, $\theta_2 = 0$. Which of the following figures represents the decision boundary found by your classifier?
•  Figure:

In this figure, the prediction transitions from positive to negative as $x_1$ increases past 6, since $h_\theta(x) = g(6 - x_1)$, which matches the given values of θ.
•  Figure:
•  Figure:
•  Figure:

### Regularization

1. You are training a classification model with logistic regression. Which of the following statements are true? Check all that apply.
•  Introducing regularization to the model always results in equal or better performance on the training set.
•  Introducing regularization to the model always results in equal or better performance on examples not in the training set.
•  Adding a new feature to the model always results in equal or better performance on the training set.
•  Adding many new features to the model helps prevent overfitting on the training set.

2. Suppose you ran logistic regression twice, once with $\lambda = 0$, and once with $\lambda = 1$. One of the times, you got parameters $\theta = \begin{bmatrix} 74.81\\ 45.05 \end{bmatrix}$, and the other time you got $\theta = \begin{bmatrix} 1.37\\ 0.51 \end{bmatrix}$. However, you forgot which value of $\lambda$ corresponds to which value of $\theta$. Which one do you think corresponds to $\lambda = 1$?

3. Suppose you ran logistic regression twice, once with $\lambda = 0$, and once with $\lambda = 1$. One of the times, you got parameters $\theta = \begin{bmatrix} 81.47\\ 12.69 \end{bmatrix}$, and the other time you got $\theta = \begin{bmatrix} 13.01\\ 0.91 \end{bmatrix}$. However, you forgot which value of $\lambda$ corresponds to which value of $\theta$. Which one do you think corresponds to $\lambda = 1$?

4. Which of the following statements about regularization are true? Check all that apply.

5. Which of the following statements about regularization are true? Check all that apply.

6. In which one of the following figures do you think the hypothesis has overfit the training set?

•  Figure:
•  Figure:
•  Figure:
•  Figure:

7. In which one of the following figures do you think the hypothesis has underfit the training set?

•  Figure:
•  Figure:
•  Figure:
•  Figure:

## Machine Learning All Week Coursera Quiz Answer Week-4

### Neural Networks: Representation

1. Which of the following statements are true? Check all that apply.
•  Any logical function over binary-valued (0 or 1) inputs x1 and x2 can be (approximately) represented using some neural network.

•  A two layer (one input layer, one output layer; no hidden layer) neural network can represent the XOR function.

•  The activation values of the hidden units in a neural network, with the sigmoid activation function applied at every layer, are always in the range (0, 1).

2. Consider the following neural network which takes two binary-valued inputs
$x_1, x_2 \in \{0,1\}$ and outputs $h_\theta(x)$. Which of the following logical functions does it (approximately) compute?
•  AND
This network outputs approximately 1 only when both inputs are 1.

•  NAND (meaning “NOT AND”)

•  OR

•  XOR (exclusive OR)
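Since the network figure did not survive, here is a sketch using the classic AND weights from the lectures, $\theta = [-30, 20, 20]$ (an assumption here, not taken from the quiz figure), which shows why such a network outputs ≈1 only when both inputs are 1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed AND weights: h(x) = g(-30 + 20*x1 + 20*x2).
def h(x1, x2):
    return sigmoid(-30 + 20 * x1 + 20 * x2)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(f"{x1} AND {x2} -> {h(x1, x2):.5f}")
```

Only the (1, 1) input drives the pre-activation above zero (to +10), so only that case produces an output near 1.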
3. Consider the following neural network which takes two binary-valued inputs
$x_1, x_2 \in \{0,1\}$ and outputs $h_\theta(x)$. Which of the following logical functions does it (approximately) compute?
•  AND

•  NAND (meaning “NOT AND”)

•  OR
This network outputs approximately 1 when at least one input is 1.

•  XOR (exclusive OR)
4. Consider the neural network given below. Which of the following equations correctly computes the activation $a_1^{(3)}$? Note: $g(z)$ is the sigmoid activation function.

5. You have the following neural network:

You’d like to compute the activations of the hidden layer $a^{(2)} \in \mathbb{R}^3$. One way to do so is the following Octave code:

You want to have a vectorized implementation of this (i.e., one that does not use for loops). Which of the following implementations correctly compute $a^{(2)}$? Check all that apply.

•  a2 = sigmoid (Theta1 * x);

•  a2 = sigmoid (x * Theta1);

•  a2 = sigmoid (Theta2 * x);

•  z = sigmoid(x); a2 = sigmoid (Theta1 * z);
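The vectorized forward step `a2 = sigmoid(Theta1 * x)` can be sketched in NumPy; the values and shapes below are hypothetical (the quiz's network figure is not shown), assuming x includes the bias unit $x_0 = 1$ and Theta1 maps the input layer to 3 hidden units:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.5, -1.2])  # [x0; x1; x2], hypothetical inputs with bias x0 = 1
Theta1 = np.array([[0.1, 0.2, 0.3],
                   [-0.4, 0.5, 0.6],
                   [0.7, -0.8, 0.9]])  # 3 hidden units x 3 inputs (hypothetical)

# Vectorized hidden-layer activations (a2 = sigmoid(Theta1 * x) in Octave):
a2 = sigmoid(Theta1 @ x)
```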
6. You are using the neural network pictured below and have learned the parameters $\theta^{(1)} = \begin{bmatrix} 1 & 1 & 2.4\\ 1 & 1.7 & 3.2 \end{bmatrix}$ (used to compute $a^{(2)}$) and $\theta^{(2)} = \begin{bmatrix} 1 & 0.3 & -1.2 \end{bmatrix}$ (used to compute $a^{(3)}$ as a function of $a^{(2)}$). Suppose you swap the parameters for the first hidden layer between its two units so $\theta^{(1)} = \begin{bmatrix} 1 & 1.7 & 3.2 \\ 1 & 1 & 2.4 \end{bmatrix}$ and also swap the output layer so $\theta^{(2)} = \begin{bmatrix} 1 & -1.2 & 0.3 \end{bmatrix}$. How will this change the value of the output $h_\theta(x)$?

•  It will increase.

•  It will decrease

•  Insufficient information to tell: it may increase or decrease.

•  It will stay the same.

## Machine Learning All Week Coursera Quiz Answer Week-5

### Neural Networks: Learning

1. You are training a three layer neural network and would like to use backpropagation to compute the gradient of the cost function. In the backpropagation algorithm, one of the steps is to update $\Delta_{ij}^{(2)} := \Delta_{ij}^{(2)} + \delta_i^{(3)} * (a^{(2)})_j$ for every i, j. Which of the following is a correct vectorization of this step?

2. Suppose Theta1 is a 5×3 matrix, and Theta2 is a 4×6 matrix. You set thetaVec = [Theta1(:); Theta2(:)]. Which of the following correctly recovers Theta2?
•  reshape(thetaVec(16 : 39), 4, 6)
This choice is correct, since Theta1 has 15 elements, so Theta2 begins at
index 16 and ends at index 16 + 24 – 1 = 39.

•  reshape(thetaVec(15 : 38), 4, 6)

•  reshape(thetaVec(16 : 24), 4, 6)

•  reshape(thetaVec(15 : 39), 4, 6)

•  reshape(thetaVec(16 : 39), 6, 4)
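The index arithmetic can be checked numerically. A NumPy translation (illustrative; Octave's `Theta(:)` and `reshape` are column-major, which NumPy expresses with `order='F'`, and Octave's 1-based `16:39` becomes Python's 0-based `15:39`):

```python
import numpy as np

Theta1 = np.arange(15.0).reshape(5, 3)           # 5x3 -> 15 elements
Theta2 = np.arange(100.0, 124.0).reshape(4, 6)   # 4x6 -> 24 elements

# Octave's [Theta1(:); Theta2(:)] unrolls column-major:
thetaVec = np.concatenate([Theta1.ravel(order='F'), Theta2.ravel(order='F')])

# Octave indices 16:39 are Python indices 15:39; reshape column-major to 4x6:
Theta2_rec = thetaVec[15:39].reshape((4, 6), order='F')
```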

3. Let $J(\theta) = 2\theta^3 + 2$. Let $\theta = 1$, and $\epsilon = 0.01$. Use the formula $\frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}$ to numerically compute an approximation to the derivative at $\theta = 1$. What value do you get? (When $\theta = 1$, the true/exact derivative is $\frac{\mathrm{d} J(\theta)}{\mathrm{d} \theta} = 6$.)
•  8

•  6.0002

•  6

•  5.9998
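The two-sided difference formula above can be evaluated directly:

```python
# J(theta) = 2*theta^3 + 2, evaluated at theta = 1 with epsilon = 0.01.
J = lambda theta: 2 * theta ** 3 + 2

theta, eps = 1.0, 0.01
approx = (J(theta + eps) - J(theta - eps)) / (2 * eps)
print(round(approx, 4))  # 6.0002, slightly above the exact derivative 6
```

The approximation overshoots slightly because the cubic's third derivative is positive, which is why the answer is 6.0002 rather than exactly 6.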

4. Which of the following statements are true? Check all that apply.
•  For computational efficiency, after we have performed gradient checking to verify that our backpropagation code is correct, we usually disable gradient checking before using backpropagation to train the network.
Checking the gradient numerically is a debugging tool: it helps ensure a correct implementation, but it is too slow to use as a method for actually computing gradients.

•  Computing the gradient of the cost function in a neural network has the same efficiency when we use backpropagation or when we numerically compute it using the method of gradient checking.

•  Using gradient checking can help verify if one’s implementation of backpropagation is bug-free.
If the gradient computed by backpropagation is the same as one computed numerically with gradient checking, this is very strong evidence that you have a correct implementation of backpropagation.

•  Gradient checking is useful if we are using one of the advanced optimization methods (such as in fminunc) as our optimization algorithm. However, it serves little purpose if we are using gradient descent.
5. Which of the following statements are true? Check all that apply.
•  If we are training a neural network using gradient descent, one reasonable “debugging” step to make sure it is working is to plot J(Θ) as a function of the number of iterations, and make sure it is decreasing (or at least non-increasing) after each iteration.
Since gradient descent uses the gradient to take a step toward parameters
with lower cost (ie, lower J(Θ)), the value of J(Θ) should be equal or less at each iteration if the gradient computation is correct and the learning rate is set properly.

•  Suppose you are training a neural network using gradient descent. Depending on your random initialization, your algorithm may converge to different local optima (i.e., if you run the algorithm twice with different random initializations, gradient descent may converge to two different solutions).
The cost function for a neural network is non-convex, so it may have multiple minima. Which minimum you find with gradient descent depends on the initialization.

•  If we initialize all the parameters of a neural network to ones instead of zeros, this will suffice for the purpose of “symmetry breaking” because the parameters are no longer symmetrically equal to zero.

## Machine Learning All Week Coursera Quiz Answer Week-6

### Advice for Applying Machine Learning

1. You train a learning algorithm, and find that it has unacceptably high error on the test set. You plot the learning curve, and obtain the figure below. Is the algorithm suffering from high bias, high variance, or neither?

•  High variance

•  Neither

•  High bias
This learning curve shows high error on both the training and test sets, so the algorithm is suffering from high bias.

2. You train a learning algorithm, and find that it has unacceptably high error on the test set. You plot the learning curve, and obtain the figure below. Is the algorithm suffering from high bias, high variance, or neither?

•  High variance
This learning curve shows high error on the test set but comparatively low error on the training set, so the algorithm is suffering from high variance.

•  Neither

•  High bias

1. Suppose you have implemented regularized logistic regression to classify what object is in an image (i.e., to do object recognition). However, when you test your hypothesis on a new set of images, you find that it makes unacceptably large errors with its predictions on the new images. However, your hypothesis performs well (has low error) on the training set. Which of the following are promising steps to take? Check all that apply.

NOTE:
Since the hypothesis performs well (has low error) on the training set but poorly on new images, it is suffering from high variance (overfitting).

Adding polynomial features will worsen the high variance problem.

•  Use fewer training examples.
Using fewer training examples will worsen the high variance problem.

•  Try using a smaller set of features.
The gap in errors between training and test suggests a high variance problem in which the algorithm has overfit the training set. Reducing the feature set will ameliorate the overfitting and help with the variance problem.

•  Get more training examples.
The gap in errors between training and test suggests a high variance problem in which the algorithm has overfit the training set. Adding more training data makes it harder for the hypothesis to overfit and helps with the variance problem.

•  Try evaluating the hypothesis on a cross validation set rather than the test set.
A cross validation set is useful for choosing the optimal non-model parameters like the regularization parameter λ, but the train / test split is sufficient for debugging problems with the algorithm itself.

•  Try decreasing the regularization parameter λ.
The gap in errors between training and test suggests a high variance problem in which the algorithm has overfit the training set. Decreasing the regularization parameter will increase the overfitting, not decrease it.

•  Try increasing the regularization parameter λ.
The gap in errors between training and test suggests a high variance problem in which the algorithm has overfit the training set. Increasing the regularization parameter will reduce overfitting and help with the variance problem.

1. Suppose you have implemented regularized logistic regression to predict what items customers will purchase on a web shopping site. However, when you test your hypothesis on a new set of customers, you find that it makes unacceptably large errors in its predictions. Furthermore, the hypothesis performs poorly on the training set. Which of the following might be promising steps to take? Check all that apply.

NOTE: Since the hypothesis performs poorly on the training set, it is suffering from high bias (underfitting)

•  Try increasing the regularization parameter λ.
The poor performance on both the training and test sets suggests a high bias problem. Increasing the regularization parameter will allow the hypothesis to fit the data worse, decreasing both training and test set performance.

•  Try decreasing the regularization parameter λ.
Decreasing the regularization parameter will improve the high bias problem and may improve the performance on the training set.

•  Try evaluating the hypothesis on a cross validation set rather than the test set.
You should not use the cross validation set to estimate performance on new examples if it was used to choose the regularization parameter, as the error on it will be artificially low and will not give a good estimate of generalization error.

•  Use fewer training examples.
Using fewer training examples will make the situation worse: it will not solve the high bias problem and may introduce a high variance problem as well.

•  Try adding polynomial features.
The poor performance on both the training and test sets suggests a high bias problem. Adding more complex features will increase the complexity of the hypothesis, thereby improving the fit to both the train and test data.

•  Try using a smaller set of features.
The poor performance on both the training and test sets suggests a high bias problem. Using fewer features will decrease the complexity of the hypothesis and will make the bias problem worse.

•  Try to obtain and use additional features.
The poor performance on both the training and test sets suggests a high bias problem. Using additional features will increase the complexity of the hypothesis, thereby improving the fit to both the train and test data.

1. Which of the following statements are true? Check all that apply.

•  Suppose you are training a regularized linear regression model. The recommended way to choose what value of the regularization parameter λ to use is to choose the value of λ which gives the lowest test set error.
You should not use the test set to choose the regularization parameter, as you will then have an artificially low value for test error and it will not give a good estimate of generalization error.

•  Suppose you are training a regularized linear regression model. The recommended way to choose what value of the regularization parameter λ to use is to choose the value of λ which gives the lowest training set error.
You should not use training error to choose the regularization parameter, as you can always improve training error by using less regularization (a smaller value of λ). But too small a value of λ will not generalize well on the test set.

•  The performance of a learning algorithm on the training set will typically be better than its performance on the test set.
The learning algorithm finds parameters to minimize training set error, so the performance should be better on the training set than the test set.

•  Suppose you are training a regularized linear regression model. The recommended way to choose what value of the regularization parameter λ to use is to choose the value of λ which gives the lowest cross validation error.
Cross validation lets us find the “just right” setting of the regularization parameter given the fixed model parameters learned from the training set.

•  A typical split of a dataset into training, validation and test sets might be 60% training set, 20% validation set, and 20% test set.
This is a good split of the data, as it dedicates the bulk of the data to finding model parameters in training while leaving enough data for cross validation and estimating generalization error.

•  Suppose you are training a logistic regression classifier using polynomial features and want to select what degree polynomial (denoted in the lecture videos) to use. After training the classifier on the entire training set, you decide to use a subset of the training examples as a validation set. This will work just as well as having a validation set that is separate (disjoint) from the training set.
The cross validation set should not be a subset of the training set. The training, cross validation, and test sets should be similar (drawn from the same source) but disjoint.

•  It is okay to use data from the test set to choose the regularization parameter λ, but not the model parameters (θ).
We should not use test set data to choose any parameters, whether the regularization parameter λ or the model parameters θ.

•  Suppose you are using linear regression to predict housing prices, and your dataset comes sorted in order of increasing sizes of houses. It is then important to randomly shuffle the dataset before splitting it into training, validation and test sets, so that we don’t have all the smallest houses going into the training set, and all the largest houses going into the test set.
We should shuffle the data before splitting it into training, cross validation, and test sets.
2. Which of the following statements are true? Check all that apply.

•  A model with more parameters is more prone to overfitting and typically has higher variance.
More model parameters increases the model’s complexity, so it can more tightly fit data in training, increasing the chances of overfitting.

•  If the training and test errors are about the same, adding more features will not help improve the results.
If the training and test errors are about the same, the model is facing a high bias problem, and adding more features will help solve it.

•  If a learning algorithm is suffering from high bias, only adding more training examples may not improve the test error significantly.
To solve a high bias problem, adding more features is useful, but adding more training examples won't help.

•  If a learning algorithm is suffering from high variance, adding more training examples is likely to improve the test error.
Adding more training data helps solve the high variance problem.

•  When debugging learning algorithms, it is useful to plot a learning curve to understand if there is a high bias or high variance problem.
The shape of a learning curve is a good indicator of bias or variance problems with your learning algorithm.

•  If a neural network has much lower training error than test error, then adding more layers will help bring the test error down because we can fit the test set better.
With lower training than test error, the model has high variance. Adding more layers will increase model complexity, making the variance problem worse.

### Machine Learning System Design

1. You are working on a spam classification system using regularized logistic regression. “Spam” is a positive class (y = 1) and “not spam” is the negative class (y = 0). You have trained your classifier and there are m = 1000 examples in the cross-validation set. The chart of predicted class vs. actual class is:

For reference:

Accuracy = (true positives + true negatives) / (total examples)
Precision = (true positives) / (true positives + false positives)
Recall = (true positives) / (true positives + false negatives)
F1 score = (2 * precision * recall) / (precision + recall)

What is the classifier’s F1 score (as a value from 0 to 1)?
Enter your answer in the box below. If necessary, provide at least two values after
the decimal point.
0.16
Precision is 0.087 and recall is 0.85, so F1 score is (2 * precision * recall) /
(precision + recall) = 0.158.
NOTE:
Accuracy = (85 + 10) / (1000) = 0.095
Precision = (85) / (85 + 890) = 0.087
Recall = There are 85 true positives and 15 false negatives, so recall is
85 / (85 + 15) = 0.85.
F1 Score = (2 * (0.087 * 0.85)) / (0.087 + 0.85) = 0.16
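The calculation in the note above can be reproduced directly from the confusion-matrix counts it states; a sketch in plain Python:

```python
# Confusion counts implied by the note above (m = 1000 CV examples).
tp, fn, fp = 85, 15, 890
tn = 1000 - tp - fn - fp            # 10

accuracy = (tp + tn) / 1000         # 0.095
precision = tp / (tp + fp)          # ~0.087
recall = tp / (tp + fn)             # 0.85
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.16
```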

1. Suppose a massive dataset is available for training a learning algorithm. Training on a lot of data is likely to give good performance when two of the following conditions hold true.
Which are the two?

•  We train a learning algorithm with a large number of parameters (that is able to learn/represent fairly complex functions).
You should use a “low bias” algorithm with many parameters, as it will be able to make use of the large dataset provided. If the model has too few
parameters, it will underfit the large training set.

•  The features x contain sufficient information to predict y accurately. (For example, one way to verify this is if a human expert on the domain can confidently predict y when given only x).
It is important that the features contain sufficient information, as otherwise no
amount of data can solve a learning problem in which the features do not
contain enough information to make an accurate prediction.

•  When we are willing to include high order polynomial features of x (such as $x_1^2, x_2^2, x_1x_2$, etc.).
As we saw with neural networks, polynomial features can still be insufficient to capture the complexity of the data, especially if the features are very high-dimensional. Instead, you should use a complex model with many parameters to fit to the large training set.

•  We train a learning algorithm with a small number of parameters (that is thus unlikely to overfit).

•  We train a model that does not use regularization.
Even with a very large dataset, some regularization is still likely to help the algorithm’s performance, so you should use cross-validation to select the appropriate regularization parameter.

•  The classes are not too skewed.
The problem of skewed classes is unrelated to training with large datasets.

•  Our learning algorithm is able to represent fairly complex functions (for example, if we train a neural network or other model with a large number of parameters).
You should use a complex, “low bias” algorithm, as it will be able to make use of the large dataset provided. If the model is too simple, it will underfit the large training set.

•  A human expert on the application domain can confidently predict y when given only the features x (or more generally we have some way to be confident that x contains sufficient information to predict y accurately)
If a human expert can confidently predict y from the features x alone, then x contains enough information for a learning algorithm to predict y accurately.

1. Suppose you have trained a logistic regression classifier which is outputting $h_\theta(x)$.

Currently, you predict 1 if $h_\theta(x) \geq \text{threshold}$, and predict 0 if $h_\theta(x) < \text{threshold}$, where currently the threshold is set to 0.5.

Suppose you increase the threshold to 0.9. Which of the following are true? Check all that apply.
•  The classifier is likely to have unchanged precision and recall, but higher accuracy.

•  The classifier is likely to now have higher recall.

•  The classifier is likely to now have higher precision.
Increasing the threshold means more y = 0 predictions. This will decrease both true and false positives, so precision will increase.

•  The classifier is likely to have unchanged precision and recall, and thus the same F1 score.

•  The classifier is likely to now have lower recall.
Increasing the threshold means more y = 0 predictions. This increase will decrease the number of true positives and increase the number of false negatives, so recall will decrease.

•  The classifier is likely to now have lower precision.

1. Suppose you have trained a logistic regression classifier which is outputting $h_\theta(x)$.

Currently, you predict 1 if $h_\theta(x) \geq \text{threshold}$, and predict 0 if $h_\theta(x) < \text{threshold}$, where currently the threshold is set to 0.5.

Suppose you decrease the threshold to 0.3. Which of the following are true? Check all that apply.
•  The classifier is likely to have unchanged precision and recall, but higher accuracy.

•  The classifier is likely to have unchanged precision and recall, but lower accuracy.

•  The classifier is likely to now have higher recall.
Recall = (true positives) / (true positives + false negatives)
Decreasing the threshold means less y = 0 predictions. This will increase true positives and decrease the number of false negatives, so recall will increase.

•  The classifier is likely to now have higher precision.

•  The classifier is likely to have unchanged precision and recall, and thus the same F1 score.

•  The classifier is likely to now have lower recall.

•  The classifier is likely to now have lower precision.
Lowering the threshold means more y = 1 predictions. This will increase both true and false positives, so precision will decrease.
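The effect of moving the threshold in the two questions above can be checked numerically. A minimal sketch; the scores and labels below are hypothetical, invented for illustration:

```python
# Hypothetical classifier scores h_theta(x) and true labels.
scores = [0.95, 0.90, 0.80, 0.60, 0.55, 0.40, 0.35, 0.20, 0.10, 0.05]
labels = [1,    1,    0,    1,    0,    1,    0,    0,    1,    0]

def precision_recall(threshold):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

p05, r05 = precision_recall(0.5)   # baseline threshold
p09, r09 = precision_recall(0.9)   # raising it: higher precision, lower recall
p03, r03 = precision_recall(0.3)   # lowering it: lower precision, higher recall
assert p09 > p05 and r09 < r05
assert p03 < p05 and r03 > r05
```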

1. Suppose you are working on a spam classifier, where spam emails are positive examples (y = 1) and non-spam emails are negative examples (y = 0). You have a training set of emails in which 99% of the emails are non-spam and the other 1% is spam.

Which of the following statements are true? Check all that apply.
•  A good classifier should have both a high precision and high recall on the cross validation set.
For data with skewed classes like these spam data, we want to achieve a
high F1 score, which requires high precision and high recall.

•  If you always predict non-spam (output y=0), your classifier will have an accuracy of 99%.
Since 99% of the examples are y = 0, always predicting 0 gives an accuracy of 99%. Note, however, that this is not a good spam system, as you will never catch any spam.

•  If you always predict non-spam (output y=0), your classifier will have 99% accuracy on the training set, but it will do much worse on the cross validation set because it has overfit the training data.
The classifier achieves 99% accuracy on the training set because of how skewed the classes are. We can expect that the cross-validation set will be skewed in the same fashion, so the classifier will have approximately the same accuracy.

•  If you always predict non-spam (output y=0), your classifier will have 99% accuracy on the training set, and it will likely perform similarly on the cross validation set.
The classifier achieves 99% accuracy on the training set because of how skewed the classes are. We can expect that the cross-validation set will be skewed in the same fashion, so the classifier will have approximately the same accuracy.
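The always-predict-non-spam baseline above can be verified with a toy calculation; the counts are hypothetical, matching the stated 99%/1% split:

```python
# 1000 emails, 1% spam: always predicting y = 0 scores 99% accuracy
# yet catches no spam at all.
labels = [1] * 10 + [0] * 990
preds = [0] * 1000                      # always predict non-spam

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
recall = tp / 10                        # 10 actual spam emails
print(accuracy, recall)  # 0.99 0.0
```

This is why accuracy alone is misleading on skewed classes, and why the F1 score is preferred.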
2. Which of the following statements are true? Check all that apply.
•  Using a very large training set makes it unlikely for the model to overfit the training data.
A sufficiently large training set will not be overfit, as the model cannot overfit some of the examples without doing poorly on the others.

•  After training a logistic regression classifier, you must use 0.5 as your threshold for predicting whether an example is positive or negative.
You can and should adjust the threshold in logistic regression using cross validation data.

•  If your model is underfitting the training set, then obtaining more data is likely to help.
If the model is underfitting the training data, it has not captured the information in the examples you already have; adding further examples will not help.

•  It is a good idea to spend a lot of time collecting a large amount of data before building your first version of a learning algorithm.
It is not recommended to spend a lot of time collecting a large dataset up front; it is better to build a first version quickly and let error analysis guide where to invest effort.

•  On skewed datasets (e.g., when there are more positive examples than negative examples), accuracy is not a good measure of performance and you should instead use F1 score based on the precision and recall.
You can always achieve high accuracy on skewed datasets by predicting the same output (the most common class) for every input. Thus the F1 score is a better way to measure performance.

•  The “error analysis” process of manually examining the examples which your algorithm got wrong can help suggest what are good steps to take (e.g., developing new features) to improve your algorithm’s performance.
This process of error analysis is crucial in developing high performance learning systems, as the space of possible improvements to your system is very large, and it gives you direction about what to work on next.

## Machine Learning All Week Coursera Quiz Answer Week-7

### Support Vector Machines

1. Suppose you have trained an SVM classifier with a Gaussian kernel, and it learned the following decision boundary on the training set:

You suspect that the SVM is underfitting your dataset. Should you try increasing or decreasing $C$? Increasing or decreasing $\sigma^2$?

1. Suppose you have trained an SVM classifier with a Gaussian kernel, and it learned the following decision boundary on the training set:

When you measure the SVM’s performance on a cross validation set, it does poorly. Should you try increasing or decreasing $C$? Increasing or decreasing $\sigma^2$?

1. The formula for the Gaussian kernel is given by
$\text{similarity}(x, l^{(1)}) = \exp\left(-\frac{\left\| x - l^{(1)} \right\|^2}{2\sigma^2}\right)$.
The figure below shows a plot of $f_1 = \text{similarity}(x, l^{(1)})$ for one value of $\sigma^2$.

Which of the following is a plot of $f_1$ when $\sigma^2 = 0.25$?
•  Figure 1:

•  Figure 3:

•  Figure 4:
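The kernel formula above can be evaluated directly; a smaller $\sigma^2$ makes the similarity fall toward zero more sharply as $x$ moves away from the landmark, which is what distinguishes the candidate figures. A minimal sketch with invented points:

```python
import numpy as np

def gaussian_similarity(x, l, sigma2):
    """f = exp(-||x - l||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma2))

x = np.array([1.0, 0.0])
l = np.array([0.0, 0.0])     # landmark at unit distance from x

wide = gaussian_similarity(x, l, sigma2=1.0)     # exp(-0.5)
narrow = gaussian_similarity(x, l, sigma2=0.25)  # exp(-2.0)
assert narrow < wide         # smaller sigma^2 => narrower bump
```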

1. The SVM solves
$\min_\theta \; C \sum_{i=1}^m \left[ y^{(i)} \, \mathrm{cost}_1(\theta^T x^{(i)}) + (1 - y^{(i)}) \, \mathrm{cost}_0(\theta^T x^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^n \theta_j^2$
where the functions $\mathrm{cost}_0(z)$ and $\mathrm{cost}_1(z)$ look like this:

The first term in the objective is:
$C \sum_{i=1}^m \left[ y^{(i)} \, \mathrm{cost}_1(\theta^T x^{(i)}) + (1 - y^{(i)}) \, \mathrm{cost}_0(\theta^T x^{(i)}) \right]$
This first term will be zero if two of the following four conditions hold true. Which are the two conditions that would guarantee that this term equals zero?
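A minimal sketch of the piecewise-linear cost functions from the lectures shows when the first term vanishes; the function names and sample margins are illustrative:

```python
def cost1(z):
    """Zero when z >= 1; grows linearly below that (used when y = 1)."""
    return max(0.0, 1.0 - z)

def cost0(z):
    """Zero when z <= -1; grows linearly above that (used when y = 0)."""
    return max(0.0, 1.0 + z)

def first_term(C, margins, labels):
    return C * sum(y * cost1(z) + (1 - y) * cost0(z)
                   for z, y in zip(margins, labels))

# Zero when theta^T x >= 1 wherever y = 1 and theta^T x <= -1 wherever y = 0:
assert first_term(1.0, [1.5, -2.0, 1.0], [1, 0, 1]) == 0.0
# Any violated margin makes the term positive:
assert first_term(1.0, [0.5], [1]) > 0.0
```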

1. Suppose you have a dataset with n = 10 features and m = 5000 examples.

After training your logistic regression classifier with gradient descent, you find that it has underfit the training set and does not achieve the desired performance on the training or cross validation sets.

Which of the following might be promising steps to take? Check all that apply.
•  Increase the regularization parameter λ.

•  Use an SVM with a Gaussian Kernel.
By using a Gaussian kernel, your model will have greater complexity and can avoid underfitting the data.

•  Create / add new polynomial features.
When you add more features, you increase the variance of your model, reducing the chances of underfitting.

•  Use an SVM with a linear kernel, without introducing new features.

•  Try using a neural network with a large number of hidden units.
A neural network with many hidden units is a more complex (higher variance) model than logistic regression, so it is less likely to underfit the data.

•  Reduce the number of examples in the training set.
2. Which of the following statements are true? Check all that apply.
•  Suppose you are using SVMs to do multi-class classification and would like to use the one-vs-all approach. If you have K different classes, you will train K-1 different SVMs.

•  If the data are linearly separable, an SVM using a linear kernel will return the same parameters θ regardless of the chosen value of C (i.e., the resulting value θ of does not depend on C).

•  It is important to perform feature normalization before using the Gaussian kernel.
The similarity measure used by the Gaussian kernel expects that the data lie in approximately the same range.

•  If you are training multi-class SVMs with one-vs-all method, it is not possible to use a kernel.

## Machine Learning All Week Coursera Quiz Answer Week-8

### Unsupervised Learning :

1. For which of the following tasks might K-means clustering be a suitable algorithm? Select all that apply.
•  Given a set of news articles from many different news websites, find out what are the main topics covered.
K-means can cluster the articles, and then we can inspect the clusters or use other methods to infer what topic each cluster represents.

•  Given historical weather records, predict if tomorrow’s weather will be sunny or rainy.

•  From the user usage patterns on a website, figure out what different groups of users exist.
We can cluster the users with K-means to find different, distinct groups.

•  Given many emails, you want to determine if they are Spam or Non-Spam emails.

•  Given a database of information about your users, automatically group them into different market segments.
You can use K-means to cluster the database entries, and each cluster will correspond to a different market segment.

•  Given sales data from a large number of products in a supermarket, figure out which products tend to form coherent groups (say are frequently purchased together) and thus should be put on the same shelf.
If you cluster the sales data with K-means, each cluster should correspond to coherent groups of items.

•  Given sales data from a large number of products in a supermarket, estimate future sales for each of these products.

1. Suppose we have three cluster centroids $\mu_1 = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$, $\mu_2 = \begin{bmatrix} -3 \\ 0 \end{bmatrix}$ and $\mu_3 = \begin{bmatrix} 4 \\ 2 \end{bmatrix}$. Furthermore, we have a training example $x^{(i)} = \begin{bmatrix} -1 \\ 2 \end{bmatrix}$. After a cluster assignment step, what will $c^{(i)}$ be?
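The cluster assignment step can be computed directly: assign the example to the centroid with the smallest squared distance. A sketch using the numbers from the question:

```python
import numpy as np

# Centroids and training example from the question above.
mu = np.array([[1.0, 2.0], [-3.0, 0.0], [4.0, 2.0]])
x = np.array([-1.0, 2.0])

# Squared distances to each centroid (1-based index to match mu_1..mu_3).
sq_dists = np.sum((mu - x) ** 2, axis=1)   # [4, 8, 25]
c = int(np.argmin(sq_dists)) + 1
print(c)  # 1
```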

1. K-means is an iterative algorithm, and two of the following steps are repeatedly carried out in its inner-loop. Which two?

•  Using the elbow method to choose K.

•  Feature scaling, to ensure each feature is on a comparable scale to the others.

•  Test on the cross-validation set.

•  Randomly initialize the cluster centroids.

1. Suppose you have an unlabeled dataset $\{x^{(1)}, \ldots, x^{(m)}\}$. You run K-means with 50 different random initializations, and obtain 50 different clusterings of the data.

What is the recommended way for choosing which one of these 50 clusterings to use?
•  Use the elbow method.

•  Plot the data and the cluster centroids, and pick the clustering that gives the most “coherent” cluster centroids.

•  Manually examine the clusterings, and pick the best one.

•  Always pick the final (50th) clustering found, since by that time it is more likely to have converged to a good solution.

•  The answer is ambiguous, and there is no good way of choosing.
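The recommended choice among the options above is the clustering with the lowest distortion (cost) $J$. A sketch, with `distortion` and the toy clusterings invented for illustration:

```python
import numpy as np

def distortion(X, centroids, assignments):
    """K-means cost: mean squared distance to each example's centroid."""
    return np.mean(np.sum((X - centroids[assignments]) ** 2, axis=1))

# Toy data: two tight groups, and two hypothetical clusterings of them.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
good = (np.array([[0.05, 0.0], [5.0, 5.0]]), np.array([0, 0, 1]))
bad = (np.array([[2.0, 2.0], [3.0, 3.0]]), np.array([0, 0, 1]))

# Of many random-initialization runs, keep the lowest-cost clustering:
best = min([good, bad], key=lambda run: distortion(X, *run))
assert best is good
```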

2. Which of the following statements are true? Select all that apply.

•  A good way to initialize K-means is to select K (distinct) examples from the training set and set the cluster centroids equal to these selected examples.
This is the recommended method of initialization.

•  K-Means will always give the same results regardless of the initialization of the centroids.

•  Once an example has been assigned to a particular centroid, it will never be reassigned to a different centroid.

•  For some datasets, the “right” or “correct” value of K (the number of clusters) can be ambiguous, and hard even for a human expert looking carefully at the data to decide.
In many datasets, different choices of K will give different clusterings which appear quite reasonable. With no labels on the data, we cannot say one is better than the other.

•  If we are worried about K-means getting stuck in bad local optima, one way to ameliorate (reduce) this problem is if we try using multiple random initializations.
Since each run of K-means is independent, multiple runs can find different optima, and some should avoid bad local optima.

•  Since K-Means is an unsupervised learning algorithm, it cannot overfit the data, and thus it is always better to have as large a number of clusters as is computationally feasible.

### Principal Component Analysis :

1. Consider the following 2D dataset:

Which of the following figures correspond to possible values that PCA may return for $u^{(1)}$ (the first eigenvector / first principal component)? Check all that apply (you may have to check more than one figure).
•  Figure 1:
The maximal variance is along the y = x line, so this option is correct.

•  Figure 2:
The maximal variance is along the y = x line, so the negative vector along that line is correct for the first principal component.

•  Figure 3:

•  Figure 4:

1. Which of the following is a reasonable way to select the number of principal components k?
(Recall that n is the dimensionality of the input data and m is the number of input examples.)
•  Choose k to be the smallest value so that at least 99% of the variance is retained.
This is correct, as it maintains the structure of the data while maximally reducing its dimension.

•  Choose k to be the smallest value so that at least 1% of the variance is retained.

•  Choose k to be 99% of n (i.e., k = 0.99 ∗ n, rounded to the nearest integer).

•  Choose k to be the largest value so that at least 99% of the variance is retained

•  Use the elbow method.

•  Choose k to be 99% of m (i.e., k = 0.99 ∗ m, rounded to the nearest integer).
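Choosing the smallest k that retains at least 99% of the variance can be sketched with the SVD of the covariance matrix; `choose_k` is a hypothetical helper, not course code, and the toy data is invented:

```python
import numpy as np

def choose_k(X, retain=0.99):
    """Smallest k retaining at least `retain` of the variance."""
    Xc = X - X.mean(axis=0)            # mean normalization
    Sigma = Xc.T @ Xc / len(X)         # covariance matrix
    _, S, _ = np.linalg.svd(Sigma)     # singular values, descending
    ratio = np.cumsum(S) / np.sum(S)   # variance retained by first k
    return int(np.searchsorted(ratio, retain) + 1)

# Toy data where nearly all variance lies along one direction:
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 10, 500), rng.normal(0, 0.1, 500)])
print(choose_k(X))  # 1
```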

1. Suppose someone tells you that they ran PCA in such a way that “95% of the variance was retained.” What is an equivalent statement to this?

1. Which of the following statements are true? Check all that apply.

•  Even if all the input features are on very similar scales, we should still perform mean normalization (so that each feature has zero mean) before running PCA.
If you do not perform mean normalization, PCA will rotate the data in a possibly undesired way.

•  PCA is susceptible to local optima; trying multiple random initializations may help.

•  PCA can be used only to reduce the dimensionality of data by 1 (such as 3D to 2D, or 2D to 1D).

•  If the input features are on very different scales, it is a good idea to perform feature scaling before applying PCA.
Feature scaling prevents one feature dimension from becoming a strong principal component only because of the large magnitude of the feature values (as opposed to large variance on that dimension).

•  Feature scaling is not useful for PCA, since the eigenvector calculation (such as using Octave’s svd(Sigma) routine) takes care of this automatically.
2. Which of the following are recommended applications of PCA? Select all that apply.
•  To get more features to feed into a learning algorithm.

•  Data compression: Reduce the dimension of your data, so that it takes up less memory / disk space.
If memory or disk space is limited, PCA allows you to save space in exchange for losing a little of the data’s information. This can be a reasonable tradeoff.

•  Preventing overfitting: Reduce the number of features (in a supervised learning problem), so that there are fewer parameters to learn.

•  Data visualization: Reduce data to 2D (or 3D) so that it can be plotted.
This is a good use of PCA, as it can give you intuition about your data that would otherwise be impossible to see.

•  Data compression: Reduce the dimension of your input data $x^{(i)}$, which will be used in a supervised learning algorithm (i.e., use PCA so that your supervised learning algorithm runs faster).
If your learning algorithm is too slow because the input dimension is too high, then using PCA to speed it up is a reasonable choice.

•  As a replacement for (or alternative to) linear regression: For most learning applications, PCA and linear regression give substantially similar results.

•  Data visualization: To take 2D data, and find a different way of plotting it in 2D (using k=2)

## Machine Learning All Week Coursera Quiz Answer Week-9

### Anomaly Detection

1. For which of the following problems would anomaly detection be a suitable algorithm?
•  From a large set of primary care patient records, identify individuals who might have unusual health conditions.

Since you are just looking for unusual conditions instead of a particular disease, this is a good application of anomaly detection.

•  Given data from credit card transactions, classify each transaction according to type of purchase (for example: food, transportation, clothing).

•  Given an image of a face, determine whether or not it is the face of a particular famous individual.

•  Given a dataset of credit card transactions, identify unusual transactions to flag them as possibly fraudulent.

By modeling “normal” credit card transactions, you can then use anomaly detection to flag the unusual ones, which might be fraudulent.

•  In a computer chip fabrication plant, identify microchips that might be defective.

The defective chips are the anomalies you are looking for by modeling the properties of non-defective chips.

•  From a large set of hospital patient records, predict which patients have a particular disease (say, the flu).

1. Suppose you have trained an anomaly detection system for fraud detection, and your system flags anomalies when p(x) is less than ε. You find on the cross-validation set that it is missing many fraudulent transactions (i.e., failing to flag them as anomalies). What should you do?
•  Increase ε

By increasing ε, you will flag more anomalies, as desired.

•  Decrease ε

3. Suppose you have trained an anomaly detection system for fraud detection, and your system flags anomalies when p(x) is less than ε. You find on the cross-validation set that it is mis-flagging far too many good transactions as fraudulent. What should you do?
•  Increase ε

•  Decrease ε

By decreasing ε, you will flag fewer anomalies, as desired.

4. Suppose you are developing an anomaly detection system to catch manufacturing defects in airplane engines. Your model uses
$p(x) = \prod_{j=1}^{n} p(x_j;\mu_j,\sigma_j^2)$
You have two features $x_1$ = vibration intensity, and $x_2$ = heat generated. Both $x_1$ and $x_2$ take on values between 0 and 1 (and are strictly greater than 0), and for most “normal” engines you expect that $x_1 \approx x_2$. One of the suspected anomalies is that a flawed engine may vibrate very intensely even without generating much heat (large $x_1$, small $x_2$), even though the particular values of $x_1$ and $x_2$ may not fall outside their typical ranges of values. What additional feature $x_3$ should you create to capture these types of anomalies?
•  $x_3 = \frac{x_1}{x_2}$

This ratio takes on an unusually large value for the flawed engines (large $x_1$ divided by small $x_2$), even when $x_1$ and $x_2$ individually stay within their normal ranges.
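Assuming the intended new feature is the ratio $x_3 = x_1/x_2$, a minimal NumPy sketch with hypothetical engine data shows why this feature makes the flawed engine stand out under the per-feature Gaussian model:

```python
import numpy as np

def gaussian(x, mu, sigma2):
    # Univariate normal density p(x_j; mu_j, sigma_j^2)
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Hypothetical "normal" engines: x1 (vibration) tracks x2 (heat)
rng = np.random.default_rng(0)
x1 = rng.uniform(0.3, 0.7, 1000)
x2 = x1 + rng.normal(0, 0.02, 1000)
x3 = x1 / x2                       # new ratio feature, ~1 for normal engines

mu3, sigma2_3 = x3.mean(), x3.var()

# Flawed engine: strong vibration, little heat -- each feature in range,
# but the ratio is far outside the learned distribution
p_normal = gaussian(0.5 / 0.5, mu3, sigma2_3)   # typical engine, ratio ~ 1
p_flawed = gaussian(0.7 / 0.3, mu3, sigma2_3)   # large x1 / small x2, ratio ~ 2.3
print(p_flawed < p_normal)
```

The flawed engine's ratio sits many standard deviations from the mean, so its density is essentially zero while the typical engine's density stays high.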

5. Which of the following are true? Check all that apply.
•  If you do not have any labeled data (or if all your data has label y = 0), then it is still possible to learn p(x), but it may be harder to evaluate the system or choose a good value of ϵ.

Only negative examples are used in training, but it is good to have some labeled data of both types for cross-validation.

•  If you are developing an anomaly detection system, there is no way to make use of labeled data to improve your system.

•  When choosing features for an anomaly detection system, it is a good idea to look for features that take on unusually large or small values for (mainly the) anomalous examples.

These are good features, as they will lie outside the learned model, so you will have small values for p(x) with these examples.

•  If you have a large labeled training set with many positive examples and many negative examples, the anomaly detection algorithm will likely perform just as well as a supervised learning algorithm such as an SVM.

•  In a typical anomaly detection setting, we have a large number of anomalous examples, and a relatively small number of normal/non-anomalous examples.

•  When developing an anomaly detection system, it is often useful to select an appropriate numerical performance metric to evaluate the effectiveness of the learning algorithm.

You should have a good evaluation metric, so you can evaluate changes to the model such as new features.

•  In anomaly detection, we fit a model p(x) to a set of negative ( y=0) examples, without using any positive examples we may have collected of previously observed anomalies.

We want to model “normal” examples, so we only use negative examples in training.

•  When evaluating an anomaly detection algorithm on the cross validation set (containing some positive and some negative examples), classification accuracy is usually a good evaluation metric to use.
6. You have a 1-D dataset $\{x^{(1)}, \ldots, x^{(m)}\}$ and you want to detect outliers in the dataset. You first plot the dataset and it looks like this:

Suppose you fit the Gaussian distribution parameters $\mu_1$ and $\sigma_1^2$ to this dataset.
Which of the following values for $\mu_1$ and $\sigma_1^2$ might you get?
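For reference, fitting $\mu_1$ and $\sigma_1^2$ is just computing the sample mean and the maximum-likelihood (divide-by-m) variance; a short sketch with made-up 1-D data:

```python
import numpy as np

# Hypothetical 1-D dataset of m examples; 8.0 plays the role of an outlier
x = np.array([2.1, 2.5, 1.9, 2.4, 2.0, 2.2, 8.0])

mu1 = x.mean()                      # maximum-likelihood estimate of mu
sigma2_1 = ((x - mu1) ** 2).mean()  # ML estimate of sigma^2 (divide by m, not m-1)

print(round(mu1, 2), round(sigma2_1, 2))  # 3.01 4.18 for this data
```

A point like 8.0 then gets a very small $p(x)$ under the fitted Gaussian, which is exactly how the outlier is detected.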

### Recommender Systems

1. Suppose you run a bookstore, and have ratings (1 to 5 stars) of books. Your collaborative filtering algorithm has learned a parameter vector $\theta^{(j)}$ for user j, and a feature vector $x^{(i)}$ for each book. You would like to compute the “training error”, meaning the average squared error of your system’s predictions on all the ratings that you have gotten from your users. Which of these are correct ways of doing so (check all that apply)? For this problem, let m be the total number of ratings you have gotten from your users. (Another way of saying this is that $m = \sum_{i=1}^{n_m} \sum_{j=1}^{n_u} r(i,j)$.)
[Hint: Two of the four options below are correct.]

2. In which of the following situations will a collaborative filtering system be the most appropriate learning algorithm (compared to linear or logistic regression)?
•  You manage an online bookstore and you have the book ratings from many users. You want to learn to predict the expected sales volume (number of books sold) as a function of the average rating of a book.

•  You’re an artist and hand-paint portraits for your clients. Each client gets a different portrait (of themselves) and gives you 1-5 star rating feedback, and each client purchases at most 1 portrait. You’d like to predict what rating your next customer will give you.

•  You run an online bookstore and collect the ratings of many users. You want to use this to identify what books are “similar” to each other (i.e., if one user likes a certain book, what are other books that she might also like?)

•  You own a clothing store that sells many styles and brands of jeans. You have collected reviews of the different styles and brands from frequent shoppers, and you want to use these reviews to offer those shoppers discounts on the jeans you think they are most likely to purchase.

•  You’ve written a piece of software that has downloaded news articles from many news websites. In your system, you also keep track of which articles you personally like vs. dislike, and the system also stores away features of these articles (e.g., word counts, name of author). Using this information, you want to build a system to try to find additional new articles that you personally will like.

•  You run an online news aggregator, and for every user, you know some subset of articles that the user likes and some different subset that the user dislikes. You’d want to use this to find other articles that the user likes.

•  You manage an online bookstore and you have the book ratings from many users. For each user, you want to recommend other books she will enjoy, based on her own ratings and the ratings of other users.

3. You run a movie empire, and want to build a movie recommendation system based on collaborative filtering. There were three popular review websites (which we’ll call A, B and C) which users go to in order to rate movies, and you have just acquired all three companies that run these websites. You’d like to merge the three companies’ datasets together to build a single/unified system. On website A, users rank a movie as having 1 through 5 stars. On website B, users rank on a scale of 1 – 10, and decimal values (e.g., 7.5) are allowed. On website C, the ratings are from 1 to 100. You also have enough information to identify users/movies on one website with users/movies on a different website. Which of the following statements is true?
•  You can merge the three datasets into one, but you should first normalize each dataset’s ratings (say rescale each dataset’s ratings to a 0-1 range).

•  You can combine all three training sets into one as long as you perform mean normalization and feature scaling after you merge the data.

•  Assuming that there is at least one movie/user in one database that doesn’t also appear in a second database, there is no sound way to merge the datasets, because of the missing data.

•  It is not possible to combine these websites’ data. You must build three separate recommendation systems.

•  You can merge the three datasets into one, but you should first normalize each dataset separately by subtracting the mean and then dividing by (max – min), where (max – min) is (5 − 1), (10 − 1), or (100 − 1) for the three websites respectively.
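The 0–1 rescaling described in the first option can be sketched as follows (hypothetical ratings; `rescale` is an illustrative helper, not from the course):

```python
def rescale(ratings, lo, hi):
    # Map one site's ratings from its native [lo, hi] scale onto [0, 1]
    return [(r - lo) / (hi - lo) for r in ratings]

site_a = rescale([1, 3, 5], 1, 5)          # 1-5 stars
site_b = rescale([1.0, 5.5, 10.0], 1, 10)  # 1-10, decimals allowed
site_c = rescale([1, 50.5, 100], 1, 100)   # 1-100

print(site_a, site_b, site_c)  # each becomes [0.0, 0.5, 1.0]
```

After this per-dataset rescaling, a middling rating means the same thing on every site, so the three datasets can be merged into one training set.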

4. Which of the following are true of collaborative filtering systems? Check all that apply.

•  Suppose you are writing a recommender system to predict a user’s book preferences. In order to build such a system, you need that user to rate all the other books in your training set.

•  Even if each user has rated only a small fraction of all of your products (so r(i, j) = 0 for the vast majority of (i, j) pairs), you can still build a recommender system by using collaborative filtering.

5. Suppose you have two matrices A and B, where A is 5×3 and B is 3×5. Their product is C = AB, a 5×5 matrix. Furthermore, you have a 5×5 matrix R where every entry is 0 or 1. You want to find the sum of all elements C(i, j) for which the corresponding R(i, j) is 1, and ignore all elements C(i, j) where R(i, j) = 0. One way to do so is the following code:

Which of the following pieces of Octave code will also correctly compute this total?
Check all that apply. Assume all options are in code.
•  total = sum(sum((A * B) .* R))

•  C = (A * B) .* R; total = sum(C(:));

•  total = sum(sum((A * B) * R));

•  C = (A * B) * R; total = sum(C(:));

•  C = A * B; total = sum(sum(C(R == 1)));

•  total = sum(sum(A(R == 1) * B(R == 1)));
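A NumPy translation of the options (a sketch using `@` for Octave's matrix product `*` and `*` for the element-wise `.*`) confirms that the element-wise versions match an explicit loop over the entries where R(i, j) = 1:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))
B = rng.standard_normal((3, 5))
R = (rng.random((5, 5)) < 0.5).astype(float)  # 0/1 mask
C = A @ B

# Reference: explicit double loop over entries where R(i, j) == 1
total_loop = sum(C[i, j] for i in range(5) for j in range(5) if R[i, j] == 1)

total1 = ((A @ B) * R).sum()   # Octave: sum(sum((A * B) .* R))   -- correct
total2 = C[R == 1].sum()       # Octave: sum(sum(C(R == 1)))      -- correct
total3 = ((A @ B) @ R).sum()   # Octave: sum(sum((A * B) * R))    -- wrong in general

print(np.isclose(total1, total_loop), np.isclose(total2, total_loop))
```

The matrix-product variant `(A * B) * R` mixes entries across columns instead of masking them, which is why only the `.*` (element-wise) options are correct.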

## Machine Learning All Week Coursera Quiz Answer Week-10

### Large Scale Machine Learning :

1. Suppose you are training a logistic regression classifier using stochastic gradient descent. You find that the cost (say, $cost(\theta,(x^{(i)},y^{(i)}))$, averaged over the last 500 examples), plotted as a function of the number of iterations, is slowly increasing over time. Which of the following changes are likely to help?
•  Try using a smaller learning rate α.

•  Try averaging the cost over a larger number of examples (say 1000 examples instead of 500) in the plot.

•  This is not an issue, as we expect this to occur with stochastic gradient descent.

•  Try using a larger learning rate α.

•  Use fewer examples from your training set.

•  Try halving (decreasing) the learning rate α, and see if that causes the cost to now consistently go down; and if not, keep halving it until it does.

•  This is not possible with stochastic gradient descent, as it is guaranteed to converge to the optimal parameters θ.

•  Try averaging the cost over a smaller number of examples (say 250 examples instead of 500) in the plot.

2. Which of the following statements about stochastic gradient descent are true? Check all that apply.

•  One of the advantages of stochastic gradient descent is that it can start progress in improving the parameters θ after looking at just a single training example; in contrast, batch gradient descent needs to take a pass over the entire training set before it starts to make progress in improving the parameters’ values.

•  Stochastic gradient descent is particularly well suited to problems with small training set sizes; in these problems, stochastic gradient descent is often preferred to batch gradient descent.

•  In each iteration of stochastic gradient descent, the algorithm needs to examine/use only one training example.

•  Before running stochastic gradient descent, you should randomly shuffle (reorder) the training set.

•  If you have a huge training set, then stochastic gradient descent may be much faster than batch gradient descent.
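The contrast between the two can be sketched for linear regression (a minimal illustration with synthetic data; batch gradient descent touches all m examples per update, while stochastic gradient descent updates after each example):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 2
X = np.c_[np.ones(m), rng.standard_normal(m)]          # intercept column + 1 feature
y = X @ np.array([4.0, 3.0]) + rng.normal(0, 0.1, m)   # true theta = [4, 3]
alpha = 0.01

# Batch gradient descent: each update sums over ALL m examples
theta_b = np.zeros(n)
for _ in range(2000):
    grad = X.T @ (X @ theta_b - y) / m
    theta_b -= alpha * grad

# Stochastic gradient descent: shuffle, then one update per example
theta_s = np.zeros(n)
for i in rng.permutation(m):
    theta_s -= alpha * (X[i] @ theta_s - y[i]) * X[i]

print(np.round(theta_b, 1), np.round(theta_s, 1))
```

Both land near [4, 3], but SGD starts improving θ after the very first example, while batch gradient descent must sweep the entire training set before each single update, which is the source of its advantage on huge datasets.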

3. Which of the following statements about online learning are true? Check all that apply.
•  One of the disadvantages of online learning is that it requires a large amount of computer memory/disk space to store all the training examples we have seen.

•  In the approach to online learning discussed in the lecture video, we repeatedly get a single training example, take one step of stochastic gradient descent using that example, and then move on to the next example.

•  One of the advantages of online learning is that there is no need to pick a learning rate α.

•  When using online learning, in each step we get a new example (x, y), perform one step of (essentially stochastic gradient descent) learning on that example, and then discard that example and move on to the next.

•  When using online learning, you must save every new training example you get, as you will need to reuse past examples to re-train the model even after you get new training examples in the future.

•  Online learning algorithms are most appropriate when we have a fixed training set of size m that we want to train on.

•  One of the advantages of online learning is that if the function we’re modeling changes over time (such as if we are modeling the probability of users clicking on different URLs, and user tastes/preferences are changing over time), the online learning algorithm will automatically adapt to these changes.

•  Online learning algorithms are usually best suited to problems where we have a continuous/non-stop stream of data that we want to learn from.

4. Assuming that you have a very large training set, which of the following algorithms do you think can be parallelized using map-reduce and splitting the training set across different machines? Check all that apply.
•  A neural network trained using batch gradient descent.

•  Linear regression trained using batch gradient descent.

•  An online learning setting, where you repeatedly get a single example (x, y), and want to learn from that single example before moving on.

•  Logistic regression trained using stochastic gradient descent.

•  Logistic regression trained using batch gradient descent.

•  Linear regression trained using stochastic gradient descent.
5. Which of the following statements about map-reduce are true? Check all that apply.
•  When using map-reduce with gradient descent, we usually use a single machine that accumulates the gradients from each of the map-reduce machines, in order to compute the parameter update for that iteration.

•  Because of network latency and other overhead associated with map-reduce, if we run map-reduce using N computers, we might get less than an N-fold speedup compared to using 1 computer.

•  If you have only 1 computer with 1 computing core, then map-reduce is unlikely to help.

•  If we run map-reduce using N computers, then we will always get at least an N-fold speedup compared to using 1 computer.

•  In order to parallelize a learning algorithm using map-reduce, the first step is to figure out how to express the main work done by the algorithm as computing sums of functions of training examples.
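The "express the main work as sums over training examples" idea can be sketched in a single process (hypothetical data; each shard stands in for one map-reduce machine):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, N = 400, 3, 4                    # m examples, n features, N "machines"
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)
theta = rng.standard_normal(n)

# "Map": each machine computes the gradient sum over its own shard
shards = np.array_split(np.arange(m), N)
partials = [X[idx].T @ (X[idx] @ theta - y[idx]) for idx in shards]

# "Reduce": a single machine accumulates the partial sums
grad_mapreduce = sum(partials) / m

# Reference: the same gradient computed on one machine
grad_single = X.T @ (X @ theta - y) / m

print(np.allclose(grad_mapreduce, grad_single))
```

Because the batch gradient is a plain sum over examples, splitting it this way gives exactly the same update; the speedup in practice is less than N-fold only because of network latency and combining overhead.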

## Machine Learning All Week Coursera Quiz Answer Week-11

### Application - Photo OCR

1. Suppose you are running a sliding window detector to find text in images. Your input images are 1000×1000 pixels. You will run your sliding window detector at two scales, 10×10 and 20×20 (i.e., you will run your classifier on lots of 10×10 patches to decide if they contain text or not, and also on lots of 20×20 patches), and you will “step” your detector by 2 pixels each time. About how many times will you end up running your classifier on a single 1000×1000 test set image?
•  250,000

•  500,000
With a stride of 2, you will run your classifier approximately 500 times for each dimension. Since you run the classifier twice (at two scales), you will run it 2 * 500 * 500 = 500,000 times.

•  1,000,000

•  100,000
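The exact count per scale is ((1000 − patch) / stride + 1)² window positions; a quick sketch with the question's numbers lands near the quoted 500,000:

```python
def positions(image, patch, stride):
    # Number of valid window positions along one dimension, squared for 2-D
    per_dim = (image - patch) // stride + 1
    return per_dim ** 2

total = positions(1000, 10, 2) + positions(1000, 20, 2)
print(total)  # 487097, i.e. about 2 * 500 * 500 = 500,000
```

The approximation in the explanation (500 positions per dimension per scale) slightly overcounts, but 500,000 remains the closest option.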

2. Suppose that you just joined a product team that has been developing a machine learning application, using m = 1,000 training examples. You discover that you have the option of hiring additional personnel to help collect and label data. You estimate that you would have to pay each of the labellers \$10 per hour, and that each labeller can label 4 examples per minute. About how much will it cost to hire labellers to label 10,000 new training examples?
•  \$400

One labeller can label 4 × 60 = 240 examples in one hour. It will thus take about 10,000 / 240 ≈ 42 hours to label 10,000 examples. At \$10 an hour, this is roughly \$400.

•  \$600

•  \$10,000

•  \$250
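The cost arithmetic from the explanation, written out as a quick sketch:

```python
rate_per_hour = 10          # dollars per labeller-hour
examples_per_minute = 4
examples_needed = 10_000

examples_per_hour = examples_per_minute * 60      # 240 examples/hour
hours = examples_needed / examples_per_hour       # ~41.7 hours
cost = hours * rate_per_hour
print(round(cost))  # 417 dollars; the closest option is $400
```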

3. What are the benefits of performing a ceiling analysis? Check all that apply.
•  If we have a low-performing component, the ceiling analysis can tell us if that component has a high bias problem or a high variance problem.

•  A ceiling analysis helps us to decide what is the most promising learning algorithm (e.g., logistic regression vs. a neural network vs. an SVM) to apply to a specific component of a machine learning pipeline.

•  It gives us information about which components, if improved, are most likely to have a significant impact on the performance of the final system.
The ceiling analysis gives us this information by comparing the baseline overall system performance with the performance obtained by plugging in ground-truth results for each component of the pipeline.

•  It can help indicate that certain components of a system might not be worth a significant amount of work improving, because even if it had perfect performance its impact on the overall system may be small.
An unpromising component will have little effect on overall performance when it is replaced with ground truth.

•  It is a way of providing additional training data to the algorithm.

•  It helps us decide on allocation of resources in terms of which component in a machine learning pipeline to spend more effort on.
The ceiling analysis reveals which parts of the pipeline have the most room to improve the performance of the overall system.

4. Suppose you are building an object classifier that takes as input an image, and recognizes that image as either containing a car (y = 1) or not (y = 0). For example, here are a positive example and a negative example:

After carefully analyzing the performance of your algorithm, you conclude that you need more positive (y = 1) training examples. Which of the following might be a good way to get additional positive examples?
•  Mirror your training images across the vertical axis (so that a left-facing car now becomes a right-facing one).
A mirrored example is different from the original but equally likely to occur, so mirroring is a good way to generate new data.

•  Take a few images from your training set, and add random, Gaussian noise to every pixel.

•  Take a training example and set a random subset of its pixel to 0 to generate a new example.

•  Select two car images and average them to make a third example.

•  Apply translations, distortions, and rotations to the images already in your training set.
These geometric distortions are likely to occur in real-world images, so they are a good way to generate additional data.

•  Make two copies of each image in the training set; this immediately doubles your training set size.
5. Suppose you have a PhotoOCR system, where you have the following pipeline:

You have decided to perform a ceiling analysis on this system, and find the following:

Which of the following statements are true?
•  There is a large gain in performance possible in improving the character recognition system.
Plugging in ground truth character recognition gives an 18% improvement over running the character recognition system on ground truth character segmentation. Thus there is a good deal of room for overall improvement by improving character recognition.

•  Performing the ceiling analysis shown here requires that we have ground-truth labels for the text detection, character segmentation and the character recognition systems.
At each step, we provide the system with the ground-truth output of the previous step in the pipeline. This requires ground truth for every step of the pipeline.

•  The potential benefit to having a significantly improved text detection system is small, and thus it may not be worth significant effort trying to improve it.
Plugging in ground truth text detection improved the overall system by only 2%, so it is not a good candidate for development effort.

•  The least promising component to work on is the character recognition system, since it is already obtaining 100% accuracy.

•  The most promising component to work on is the text detection system, since it has the lowest performance (72%) and thus the biggest potential gain.

•  We should dedicate significant effort to collecting additional training data for the text detection system.

•  If the text detection system was trained using gradient descent, running gradient descent for more iterations is unlikely to help much.
Plugging in ground truth text detection improved the overall system by only 2%, so even if you could improve text detection performance with more gradient descent iterations, this would have minimal impact on the overall system performance.

•  If we conclude that the character recognition’s errors are mostly due to the character recognition system having high variance, then it may be worth significant effort obtaining additional training data for character recognition.
Since the biggest improvement comes from character recognition ground truth, we would like to improve the performance of that system. If the character recognition system has high variance, additional data will improve its performance.

## Conclusion

We hope this article helps you do well in your Machine Learning Coursera quizzes and assessments. If it helped you even a little, please share it with your friends, and stay with queryfor.com for any kind of exam or quiz answer. We also provide Coursera quiz answers and Coursehero free unlocks.