Coursera ML Notes¶

WEEK 1 - Introduction¶

Welcome¶

• Database mining
• Clickstream data: analyzing web click data has become a huge industry in Silicon Valley
• Applications that can’t be programmed by hand
• Autonomous helicopter: it is very hard to write a program that flies a helicopter; instead, we let it learn to fly “autonomously”.

Supervised Learning & Unsupervised Learning¶

Supervised Learning¶

https://www.coursera.org/learn/machine-learning/supplement/NKVJ0/supervised-learning

In supervised learning, we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output.

Supervised learning problems are categorized into “regression” and “classification” problems. In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. In a classification problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories.

Unsupervised Learning¶

Unsupervised learning allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don’t necessarily know the effect of the variables.

We can derive this structure by clustering the data based on relationships among the variables in the data.

With unsupervised learning there is no feedback based on the prediction results.

Google News clusters news stories, grouping articles that cover the same topic.

Example:

Clustering: Take a collection of 1,000,000 different genes, and find a way to automatically group these genes into groups that are somehow similar or related by different variables, such as lifespan, location, roles, and so on.

Non-clustering: The “Cocktail Party Algorithm” allows you to find structure in a chaotic environment (i.e. identifying individual voices and music from a mesh of sounds at a cocktail party).

Model and Cost Function¶

Our goal is, given a training set, to learn a function h : X → Y so that h(x) is a “good” predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis.

We can measure the accuracy of our hypothesis function by using a cost function. This takes an average difference (actually a fancier version of an average) of all the results of the hypothesis with inputs from x’s and the actual output y’s.

$J(\theta_0, \theta_1) = {\frac 1 {2m}} \sum_{i=1}^m (\hat{y_i} - y_i)^2 = {\frac 1 {2m}} \sum_{i=1}^m (h_\theta(x_i) - y_i)^2$

This function is otherwise called the “Squared error function”, or “Mean squared error”. The mean is halved $${\frac 1 2}$$ as a convenience for the computation of the gradient descent, as the derivative term of the square function will cancel out the $${\frac 1 2}$$ term.
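The cost function above is straightforward to compute directly. Here is a minimal NumPy sketch (the course itself uses Octave; the variable names and toy data are my own):

```python
import numpy as np

def compute_cost(X, y, theta):
    """Squared-error cost J(theta) = 1/(2m) * sum((h(x) - y)^2).

    X is an (m, 2) design matrix with a leading column of ones,
    theta is the parameter vector [theta_0, theta_1].
    """
    m = len(y)
    predictions = X @ theta        # h_theta(x) for every training example
    errors = predictions - y
    return (errors @ errors) / (2 * m)

# Tiny made-up training set: y = 2x exactly, so theta = [0, 2] gives J = 0.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
print(compute_cost(X, y, np.array([0.0, 2.0])))  # 0.0
```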

A contour plot is a graph that contains many contour lines. A contour line of a two variable function has a constant value at all points of the same line.

Gradient Descent¶

We minimize the cost function by taking its derivative (the slope of the tangent line to the function). The slope of the tangent is the derivative at that point, and it gives us a direction to move towards. We make steps down the cost function in the direction of steepest descent. The size of each step is determined by the parameter α, which is called the learning rate.

The gradient descent algorithm is:

repeat until convergence:

$\theta_j := \theta_j - \alpha {\frac \partial {\partial \theta_j}} J(\theta_0,\theta_1)$

where j=0,1 represents the feature index number.

At each iteration j, one should simultaneously update the parameters θ1,θ2,…,θn. Updating a specific parameter prior to calculating another one on the j(th) iteration would yield a wrong implementation.
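The simultaneous update is easy to get right in vectorized form: the whole gradient is computed from the old θ before any component is overwritten. A NumPy sketch with made-up toy data (the course itself uses Octave):

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, iters):
    """Batch gradient descent for linear regression.

    All components of theta are updated simultaneously: the gradient
    is computed from the *old* theta before any component changes.
    """
    m = len(y)
    for _ in range(iters):
        gradient = X.T @ (X @ theta - y) / m  # d/d(theta_j) J, all j at once
        theta = theta - alpha * gradient      # simultaneous update
    return theta

# Toy data generated by y = 2x, so gradient descent should find theta ~ [0, 2].
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = gradient_descent(X, y, np.array([0.0, 0.0]), alpha=0.1, iters=5000)
print(theta)  # approaches [0, 2]
```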

How does gradient descent converge with a fixed step size α?¶

Gradient descent can converge to a local minimum, even with the learning rate α fixed.

As we approach a local minimum, gradient descent will automatically take smaller steps. So, no need to decrease α over time.

WEEK 3 - Logistic Regression¶

Classification and Representation - Hypothesis Representation¶

Let’s change the form for our hypotheses $$h_θ(x)$$ to satisfy $$0≤h_θ(x)≤1$$. This is accomplished by plugging $$θ^Tx$$ into the Logistic Function. Our new form uses the “Sigmoid Function” , also called the “Logistic Function” :

$h_\theta(x) = g(\theta^T x), \quad z = \theta^T x, \quad g(z) = {\frac 1 {1 + e^{-z}}}$

$$h_\theta(x)$$ will give us the probability that our output is 1. For example, $$h_\theta(x)=0.7$$ gives us a probability of 70% that our output is 1. Our probability that our prediction is 0 is just the complement of our probability that it is 1 (e.g. if the probability that it is 1 is 70%, then the probability that it is 0 is 30%).

Logistic Regression Model - Cost Function¶

We cannot use the same cost function that we use for linear regression because the Logistic Function will cause the output to be wavy, causing many local optima. In other words, it will not be a convex function.

TODO: why does this cost function turn out to be convex, and how can it be plotted?

Instead, our cost function for logistic regression looks like:

\begin{align}\begin{aligned}J(\theta)= {\frac 1 m } \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)})\\\mathrm{Cost}(h_\theta(x),y) = -\log(h_\theta(x)) \text{ if } y = 1\\\mathrm{Cost}(h_\theta(x),y) = -\log(1-h_\theta(x)) \text{ if } y = 0\end{aligned}\end{align}
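The two cases can be folded into one vectorized expression, since y is 0 or 1 and picks out which branch applies. A NumPy sketch with made-up toy examples:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(X, y, theta):
    """J(theta) = (1/m) * sum of -log(h) where y=1 and -log(1-h) where y=0.

    Multiplying by y and (1-y) selects the right branch per example.
    """
    m = len(y)
    h = sigmoid(X @ theta)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m

# Two toy examples, one of each class (first column of X is the intercept).
X = np.array([[1.0, 2.0], [1.0, -2.0]])
y = np.array([1.0, 0.0])
print(logistic_cost(X, y, np.array([0.0, 1.0])))
```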

Before starting to implement any learning algorithm, it is always good to visualize the data if possible.

WEEK 4 - Neural Networks¶

Model Representation¶

In this model our $$x_0$$ input node is sometimes called the “bias unit.” It is always equal to 1. In neural networks, we use the same logistic function as in classification, $$\frac 1 {1+e^{-θ^Tx}}$$, yet we sometimes call it a sigmoid (logistic) activation function. In this situation, our “theta” parameters are sometimes called “weights”.

input layer -> hidden layers -> output layer

If network has $$s_j$$ units in layer j and $$s_{j+1}$$ units in layer j+1, then $$Θ^{(j)}$$ will be of dimension $$s_{j+1}×(s_j+1)$$ .
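The dimension rule $$s_{j+1}×(s_j+1)$$ can be checked with a tiny forward pass; the +1 column is for the bias unit. A NumPy sketch of a 3-4-1 network with random weights (the layer sizes and names are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(a, Theta):
    """One layer of forward propagation: prepend the bias unit a_0 = 1,
    then apply the sigmoid activation to Theta @ a."""
    a = np.concatenate(([1.0], a))
    return 1.0 / (1.0 + np.exp(-(Theta @ a)))

# A 3-4-1 network: s_1 = 3, s_2 = 4, s_3 = 1.
Theta1 = rng.standard_normal((4, 3 + 1))  # s_2 x (s_1 + 1)
Theta2 = rng.standard_normal((1, 4 + 1))  # s_3 x (s_2 + 1)

x = np.array([0.5, -1.0, 2.0])
h = forward(forward(x, Theta1), Theta2)
print(h.shape)  # (1,)
```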

Applications - Examples and Intuitions¶

We can construct the fundamental logic operations in computers by using a small neural network rather than an actual AND / OR gate.

AND:

$h_\theta (x) = g(-30 + 20x_1 + 20x_2)$

OR:

$h_\theta (x) = g(-10 + 20x_1 + 20x_2)$

NOT:

$h_\theta (x) = g(10 - 20x_1)$

(NOT $$x_1$$) AND (NOT $$x_2$$):

$h_\theta (x) = g(10 - 20x_1 - 20x_2)$
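The weight choices above can be checked numerically. Combining the AND unit and the (NOT x1) AND (NOT x2) unit with an OR output unit yields XNOR, the classic two-layer example from this lecture (sketch; the function names are my own):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Each gate is a single sigmoid unit with hand-picked weights.
def AND(x1, x2):
    return sigmoid(-30 + 20 * x1 + 20 * x2)

def OR(x1, x2):
    return sigmoid(-10 + 20 * x1 + 20 * x2)

def NOT(x1):
    return sigmoid(10 - 20 * x1)

def NOR(x1, x2):                     # (NOT x1) AND (NOT x2)
    return sigmoid(10 - 20 * x1 - 20 * x2)

def XNOR(x1, x2):
    # A two-layer network: hidden layer = [AND, NOR], output layer = OR.
    return OR(AND(x1, x2), NOR(x1, x2))

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(XNOR(x1, x2)))
```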

Multiclass Classification¶

y = [1 0 0 0], [0 1 0 0], [0 0 1 0], [0 0 0 1] // each of these should be a column vector
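This one-hot encoding of the class label is a few lines of NumPy (sketch; helper name is my own):

```python
import numpy as np

def one_hot(label, num_classes):
    """Encode class `label` (0-based) as a one-hot column vector."""
    y = np.zeros((num_classes, 1))
    y[label, 0] = 1.0
    return y

print(one_hot(2, 4).ravel())  # [0. 0. 1. 0.]
```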

WEEK 5 - Neural Networks: Learning¶

(continues with the neural network cost function and the backpropagation algorithm)

Cost function¶

• L = total number of layers in the network
• $$s_l$$ = number of units (not counting bias unit) in layer l
• K = number of output units/classes

$J(\Theta) = - {\frac 1 m} \sum_{i=1}^m \sum_{k=1}^K [y_k^{(i)} \log((h_\Theta(x^{(i)}))_k) + (1-y_k^{(i)}) \log(1 - (h_\Theta(x^{(i)}))_k)] + {\frac \lambda {2m}} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (\Theta_{j,i}^{(l)})^2$

Note:

• the double sum simply adds up the logistic regression costs calculated for each cell in the output layer
• the triple sum simply adds up the squares of all the individual Θs in the entire network.
• the i in the triple sum does not refer to training example i
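The double and triple sums above map directly onto two array operations. A NumPy sketch, assuming the network outputs H and one-hot labels Y are already computed (the toy values here are made up, and the bias column j = 0 is excluded from the regularization, as the indices starting at 1 in the formula indicate):

```python
import numpy as np

def nn_cost(H, Y, Thetas, lam):
    """Regularized neural-network cost.

    H: (m, K) matrix of network outputs (h_Theta(x^(i)))_k,
    Y: (m, K) one-hot label matrix,
    Thetas: list of weight matrices Theta^(l), one per layer transition.
    The double sum is the logistic cost over every output unit; the triple
    sum squares every weight except the bias column.
    """
    m = Y.shape[0]
    cost = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    reg = sum(np.sum(T[:, 1:] ** 2) for T in Thetas)  # skip bias weights
    return cost + lam * reg / (2 * m)

# Toy check: fairly confident predictions and zero weights give a small cost.
H = np.array([[0.9, 0.1], [0.2, 0.8]])
Y = np.array([[1.0, 0.0], [0.0, 1.0]])
print(nn_cost(H, Y, [np.zeros((2, 3))], lam=1.0))
```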