Coursera ML Notes

WEEK 1 - Introduction


  • Database mining
    • Clickstream data,web click data 分析,已经成为硅谷一项巨大的产业
  • Applications can’t program by hand
    • Autonomous helicopter 飞行器,很难写个程序让飞行器飞行,而只能让它“自主”飞行。

Supervised Learning & Unsupervised Learning

Supervised Learning

In supervised learning, we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output.

Supervised learning problems are categorized into “regression” and “classification” problems. In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. In a classification problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories.

Unsupervised Learning

Unsupervised learning allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don’t necessarily know the effect of the variables.

We can derive this structure by clustering the data based on relationships among the variables in the data.

With unsupervised learning there is no feedback based on the prediction results.

Google News 对新闻聚类,找出相同 topic 的新闻。


Clustering: Take a collection of 1,000,000 different genes, and find a way to automatically group these genes into groups that are somehow similar or related by different variables, such as lifespan, location, roles, and so on.

Non-clustering: The “Cocktail Party Algorithm”, allows you to find structure in a chaotic environment. (i.e. identifying individual voices and music from a mesh of sounds at a cocktail party). “鸡尾酒会算法”,允许你在混乱的环境中找到结构。(例如,从鸡尾酒会的一对声音中识别出单个声音和音乐)。

Model and Cost Function

Our goal is, given a training set, to learn a function h : X → Y so that h(x) is a “good” predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis .

We can measure the accuracy of our hypothesis function by using a cost function . This takes an average difference (actually a fancier version of an average) of all the results of the hypothesis with inputs from x’s and the actual output y’s.

\[J(\theta_0, \theta_1) = {\frac 1 {2m}} \sum_{i=1}^m (\hat{y_i} - y_i)^2 = {\frac 1 {2m}} \sum_{i=1}^m (h_\theta(x_i) - y_i)^2\]

This function is otherwise called the “Squared error function” , or “Mean squared error” . The mean is halved \({\frac 1 2}\) as a convenience for the computation of the gradient descent, as the derivative term of the square function will cancel out the \({\frac 1 2}\) term.

A contour plot is a graph that contains many contour lines. A contour line of a two variable function has a constant value at all points of the same line.


Gradient Descent

The way we do this is by taking the derivative (the tangential line to a function) of our cost function. The slope of the tangent is the derivative at that point and it will give us a direction to move towards . We make steps down the cost function in the direction with the steepest descent. The size of each step is determined by the parameter α, which is called the learning rate .

The gradient descent algorithm is:

repeat until convergence:

\[\theta_j : = \theta_j − α {\frac \partial {\partial \theta_j}} J(\theta_0,\theta_1)\]

where j=0,1 represents the feature index number.

At each iteration j, one should simultaneously update the parameters θ1,θ2,…,θn. Updating a specific parameter prior to calculating another one on the j(th) iteration would yield to a wrong implementation.

How does gradient descent converge with a fixed step size α?

Gradient descent can converge to a local minimum, even with the learning rate α fixed.

As we approach a local minimum, gradient descent will automatically take smaller steps. So, no need to decrease α over time.

WEEK 3 - Logistic Regression

Classification and Representation - Hypothesis Representation

Let’s change the form for our hypotheses \(h_θ(x)\) to satisfy \(0≤h_θ(x)≤1\). This is accomplished by plugging \(θ^Tx\) into the Logistic Function. Our new form uses the “Sigmoid Function” , also called the “Logistic Function” :

\[h_\theta(x) = g(\theta^T x) = {\frac 1 {1 + e^{-z}}}\]

\(h_\theta(x)\) will give us the probability that our output is 1. For example, hθ(x)=0.7 gives us a probability of 70% that our output is 1. Our probability that our prediction is 0 is just the complement of our probability that it is 1 (e.g. if probability that it is 1 is 70%, then the probability that it is 0 is 30%).

使用 google chart 搜索可以用 = 1 / (1 + e^(-z))

Logistic Regression Model - Cost Function

We cannot use the same cost function that we use for linear regression because the Logistic Function will cause the output to be wavy, causing many local optima . In other words, it will not be a convex function.

就是说把假设函数 sigmoid function 带入到梯度下降的 cost function 中,得到的是一个非凸函数。

TODO 为什么会得到一个 凸函数 ,怎么画出来?

Instead, our cost function for logistic regression looks like:

\[ \begin{align}\begin{aligned}J(\theta)= {\frac 1 m } \sum_{i=1}^m Cost(h_\theta(x(i)),y(i))\\Cost(h_θ(x),y) = −log(h_θ(x)), if y = 1\\Cost(h_θ(x),y) = −log(1−h_θ(x)), if y = 0\end{aligned}\end{align} \]

Before starting to implement any learning algorithm, it is always good to visualize the data if possible.

WEEK 4 - Neural Networks

Model Representation

In this model our \(x_0\) input node is sometimes called the ” bias unit .” It is always equal to 1. In neural networks, we use the same logistic function as in classification, \(\frac 1 {1+e^{-θ^Tx}}\) , yet we sometimes call it a sigmoid (logistic) activation function. In this situation, our “theta” parameters are sometimes called “weights”.

input layer -> hidden layers -> output layer

If network has \(s_j\) units in layer j and \(s_{j+1}\) units in layer j+1, then \(Θ^{(j)}\) will be of dimension \(s_{j+1}×(s_j+1)\) .

Applications - Examples and Intuitions 应用 - 例子和直觉

We can constructed the fundamental operations in computers by using a small neural network rather than using an actual AND / OR gate. 可以用小的神经网络而不是实际的与非门来模拟基础的逻辑运算。

例如 AND 运算:

\[h_\theta (x) = g(-30 + 20x_2 + 20x_2)\]

OR 运算:

\[h_\theta (x) = g(-10 + 20x_2 + 20x_2)\]

把你想要否定的变量前放一个大的负数,就可以表示 NOT :

\[h_\theta (x) = g(10 - 20x)\]

混和前面的运算可以得到 (NOT x_1) and (NOT x_2):

\[h_\theta (x) = g(10 - 20x_1 - 20x_2)\]

使用多层混合就可以计算相当复杂的非线性函数,比如 XOR,对 input layer 分别 AND 和 (NOT x_1) and (NOT x_2),得到一个 hidden layer,对 hidden layer 继续执行 OR,就得到了一个非线性的 XOR 作为 output layer。


Multiclass Classification

和前面逻辑回归一样,都转化为 One-vs-all 的做法求解。对于 4 个分类的模型,的结果集 y 是以下四个向量之一:

y = [1 0 0 0], [0 1 0 0], [0 0 1 0], [0 0 0 1] // 向量应该是竖着的

// 这里开始是 WEEK 5 - Neural Networks: Learning,继续将神经网络的 cost function 和 backpropagation algorithm

Cost function


  • L = total number of layers in the network
  • sl = number of units (not counting bias unit) in layer l
  • K = number of output units/classes

神经网络的 cost function 和前面 logistic regression 的一样,只是对 K 个 output units 中的每一个都做了累加,regulation item 也包含了中间每一层 hidden layer 的所有 \(\theta\) 参数累加:

\[J(\theta) = - {\frac 1 m} \sum_{i=1}^m \sum_{k=1}^K [y_k(i)log((h_\theta(x^{(i)}))_k) + (1-y_k^{(i)}) log(1 - (h_\theta(x^{(i)}))_k)] + {\frac \lambda {2m}} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_l + 1} (\theta_{j,i}^{(l)})^2\]


  • the double sum simply adds up the logistic regression costs calculated for each cell in the output layer
  • the triple sum simply adds up the squares of all the individual Θs in the entire network.
  • the i in the triple sum does not refer to training example i

Backpropagation Algorithm

用来计算优化 cost function 时需要的偏微分,给定一个 \(\theta\) ,先正向获得每一层的值,然后再方向倒推出偏微分。

微积分不好的我,没有太看懂倒推的逻辑。 TODO

直觉上,\(δ_j^{(l)}\) 是由 \(a_j^{(;)}\) 衍生出来的下一级元素的偏差偏差之和,