Mahine Learning in Action - Notes


  1. 分类
  • k-Nearest Neighbors
  • 决策树
  • 使用概率分布算法分类,朴素贝叶斯
  • Logistic 回归 - 算法优化 - 如果处理缺失值
  • 支持向量机
  • AdaBoost 集成方法 - 训练样本非均匀分布时所引发的非均衡分类问题
  1. 利用回归预测数值型数据
  • 回归、去噪、局部加权线性回归、偏差方差折中
  • 基于树的回归算法和分类回归树(CART)
  1. 无监督学习。无须用户知道搜寻目标,由算法得到共同特征。
  • K-means 聚类算法
  • 用户关联分析的 Aprior 算法
  • FG-Growth 改进关联分析
  1. 附属内容
  • 两个数学运算工具,用于消除噪音:主成分分析和奇异值分解
  • MapReduce 架构,分布式计算

Chapter 1 机器学习基础

在社会科学领域,正确率达 60% 以上的分析被认为是非常成功的。



unsupervised learning

  • In unsupervised learning, we may also want to find statistical values that describe the data. This is known as density estimation. 使用聚类算法分组后,如果还需要估计数据雨每个分组的相似程度,则需要使用 density estimation .
  • Another task of unsupervised learning may be reducing the data from many features to a small number so that we can properly visualize it in two or three dimensions. 减少数据维度。


  • 是否为空值
  • 是否可以识别出模式
  • 是否有明显的异常值
  • 通过一二三维图展示数据


python 开发环境将来会集成 Pylab 模块,它将 NumPy, SciPy 和 Matplotlib 合并为一个开发环境。

Chapter 2 Classifying with k-Nearest Neighbors

For every point in our dataset:

  1. calculate the distance between inX and the current point
  2. sort the distances in increasing order
  3. take k items with lowest distances to inX
  4. find the majority class among these items
  5. return the majority class as our prediction for the class of inX
from numpy import *
import operator

def classify0(inX, dataSet, labels, k):
    '''k-Nearest Neighbor'''
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize,1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    sortedDistIndicies = distances.argsort()
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
    sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createDataSet():
    group = array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
    labels = ['A','A','B','B']
    return group, labels

group, labels = createDataSet()
print classify0([0,0], group, labels, 3)

Chapter 3 Splitting datasets one feature at a time: decision trees

Information gain

Shannon entropy

The measure of information of a set is known as the Shannon entropy, or just entropy for short. Its name comes from the father of information theory, Claude Shannon.

Entropy is defined as the expected value of the information . First, we need to define information . If you’re classifying something that can take on multiple values, the information for symbol :math:`x_i` is defined as

\[l(x_i) = log_2 p(x_i)\]

where \(p(x_i)\) is the probability of choosing this class.

To calculate entropy, you need the expected value of all the information of all possible values of our class . This is given by

\[H = - \sum_{i=1}^n p(x_i) log_2 p(x_i)\]

where n is the number of classes.

Function to calculate the Shannon entropy of a dataset

from math import log
from collections import Counter
def calShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = Counter([featVec[-1] for featVec in dataSet])
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt
def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels

See in action:

>>> myData, labels = createDataSet()
>>> calcShannonEnt(myData)

The higher the entropy, the more mixed up the data is.

Splitting the dataset

Let’s split the dataset in a way that will give us the largest information gain . We won’t know how to do that unless we actually split the dataset and measure the information gain.

def splitDataSet(dataSet, axis, value):
    '''Dataset splitting on a given feature.

        dataSet: the dataset we'll split.
        axis: the feature we'all split on.
        value: the value of the feature to return.
    return [featVec[:axis] + feacVec[axis+1:] for featVec in dataSet
                if featVec[axis] == value]

See in action:

>>> splitDataSet(myData, 0, 1)
[[1, 'yes'], [1, 'yes'], [0, 'no']]
>>> splitDataSet(myData, 0, 0)
[[1, 'no'], [1, 'no']]

You’re now going to combine the Shannon entropy calculation and the splitDataSet() function to cycle through the dataset and decide which feature is the best to split on.

def chooseBestFeatureToSplit(dataSet):
    '''Choose the best feature to split on.'''
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        # Calc entropy for each split
        uniqueVals = set(example[i] for example in dataSet)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if (infoGain > bestInfoGain):
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

See in action:

>>> chooseBestFeatureToSplit(myData)

Recursively building the tree

You’ll stop under the following conditions: you run out of attributes on which to split or all the instances in a branch are the same class. If all instances have the same class, then you’ll create a leaf node, or terminating block.

The first stopping condition makes this algorithm tractable, and you can even set a bound on the maximum number of splits you can have. If our dataset has run out of attributes but the class labels are not all the same, you must decide what to call that leaf node. In this situation, you’ll take a majority vote.

from collections import Counter
def majorityCnt(classList):
    '''Take a majority vote.'''
    return Counter(classList).most_common()[0][0]
def createTree(dataSet, labels):
        labels: The list of labels contains a label for each of the features in the dataset.
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList): # Stop when all classes are equal
        return classList[0]
    if len(dataSet[0]) == 1: # When no more features, return majority
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    uniqueFeatValues = set(example[bestFeat] for example in dataSet)
    for value in uniqueFeatValues:
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat. value), subLabels)
    return myTree

TODO createTree 的代码应该多看几遍

TODO 怎么持久存储一棵树

You’ll use the Python dictionary to store the tree.

See in action:

>>> myData, labels = createDataSet()
>>> myTree = createTree(myData, labels)
>>> myTree
{'no surpacing': {0, 'no': 1: {'flippers': {0: 'no', 1: 'yes'}}}}

In our example, we have three leaf nodes and two decision nodes .

Testing and storing the classifier

In this section, you’ll build a classifier that uses our tree, and then you’ll see how to persist that classifier on disk for longer storage in a real application. Finally, you’ll put our decision tree code to use on some real data to see if you can predict what type of contact lenses a person should use.

You want to put our tree to use doing some classification after you’ve learned the tree from our training data, but how do you do that? You need our tree and the label vector that you used in creating the tree. The code will then take the data under test and compare it against the values in the decision tree. It will do this recursively until it hits a leaf node; then it will stop because it has arrived at a conclusion.

def classify(inputTree, featLabels, testVec):
    # Classification function for an existing decision tree
    firstStr = inputTree.keys()[0]
    secondDict = inputTree(firstStr)
    featIndex = featLabels.index(firstStr)
    for key in secondeDict.keys():
        if testVec[featIndex] == key:
            if type(secondDict[key]).__name__ == 'dict':
                # 递归过程中,testVec 自身是不改变的。枝干自然会只取自己需要的信息。
                classLabel = classify(secondDict[key], featLabels, testVec)
                classLabel = secondDict[key]
     return classLabel

See in Action:

>>> myDat,labels=trees.createDataSet()
>>> labels
['no surfacing', 'flippers']
>>> myTree=treePlotter.retrieveTree (0)
>>> myTree
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
 >>> trees.classify(myTree,labels,[1,0])
 >>> trees.classify(myTree,labels,[1,1])


The tree in figure 3.8 matches our data well; however, it probably matches our data too well. This problem is known as overfitting. In order to reduce the problem of overfitting, we can prune the tree . This will go through and remove some leaves. If a leaf node adds only a little information, it will be cut off and merged with another leaf. We’ll investigate this further when we revisit decision trees in chapter 9.

In chapter 9 we’ll also investigate another decision tree algorithm called CART. The algorithm we used in this chapter, ID3, is good but not the best. ID3 can’t handle numeric values. We could use continuous values by quantizing them into discrete bins, but ID3 suffers from other problems if we have too many splits.

A decision tree classifier is just like a work-flow diagram with the terminating blocks representing classification decisions. Starting with a dataset, you can measure the inconsistency of a set or the entropy to find a way to split the set until all the data belongs to the same class. The ID3 algorithm can split nominal-valued datasets. Recursion is used in tree-building algorithms to turn a dataset into a decision tree. The tree is easily represented in a Python dictionary rather than a special data structure.

The first two chapters in this book have drawn hard conclusions about data such as “This data instance is in this class!” What if we take a softer approach, such as “Well, I’m not quite sure where that data should go. Maybe here? Maybe there?” What if we assign a probability to a data instance belonging to a given class? This will be the focus of the next chapter.

Chapter 4 Classifying with probability theory: naïve Bayes

This chapter covers

  • Using probability distributions for classification
  • Learning the naïve Bayes classifier
  • Parsing data from RSS feeds
  • Using naïve Bayes to reveal regional attitudes

// TODO reginal attributes 具体是指什么?

In the first two chapters we asked our classifier to make a definite answer for the question “Which class does this data instance belong to?” Sometimes the classifier got the answer wrong. We could instead ask the classifier to give us a best guess about the class and assign a probability estimate to that best guess.

We start out with the simplest probabilistic classifier and then make a few assumptions and learn the naïve Bayes classifier . It’s called naïve because the formulation makes some naïve assumptions. We’ll take full advantage of Python’s text-processing abilities to split up a document into a word vector . This will be used to classify text. Finally, we’ll show how you can put what the classifier has learned into human-readable terms from a bunch of personal ad postings.

Classifying with Bayesian decision theory

Naive Bayes

  • Pros: Works with a small amount of data, handles multiple classes
  • Cons: Sensitive to how the input data is prepared
  • Works with: Nominal values // 标称值

Naive Bayes is a subset of Bayesian decision theory.


  • statiistical parameters 是什么?这些点的分类信息是已知的吗(training set)?
  • 只有这样才能够使用 kNN 和 decision tree.
  • 怎么 Compute the probability of each class ?
Bayesian probability allows prior knowledge and logic to be applied to uncertain statements . There’s another interpretation called frequency probability, which only draws conclusions from data and doesn’t allow for logic and prior knowledge .

Conditional probability:

\[p(x|c) = \frac {p(x \land c)} {p(c)}\]

Bayes’ rule:

\[p(c|x) = \frac {p(x|c) p(c)} {p(x)}\]

Document classification with naïve Bayes

General approach to naïve Bayes

  • Analyze: With many features, plotting features isn’t helpful. Looking at histograms is a better idea.
  • Train: Calculate the conditional probabilities of the independent features.
  • Test: Calculate the error rate.

Statistics tells us that if we need N samples for one feature, we need :math:`N^10` for 10 features and :math”`N^1000` for our 1,000-feature vocabulary.

If we assume independence among the features, then our \(N^1000\) data points get reduced to 1000*N. By independence I mean statistical independence; one feature or word is just as likely by itself as it is next to other words . The other assumption we make is that every feature is equally important. We know that isn’t true either. If we were trying to classify a message board posting as inappropriate, we probably don’t need to look at 1,000 words; maybe 10 or 20 will do. Despite the minor flaws of these assumptions, naïve Bayes works well in practice.

TODO 统计学中 N^1000 和 1000*N 这两个数字是怎么来的?

Classifying text with Python

\[p(c_i|w) = \frac {p(w|c_i)p(c_i)} {p(w)}\]

How can we get \(p(w|c_i)\) ? This is where our naïve assumption comes in. If we expand w into individual features , we could rewrite this as \(p(w_0,w_1,w_2..w_N|c_i)\) . Our assumption that all the words were independently likely, and something called conditional independence , says we can calculate this probability as p(w0|ci)p(w1|ci)p(w2|ci)…p(wN|ci). This makes our calculations a lot easier.

# Word list to vector

def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', \
                    'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', \
                    'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', \
                     'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how',\
                     'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0,1,0,1,0,1] # 1 is abusive, 0 not
    return postingList, classVec

def createVocabList(dataSet):
    return list(set(sum(dataSet, [])))

def setOfWord2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
            print "the word: %s is not in my Vocabulary!" % word
    return returnVec
def trainNB0(trainMatrix, trainCategory):
    '''Naive Bayes classifier training function'''
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0]) # trainMatrix 中每一行的 len 是一样的,都是 vocabList 的长度
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # p0Num = zeros(numWords); p1Num = zeros(numWords)  # numerator
    # p0Denom = 0.0, p1Denom = 0.0  # denominator
    p0Num = ones(numWords); p1Num = ones(numWords)
    p0Denom = 2.0; p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i] # 向量操作
            p1Denom += sum(trainMatrix[i])
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # p1Vect = p1Num / p1Denom # category=p1 中每个 token 的概率, 即 p(wi|c0)
    # p0Vect = p0Num / p0Denom # category=p0 中每个 token 的概率,即 p(wi|c1)
    # use log to avoid underflow
    p1Vect = log(p1Num / p1Denom)
    p2Vect = log(p2Num / p2Denom)
    return p0Vect, p1Vect, pAbusive # p(wi|c0), p(wi|c1), p(c1)

See in action:

>>> from numpy import *
>>> listOPosts,listClasses = loadDataSet()
This loads the data from preloaded values.
>>> myVocabList = createVocabList(listOPosts)
You’ve now created a list of all our words in myVocabList.
>>> trainMat=[]
>>> for postinDoc in listOPosts:
... trainMat.append(bayes.setOfWords2Vec(myVocabList, postinDoc))
>>> p0V,p1V,pAb=bayes.trainNB0(trainMat,listClasses)

Before we can go on to classification with this, we need to address a few flaws in the previous function.

Test: modifying the classifier for real-world conditions

When we attempt to classify a document, we multiply a lot of probabilities together to get the probability that a document belongs to a given class. This will look something like p(w0|1)p(w1|1)p(w2|1). If any of these numbers are 0, then when we multiply them together we get 0 . To lessen the impact of this, we’ll initialize all of our occurrence counts to 1, and we’ll initialize the denominators to 2.

# TODO 1 和 2 这两个值是怎么确定的?

Another problem is underflow: doing too many multiplications of small numbers. When we go to calculate the product p(w0|ci)p(w1|ci)p(w2|ci)…p(wN|ci) and many of these numbers are very small, we’ll get underflow, or an incorrect answer. (Try to multiply many small numbers in Python. Eventually it rounds off to 0.) One solution to this is to take the natural logarithm of this product . If you recall from algebra, :math:`ln(a*b)=ln(a)+ln(b)` . Doing this allows us to avoid the underflow or round-off error problem . Do we lose anything by using the natural log of a number rather than the number itself? The answer is no. Figure 4.4 plots two functions, f(x) and ln(f(x)). If you examine both of these plots, you’ll see that they increase and decrease in the same areas, and they have their peaks in the same areas. Their values are different, but that’s fine.

# TODO f(x) 可以替换成 ln(x) 的数学推理?

# Naive Bayes classify function

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # the multiplication is element-wise
    p1 = sum(vec2Classify * p1Vec) + log(pClass1) # TODO 这里对 log 用加法的原因什么?
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
        return 0

def testingNB():
    listOPost, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB(array(trainMat), array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb)
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setofWords2Vec(myVocabList, testEntry))
    print testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb)

See in action:

['love', 'my', 'dalmation'] classified as:  0
['stupid', 'garbage'] classified as:  1

This example is overly simplistic, but it demonstrates how the naïve Bayes classifier works. We’ll next make a few changes to it so that it will work even better.

Prepare: the bag-of-words document model

Up until this point we’ve treated the presence or absence of a word as a feature. This could be described as a set-of-words model. If a word appears more than once in a document, that might convey some sort of information about the document over just the word occurring in the document or not. This approach is known as a bag-of-words model. To accommodate for this we need to slightly change the function setOfWords2Vec() and call it bagOfWords2VecMN().

def bagOfWords2VecMN(vocabList, inputSet):
    '''Naive Bayes bag-of-words model'''
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

Example: classifying spam email with naïve Bayes

Example: using naïve Bayes to classify email

  • Prepare: Parse text into token vectors.
  • Analyze: Inspect the tokens to make sure parsing was done correctly.
  • Train: Use trainNB0() that we created earlier.
  • Test: Use classifyNB() and create a new testing function to calculate the error rate over a set of documents.
  • Use: Build a complete program that will classify a group of documents and print misclassified documents to the screen.

Prepare: tokenizing text

  • split
  • remove punctuation
  • lower
  • filter out words with less than three characters

Test: cross validation with naïve Bayes

import re

# File parsing and full spam test functions

def textParse(bigString):
    listOfTokens = re.split(r'\W', bigString)
    return [token.lower() for token in listOfTokens if len(token) > 2]

def spamTest():
    # Prepare
    docList = []
    classList = []
    fullText = []
    for i in range(1,26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        wordList = textParse(open('email/ham/%s.txt' % i).read())
    vocabList = createVocabList(docList)
    # draw indexes of testSet
    trainingSet = range(50); testSet = []
    for i in range(10):
        randIndex = int(random.uniform(0, len(trainingSet)))
    # Train
    trainMat=[]; trainClasses=[]
    for docIndex in trainingSet:
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    # Test
    errorCount = 0
    for docIndex in testSet:
        wordVector = setOfWords2Vec(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print 'the error is: ', float(errorCount) / len(testSet)

Since test emails are randomly selected, the results may be different each time. If there’s an error, it will display the word list for that document to give you an idea of what was misclassified . To get a good estimate of the error rate, you should repeat this procedure multiple times , say 10, and average the results. I did that and got an average error rate of 6%.

The error that keeps appearing is a piece of spam that was misclassified as ham. It’s better that a piece of spam sneaks through the filter than a valid email getting shoved into the spam folder . There are ways to bias the classifier to not make these errors, and we’ll talk about these in chapter 7.

The next example will show how to interpret the knowledge acquired from training the naïve Bayes classifier.

Example: using naïve Bayes to reveal local attitudes from personal ads

There are a number of other uses for classification. I’ve seen someone take the naïve Bayes classifier and train it with social network profiles of women he liked and women he didn’t like and then use the classifier to test how he would like an unknown person . The range of possibilities is limited only by your imagination. It’s been shown that the older someone is, the better their vocabulary becomes. Could we guess a person’s age by the words they use? Could we guess other factors about the person? Advertisers would love to know specific demographics about a person to better target the products they promote. Where would you get such training material? The internet abounds with training material. Almost every imaginable niche has a dedicated community where people have identified themselves as belonging to that community. The Dalmatian owners’ site used in section 4.3.1 is a great example.

In this last example, we’ll take some data from personals ads from multiple people for two different cities in the United States. We’re going to see if people in different cities use different words. If they do, what are the words they use? Can the words people use give us some idea what’s important to people in different cities?

Example: using naïve Bayes to find locally used words

  • Use: We’ll build a complete program to wrap everything together. It will display the most common words given in two RSS feeds.

TODO: 提取 most common words 不是简单计数就好了,需要用 bayes 吗?好像其实是一样的,conditional probability 本身就是分类下的计数。

We’re going to use the city that each ad comes from to train a classifier and then see how well it does. Finally, we’re not going to use this to classify anything. We’re going to look at the words and conditional probability scores to see if we can learn anything specific to one city over another.

# RSS feed classifiler and frequent word removal functions

def calMostFreq(vocabList, fullText):
    # Calculates frequency of occurrence
    import operator
    freqDict = {}
    for token in vocabList:
        freqDict[token] = fullText.count(token)
    sortedFreq = sorted(freqDict.iteritems(), key=operator.itemgetter(1),
    return sortedFreq[:30]

def localWords(feed1, feed0):
    import feedparser
    docList=[]; classList=[]; fullText=[]
    minLen = min(len(feed1['entries']), len(feed0['entries']))
    for i in range(minLen):
        wordList = textParse(feed1['entries'][i]['summary'])
        wordList = textParse(feed0['entries'][i]['summary'])
    # Remove most frequently occurring words
    # TODO 怎么确定需要移除的 topN 的数量?
    vocabList = createVocabList(docList)
    top30Words = calcMostFreq(vocabList, fullText) # vocabList 在这里只是起了一个 filter 的作用,而且它本身是从 fullText(sum of docList) 来的。为什么不直接对 fullText 计算 mostCommon,而要绕这么一圈?可能考虑到在一般应用中,vocabList 并不是 fullText 的集合.
    for pairW in top30Words:
        if pairW[0] in vocabList:
    # draw testSet
    trainingSet = range(2*minLen); testSet=[]
    for i in range(20):
        randIndex = int(random.uniform(0, len(trainingSet)))
    # Train
    trainMat=[]; trainClasses=[]
    for docIndex in trainingSet:
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
    p0V, p1V, pFeed1 = trainNB0(array(trainMat), array(trainClasses))
    # Test
    errorCount = 0
    for docIndex in testSet:
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != \
            errorCount += 1
    print 'the error rate is: ', float(errorCount) / len(testSet)
    return vocabList, p0V, p1V

You can comment out the three lines that removed the most frequently used words and see the performance before and after. When I did this,I had an error rate of 54% without these lines and 70% with the lines included . An interesting observation is that the top 30 words in these posts make up close to 30% of all the words used . The size of the vocabList was ~3000 words when I was testing this. A small percentage of the total words makes up a large portion of the text. The reason for this is that a large percentage of language is redundancy and structural glue. Another common approach is to not just remove the most common words but to also remove this structural glue from a predefined list . This is known as a stop word list , and there are a number of sources of this available. (At the time of writing, has a good list of stop words in multiple languages.)

See in action:

>>> vocabList,pSF,pNY=bayes.localWords(ny,sf)
the error rate is:  0.1
>>> vocabList,pSF,pNY=bayes.localWords(ny,sf)
the error rate is:  0.35

Analyze: displaying locally used words

You can sort the vectors pSF and pNY and then print out the words from vocabList at the same index.

# Most descriptive word display function

def getTopWords(ny, sf):
    import operator
    vocabList, p0V, p1V = localWords(ny, sf)
    topNY=[], topSF=[]
    for i in range(len(p0V)):
        # TODO 这里为什么是负数?log 以后变成负数了
        # TODO 没有直接取 topN,而是用了一个概率的 threshold,考虑是什么?
        if p0V[i] > -6.0:
            topSF.append((vocabList[i], p0V[i]))
        if p1V[i] > -6.0:
            topNY.append((vocabList[i], p1V[i]))
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print "SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**"
    for item in sortedSF:
        print item[0]
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print "NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY **"
    for item in sortedNY:
        print item[0]

The words from this output are entertaining. One thing to note: a lot of stop words appear in the output. It would be interesting to see how things would change if you removed the fixed stop words. In my experience, the classification error will also go down.


Using probabilities can sometimes be more effective than using hard rules for classification. Bayesian probability and Bayes’ rule gives us a way to estimate unknown probabilities from known values.

You can reduce the need for a lot of data by assuming conditional independence among the features in your data . The assumption we make is that the probability of one word doesn’t depend on any other words in the document. We know this assumption is a little simple. That’s why it’s known as naïve Bayes. Despite its incorrect assumptions, naïve Bayes is effective at classification.

There are a number of practical considerations when implementing naïve Bayes in a modern programming language. Underflow is one problem that can be addressed by using the logarithm of probabilities in your calculations. The bag-of-words model is an improvement on the set-of-words model when approaching document classification. There are a number of other improvements, such as removing stop words , and you can spend a long time optimizing a tokenizer .

The probability theory you learned in this chapter will be used again later in the book, and this chapter was a great introduction to the full power of Bayesian probability theory. We’re going to take a break from probability theory. You’ll next see a classification method called logistic regression and some optimization algorithms.

Chapter 5 Logistic regression

This chapter covers

  • The sigmoid function and the logistic regression classifier
  • Our first look at optimization
  • The gradient descent optimization algorithm
  • Dealing with missing values in the our data

This is an exciting chapter because this is the first chapter where we encounter optimization algorithms . If you think about it, many of the things we do in life are optimization problems . Some examples of optimization from daily life are these: How do we get from point A to point B in the least amount of time? How do we make the most money doing the least amount of work? How do we design an engine to produce the most horsepower while using the least amount of fuel? We’ll use optimization algorithms to find these best-fit parameters . This best-fit stuff is where the name regression comes from.

We’ll look at a modified version called stochastic gradient ascent . These optimization algorithms will be used to train our classifier.

Classification with logistic regression and the sigmoid function: a tractable step function

Logistic regression

  • Pros: Computationally inexpensive, easy to implement, knowledge representation easy to interpret
  • Cons: Prone to underfitting, may have low accuracy Works with: Numeric values, nominal values 容易欠拟合
  • Works with: Numeric values, nominal values

We’d like to have an equation we can give all of our features and it will predict the class. In the two-class case, the function will spit out a 0 or a 1. Perhaps you’ve seen this before; it’s called the Heaviside step function , or sometimes just the step function. The problem with the Heaviside step function is that at the point where it steps from 0 to 1, it does so instantly. This instantaneous step is sometimes difficult to deal with. There’s another function that behaves in a similar fashion, but it’s much easier to deal with mathematically. This function is called the sigmoid. The sigmoid is given by the following equation:


  • Test: 为了量化回归的效果,需要观察错误率。 根据错误率决定是否回退到训练阶段,通过改变迭代次数和步长等参数来得到更好的回归系数。

Prepare: 处理数据中的缺失值


这里选择用 0 替换所有的缺失值,这样做的直觉在于,我们需要的是一个 在更新时不会影响系数的值 。回归系数的公式如下:

\[weights = weights + alpha * error * dataMatrix[randIndex]\]

如果 dataMatrix 的某特征值对应为 0,那么该特征的系数将不做更新,即 \(weights = weights\)

另外,由于 sigmoid(0) = 0.5 ,即它对结果的预测不具有任何倾向性,因此上述做法也不会对误差项造成任何影响。


数据很贵,到处都有 robust 和错误容忍的需求。

问题1. 前面回归系数的公式是 \(w := w + \alpha \Delta_w f(w)\) ,偏微分怎么和这里的 \(error * dataMatrix\) 对应起来?

问题2: 为什么有缺省值时,要分开考虑对特殊系数和误差项的影响。