孕交videosgratis孕妇性欧美,国产午夜亚洲精品国产成人,日本免费人成视频播放

99999久久久久久亚洲,欧美人与禽猛交狂配,高清日韩av在线影院,一个人在线高清免费观看,啦啦啦在线视频免费观看www

機(jī)器學(xué)習(xí)之決策樹(ID3)算法與Python實(shí)現(xiàn)

2017-07-23

機(jī)器學(xué)習(xí)之決策樹(ID3)算法與Python實(shí)現(xiàn)

機(jī)器學(xué)習(xí)中，決策樹是一個(gè)預(yù)測(cè)模型；他代表的是對(duì)象屬性與對(duì)象值之間的一種映射關(guān)系。樹中每個(gè)節(jié)點(diǎn)表示某個(gè)對(duì)象，而每個(gè)分叉路徑則代表的某個(gè)可能的屬性值，而每個(gè)葉結(jié)點(diǎn)則對(duì)應(yīng)從根節(jié)點(diǎn)到該葉節(jié)點(diǎn)所經(jīng)歷的路徑所表示的對(duì)象的值。決策樹僅有單一輸出，若欲有復(fù)數(shù)輸出，可以建立獨(dú)立的決策樹以處理不同輸出。數(shù)據(jù)挖掘中決策樹是一種經(jīng)常要用到的技術(shù)，可以用于分析數(shù)據(jù)，同樣也可以用來作預(yù)測(cè)。

一、決策樹與ID3概述1.決策樹

決策樹，其結(jié)構(gòu)和樹非常相似，因此得其名決策樹。決策樹具有樹形的結(jié)構(gòu)，其中每個(gè)內(nèi)部節(jié)點(diǎn)表示一個(gè)屬性上的測(cè)試，每個(gè)分支代表一個(gè)測(cè)試輸出，每個(gè)葉節(jié)點(diǎn)代表一種類別。

例如：

按照豆腐腦的冷熱、甜咸和是否含有大蒜構(gòu)建決策樹，對(duì)其屬性的測(cè)試，在最終的葉節(jié)點(diǎn)決定該豆腐腦吃還是不吃。

分類樹（決策樹）是一種十分常用的將決策樹應(yīng)用于分類的機(jī)器學(xué)習(xí)方法。他是一種監(jiān)管學(xué)習(xí)，所謂監(jiān)管學(xué)習(xí)就是給定一堆樣本，每個(gè)樣本都有一組屬性（特征）和一個(gè)類別（分類信息/目標(biāo)），這些類別是事先確定的，那么通過學(xué)習(xí)得到一個(gè)分類器，這個(gè)分類器能夠?qū)π鲁霈F(xiàn)的對(duì)象給出正確的分類。
其原理在于，每個(gè)決策樹都表述了一種樹型結(jié)構(gòu)，它由它的分支來對(duì)該類型的對(duì)象依靠屬性進(jìn)行分類。每個(gè)決策樹可以依靠對(duì)源數(shù)據(jù)庫(kù)的分割進(jìn)行數(shù)據(jù)測(cè)試。這個(gè)過程可以遞歸式的對(duì)樹進(jìn)行修剪。當(dāng)不能再進(jìn)行分割或一個(gè)單獨(dú)的類可以被應(yīng)用于某一分支時(shí)，遞歸過程就完成了。

目前常用的決策樹算法有ID3算法、改進(jìn)的C4.5算法和CART算法。

決策樹的特點(diǎn)

1.多層次的決策樹形式易于理解；

2.只適用于標(biāo)稱型數(shù)據(jù)，對(duì)連續(xù)性數(shù)據(jù)處理得不好；

2、ID3算法

ID3算法最早是由羅斯昆（J. Ross Quinlan）于1975年在悉尼大學(xué)提出的一種分類預(yù)測(cè)算法，算法以信息論為基礎(chǔ)，其核心是“信息熵”。ID3算法通過計(jì)算每個(gè)屬性的信息增益，認(rèn)為信息增益高的是好屬性，每次劃分選取信息增益最高的屬性為劃分標(biāo)準(zhǔn)，重復(fù)這個(gè)過程，直至生成一個(gè)能完美分類訓(xùn)練樣例的決策樹。

信息熵（Entropy）:

,其中p(xi)是選擇i的概率。
熵越高，表示混合的數(shù)據(jù)越多。
信息增益（Information Gain）:

T是劃分之后的分支集合，p(t)是該分支集合在原本的父集合中出現(xiàn)的概率，H(t)是該子集合的信息熵。
3.ID3算法與決策樹的流程
（1）數(shù)據(jù)準(zhǔn)備：需要對(duì)數(shù)值型數(shù)據(jù)進(jìn)行離散化
（2）ID3算法構(gòu)建決策樹：
如果數(shù)據(jù)集類別完全相同，則停止劃分
否則，繼續(xù)劃分決策樹：
計(jì)算信息熵和信息增益來選擇最好的數(shù)據(jù)集劃分方法；
劃分?jǐn)?shù)據(jù)集
創(chuàng)建分支節(jié)點(diǎn)：
對(duì)每個(gè)分支進(jìn)行判定是否類別相同，如果相同停止劃分，不同按照上述方法進(jìn)行劃分。
二、Python算法實(shí)現(xiàn)

創(chuàng)建 trees.py文件，在其中創(chuàng)建構(gòu)建決策樹的函數(shù)。
首先構(gòu)建一組測(cè)試數(shù)據(jù)：

0. 構(gòu)造函數(shù)createDataSet：

def createDataSet():
    dataSet=[[1,1,'yes'],[1,1,'yes'],[1,0,'no'],[0,1,'no'],[0,1,'no']]
    labels=['no surfacing','flippers']
    return dataSet,labels

在Python控制臺(tái)測(cè)試構(gòu)造函數(shù)

#測(cè)試下構(gòu)造的數(shù)據(jù)
import trees
myDat,labels = trees.createDataSet()
myDat
Out[4]: [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
labels

Out[5]: ['no surfacing', 'flippers']

2.1 計(jì)算信息熵

from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet) #nrows
    #為所有的分類類目創(chuàng)建字典
    labelCounts ={}
    for featVec in dataSet:
        currentLable=featVec[-1] #取得最后一列數(shù)據(jù)
        if currentLable not in labelCounts.keys():
            labelCounts[currentLable]=0
        labelCounts[currentLable]+=1
    #計(jì)算香農(nóng)熵
    shannonEnt=0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt

利用構(gòu)造的數(shù)據(jù)測(cè)試calcShannonEnt：

#Python console
In [6]: trees.calcShannonEnt(myDat)
...:
Out[6]: 0.9709505944546686

2.2 按照最大信息增益劃分?jǐn)?shù)據(jù)集

#定義按照某個(gè)特征進(jìn)行劃分的函數(shù)splitDataSet
#輸入三個(gè)變量（待劃分的數(shù)據(jù)集，特征，分類值）
def splitDataSet(dataSet,axis,value):
    retDataSet=[]
    for featVec in dataSet:
        if featVec[axis]==value :
            reduceFeatVec=featVec[:axis]
            reduceFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reduceFeatVec)
    return retDataSet #返回不含劃分特征的子集

#定義按照最大信息增益劃分?jǐn)?shù)據(jù)的函數(shù)
def chooseBestFeatureToSplit(dataSet):
    numFeature=len(dataSet[0])-1
    baseEntropy=calcShannonEnt(dataSet)#香農(nóng)熵
    bestInforGain=0
    bestFeature=-1
    for i in range(numFeature):
        featList=[number[i] for number in dataSet] #得到某個(gè)特征下所有值（某列）
        uniqualVals=set(featList) #set無(wú)重復(fù)的屬性特征值
        newEntropy=0
        for value in uniqualVals:
            subDataSet=splitDataSet(dataSet,i,value)
            prob=len(subDataSet)/float(len(dataSet)) #即p(t)
            newEntropy+=prob*calcShannonEnt(subDataSet)#對(duì)各子集香農(nóng)熵求和
        infoGain=baseEntropy-newEntropy #計(jì)算信息增益
        #最大信息增益
        if (infoGain>bestInforGain):
            bestInforGain=infoGain
            bestFeature=i
    return bestFeature #返回特征值

在控制臺(tái)中測(cè)試這兩個(gè)函數(shù)：

#測(cè)試按照特征劃分?jǐn)?shù)據(jù)集的函數(shù)
In [8]: from imp import reload
In [9]: reload(trees)
Out[9]: <module 'trees' from 'G:\\Workspaces\\MachineLearning\\trees.py'>
In [10]: myDat,labels=trees.createDataSet()
    ...:
In [11]: trees.splitDataSet(myDat,0,0)
    ...:
Out[11]: [[1, 'no'], [1, 'no']]
In [12]: trees.splitDataSet(myDat,0,1)
    ...:
Out[12]: [[1, 'yes'], [1, 'yes'], [0, 'no']]

#測(cè)試chooseBestFeatureToSplit函數(shù)
In [13]: reload(trees)
    ...:
Out[13]: <module 'trees' from 'G:\\Workspaces\\MachineLearning\\trees.py'>

In [14]: trees.chooseBestFeatureToSplit(myDat)
    ...:

Out[14]: 0

2.3 創(chuàng)建決策樹構(gòu)造函數(shù)createTree

import operater
#投票表決代碼
def majorityCnt(classList):
    classCount={}
    for vote in classList:
        if vote not in classCount.keys():classCount[vote]=0
        classCount[vote]+=1
    sortedClassCount=sorted(classCount.items,key=operator.itemgetter(1),reversed=True)
    return sortedClassCount[0][0]

def createTree(dataSet,labels):
    classList=[example[-1] for example in dataSet]
    #類別相同，停止劃分
    if classList.count(classList[-1])==len(classList):
        return classList[-1]
    #長(zhǎng)度為1，返回出現(xiàn)次數(shù)最多的類別
    if len(classList[0])==1:
        return majorityCnt(classList)
    #按照信息增益最高選取分類特征屬性
    bestFeat=chooseBestFeatureToSplit(dataSet)#返回分類的特征序號(hào)
    bestFeatLable=labels[bestFeat] #該特征的label
    myTree={bestFeatLable:{}} #構(gòu)建樹的字典
    del(labels[bestFeat]) #從labels的list中刪除該label
    featValues=[example[bestFeat] for example in dataSet]
    uniqueVals=set(featValues)
    for value in uniqueVals:
        subLables=labels[:] #子集合
        #構(gòu)建數(shù)據(jù)的子集合，并進(jìn)行遞歸
        myTree[bestFeatLable][value]=createTree(splitDataSet(dataSet,bestFeat,value),subLables)
    return myTree

以之前構(gòu)造的測(cè)試數(shù)據(jù)為例，對(duì)決策樹構(gòu)造函數(shù)進(jìn)行測(cè)試，在python控制臺(tái)進(jìn)行輸入：

#決策樹構(gòu)造函數(shù)測(cè)試
In [15]: reload(trees)
...:
Out[15]: <module 'trees' from 'G:\\Workspaces\\MachineLearning\\trees.py'>

In [16]: myTree=trees.createTree(myDat,labels)
...:

In [17]: myTree
Out[17]: {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

可以看到，最后生成的決策樹myTree是一個(gè)多層嵌套的字典。

2.4 決策樹運(yùn)用于分類

運(yùn)用決策樹進(jìn)行分類，首先構(gòu)建一個(gè)決策樹分類函數(shù)：

#輸入三個(gè)變量（決策樹，屬性特征標(biāo)簽，測(cè)試的數(shù)據(jù)）
def classify(inputTree,featLables,testVec):
    firstStr=list(inputTree.keys())[0] #獲取樹的第一個(gè)特征屬性
    secondDict=inputTree[firstStr] #樹的分支，子集合Dict
    featIndex=featLables.index(firstStr) #獲取決策樹第一層在featLables中的位置
    for key in secondDict.keys():
        if testVec[featIndex]==key:
            if type(secondDict[key]).__name__=='dict':
                classLabel=classify(secondDict[key],featLables,testVec)
            else:classLabel=secondDict[key]
    return classLabel

對(duì)決策樹分類函數(shù)進(jìn)行測(cè)試：

In [29]: reload(trees)
    ...:
Out[29]: <module 'trees' from 'G:\\Workspaces\\MachineLearning\\trees.py'>

In [30]: myDat,labels=trees.createDataSet()
    ...:

In [31]: labels
    ...:
Out[31]: ['no surfacing', 'flippers']

In [32]: myTree=treeplotter.retrieveTree(0)
    ...:

In [33]: myTree
    ...:
Out[33]: {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

In [34]: trees.classify(myTree,labels,[1,0])
    ...:
Out[34]: 'no'

In [35]: trees.classify(myTree,labels,[1,1])
    ...:

Out[35]: 'yes'

2.5 決策樹的存儲(chǔ)

如果每次都需要訓(xùn)練樣本集來構(gòu)建決策樹，費(fèi)時(shí)費(fèi)力，特別是數(shù)據(jù)很大的時(shí)候，每次重新構(gòu)建決策樹浪費(fèi)時(shí)間。因此可以將已經(jīng)創(chuàng)建的決策樹（如字典形式）保存在硬盤上，需要使用的時(shí)候直接讀取就好。
（1）存儲(chǔ)函數(shù)

    1def storeTree(inputTree,filename):
    import pickle
    fw=open(filename,'wb') #pickle默認(rèn)方式是二進(jìn)制，需要制定'wb'
    pickle.dump(inputTree,fw)
    fw.close()

（2）讀取函數(shù)

def grabTree(filename):
    import pickle
    fr=open(filename,'rb')#需要制定'rb'，以byte形式讀取
    return pickle.load(fr)

對(duì)這兩個(gè)函數(shù)進(jìn)行測(cè)試（Python console）：

In [36]: myTree
Out[36]: {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

In [37]: trees.storeTree(myTree,'classifierStorage.txt')

In [38]: trees.grabTree('classifierStorage.txt')
Out[38]: {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

在工作目錄下存在一個(gè)名為’classifierStorage.txt’的txt文檔，該文檔保存了myTree的決策樹信息，需要使用的時(shí)候直接調(diào)出使用。

三、使用Matplotlib繪制決策樹

import matplotlib.pyplot as plt

from pylab import *
mpl.rcParams['font.sans-serif'] = ['SimHei'] #否則中文無(wú)法正常顯示

decisionNode=dict(boxstyle='sawtooth',fc='0.8') #決策點(diǎn)樣式
leafNode=dict(boxstyle='round4',fc='0.8')　#葉節(jié)點(diǎn)樣式
arrow_args=dict(arrowstyle='<-') #箭頭樣式

def plotNode(nodeTxt,centerPt,parentPt,nodeType):
    createPlot.ax1.annotate(nodeTxt,xy=parentPt,xycoords='axes fraction',
                            xytext=centerPt,textcoords='axes fraction',
                            va='center',ha='center',bbox=nodeType,arrowprops=arrow_args)

def createPlot():
    fig=plt.figure(1,facecolor='white')
    fig.clf()
    createPlot.ax1=plt.subplot(111,frameon=False)
    plotNode('決策節(jié)點(diǎn)',(0.5,0.1),(0.1,0.5),decisionNode)
    plotNode('葉節(jié)點(diǎn)',(0.8,0.1),(0.3,0.8),leafNode)
    plt.show()

#測(cè)試
#獲取葉節(jié)點(diǎn)數(shù)量（廣度）
def getNumLeafs(myTree):
    numLeafs=0
    firstStr=list(myTree.keys())[0]#'dict_keys' object does not support indexing
    secondDict=myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':
            numLeafs+=getNumLeafs(secondDict[key])
        else:numLeafs+=1
    return numLeafs

#獲取樹的深度的函數(shù)（深度）
def getTreeDepth(myTree):
    maxDepth=0
    firstStr=list(myTree.keys())[0]
    secondDict=myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':
            thisDepth=1+getTreeDepth(secondDict[key])
        else: thisDepth=1
        if thisDepth > maxDepth:
            maxDepth=thisDepth
    return maxDepth
#定義一個(gè)預(yù)先創(chuàng)建樹的函數(shù)
def retrieveTree(i):
    listOfTrees=[{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}},
                 {'no surfacing': {0: 'no', 1: {'flippers': {0: {'head':{0:'no', 1: 'yes'}},1:'no'}}}}
                 ]
    return listOfTrees[i]

#定義在父子節(jié)點(diǎn)之間填充文本信息的函數(shù)
def plotMidText(cntrPt,parentPt,txtString):
    xMid=(parentPt[0]-cntrPt[0])/2+cntrPt[0]
    yMid=(parentPt[1]-cntrPt[1])/2+cntrPt[1]
    createPlot.ax1.text(xMid,yMid,txtString)

#定義樹繪制的函數(shù)
def plotTree(myTree,parentPt,nodeTxt):
    numLeafs=getNumLeafs(myTree)
    depth=getTreeDepth(myTree)
    firstStr=list(myTree.keys())[0]
    cntrPt=(plotTree.xOff+(1.0+float(numLeafs))/2/plotTree.totalW,plotTree.yOff)
    plotMidText(cntrPt,parentPt,nodeTxt)
    plotNode(firstStr,cntrPt,parentPt,decisionNode)
    secondDict=myTree[firstStr]
    plotTree.yOff=plotTree.yOff -1/plotTree.totalD
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':
            plotTree(secondDict[key],cntrPt,str(key))
        else:
            plotTree.xOff=plotTree.xOff+1.0/plotTree.totalW
            plotNode(secondDict[key],(plotTree.xOff,plotTree.yOff),cntrPt,leafNode)
            plotMidText((plotTree.xOff,plotTree.yOff),cntrPt,str(key))
    plotTree.yOff=plotTree.yOff+1/plotTree.totalD

#定義主函數(shù)，來調(diào)用其它函數(shù)
def createPlot(inTree):
    fig=plt.figure(1,facecolor='white')
    fig.clf()
    axprops=dict(xticks=[],yticks=[])
    createPlot.ax1=plt.subplot(111,frameon=False,**axprops)
    plotTree.totalW=float(getNumLeafs(inTree))
    plotTree.totalD=float(getTreeDepth(inTree))
    plotTree.xOff=-0.5/plotTree.totalW;plotTree.yOff=1.0;
    plotTree(inTree,(0.5,1.0),'')
    plt.show()

對(duì)繪制決策樹圖的函數(shù)進(jìn)行測(cè)試（控制臺(tái)）：

In [26]: reload(treeplotter)
    ...:
Out[26]: <module 'treeplotter' from 'G:\\Workspaces\\MachineLearning\\treeplotter.py'>

In [27]: myTree=treeplotter.retrieveTree(0)
    ...:

In [28]: treeplotter.createPlot(myTree)
    ...:

得到決策樹圖：

四、實(shí)例（使用決策樹預(yù)測(cè)隱形眼鏡類型）

隱形眼鏡的數(shù)據(jù)集包含了患者的四個(gè)屬性age，prescript，stigmatic，tearRate，利用這些數(shù)據(jù)構(gòu)建決策樹，并通過Matplotlib繪制出決策樹的樹狀圖。
附lenses.txt數(shù)據(jù)：

young   myope   no reduced no lenses
young   myope   no normal soft
young   myope   yes reduced no lenses
young   myope   yes normal hard
young   hyper   no reduced no lenses
young   hyper   no normal soft
young   hyper   yes reduced no lenses
young   hyper   yes normal hard
pre myope   no reduced no lenses
pre myope   no normal soft
pre myope   yes reduced no lenses
pre myope   yes normal hard
pre hyper   no reduced no lenses
pre hyper   no normal soft
pre hyper   yes reduced no lenses
pre hyper   yes normal no lenses
presbyopic myope   no reduced no lenses
presbyopic myope   no normal no lenses
presbyopic myope   yes reduced no lenses
presbyopic myope   yes normal hard
presbyopic hyper   no reduced no lenses
presbyopic hyper   no normal soft
presbyopic hyper   yes reduced no lenses
presbyopic hyper   yes normal no lenses

In [40]: fr=open('machinelearninginaction/Ch03/lenses.txt')

In [41]: lenses=[inst.strip().split('\t') for inst in fr.readlines()]

In [42]: lensesLabels=['age','prescript','astigmatic','tearRate']

In [43]: lensesTree=trees.createTree(lenses,lensesLabels)

In [44]: lensesTree
Out[44]:
{'tearRate': {'normal': {'astigmatic': {'no': {'age': {'pre': 'soft',
      'presbyopic': {'prescript': {'hyper': 'soft', 'myope': 'no lenses'}},
      'young': 'soft'}},
    'yes': {'prescript': {'hyper': {'age': {'pre': 'no lenses',
        'presbyopic': 'no lenses',
        'young': 'hard'}},
      'myope': 'hard'}}}},
'reduced': 'no lenses'}}
In [45]: treeplotter.createPlot(lensesTree)

得到圖