Machine Learning: Logistic Regression and Its Python Implementation

2017-07-24

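Logistic regression is a generalized linear regression model. It constructs a regression function and fits it with machine learning techniques to perform classification or prediction.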

1 Overview of Logistic Regression

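The main idea of logistic regression is to fit a regression formula for the classification boundary from the existing data and use it to classify (usually into two classes). "Regression" here means finding the best-fit parameters. The mathematical principles and steps involved are:
(1) A suitable classification function is needed to perform the classification [unit step function, Sigmoid function].
(2) A loss function (Cost function) expresses the deviation (h−y) between the predicted value h(x) and the actual value y. For the best fit, this deviation, summed or averaged over the samples, should be as small as possible.
(3) Let J(ω) denote the deviation when the regression coefficients are ω. Finding the best regression parameters ω then becomes finding the minimum of J(ω) [gradient descent]. In the derivation below, J(w) is defined as the log-likelihood instead, so the equivalent problem is to maximize it by gradient ascent.
The following sections go through these steps in turn.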

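1.1 Classification function

Suppose we want binary classification. We can look for a function that, given the feature variables, outputs 0 or 1 and nothing else. Such a function jumps directly from 0 to 1 at some point, as the unit step function does.

A function like this is awkward to work with, because it is not differentiable at the jump. We therefore choose a continuous, differentiable function instead, the Sigmoid function:

$$h(x) = \frac{1}{1 + e^{-x}}$$

Its characteristic is that h(x)=0.5 when x=0. The larger x is, the closer h(x) gets to 1, and the smaller x is, the closer h(x) gets to 0. (Figure: plot of the Sigmoid function.)

The function looks much like a step function: when h(x)>0.5 (that is, x>0) the data point is assigned to class 1, and when h(x)<0.5 (that is, x<0) it is assigned to class 0.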
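
The behaviour described above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the sample inputs are arbitrary values chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

# Arbitrary illustrative inputs, symmetric around 0
xs = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
hs = sigmoid(xs)

for x, h in zip(xs, hs):
    # Class 1 when h(x) > 0.5 (equivalently x > 0), otherwise class 0
    print(f"x={x:+.1f}  h(x)={h:.4f}  class={int(h > 0.5)}")
```

The output values rise monotonically from near 0 to near 1 and cross 0.5 exactly at x=0, which is why 0.5 serves as the classification threshold.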

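With the classification function fixed, denote the input of the Sigmoid function by z:

$$z = w_0 x_0 + w_1 x_1 + \cdots + w_n x_n = w^T x$$

The vector x holds the feature variables (the input data), and the vector w holds the regression coefficients, one per feature.
What remains is to determine the best regression coefficients ω = (w0, w1, w2, ..., wn).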
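
As a small worked example of z = wTx followed by the Sigmoid, the sketch below classifies one sample. The coefficient and feature values are hypothetical, picked only to make the arithmetic easy to follow.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients (w0, w1, w2) and one sample; illustrative values only
w = np.array([1.0, 2.0, -1.0])
x = np.array([1.0, 0.5, 3.0])   # the leading 1.0 is the constant term x0

z = w @ x             # z = w0*x0 + w1*x1 + w2*x2 = 1 + 1 - 3 = -1
h = sigmoid(z)        # estimated P(y=1 | x, w)
label = int(h > 0.5)  # class 1 if h > 0.5, otherwise class 0
print(z, h, label)
```

Here z is negative, so h falls below 0.5 and the sample is assigned to class 0.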

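1.2 Cost function

We have

$$h_w(x) = g(w^T x) = \frac{1}{1 + e^{-w^T x}}$$

For any given x and w:

$$P(y=1 \mid x, w) = h_w(x), \qquad P(y=0 \mid x, w) = 1 - h_w(x)$$

These two cases can be written as a single expression:

$$P(y \mid x, w) = h_w(x)^{y} \, \bigl(1 - h_w(x)\bigr)^{1-y}$$

Take the likelihood function over the m samples:

$$L(w) = \prod_{i=1}^{m} h_w(x_i)^{y_i} \, \bigl(1 - h_w(x_i)\bigr)^{1-y_i}$$

Take the log-likelihood:

$$l(w) = \log L(w) = \sum_{i=1}^{m} \Bigl[ y_i \log h_w(x_i) + (1-y_i) \log\bigl(1 - h_w(x_i)\bigr) \Bigr]$$

This gives a function J(w) that expresses the deviation between the predicted and actual values, so the Cost function can also be written as:

$$J(w) = l(w) = \sum_{i=1}^{m} \Bigl[ y_i \log h_w(x_i) + (1-y_i) \log\bigl(1 - h_w(x_i)\bigr) \Bigr]$$

So J(w) can be used to express the deviation between the predicted and actual values, that is, as the Cost function. The next task is to make the deviation as small as possible, which means making J(w) as large as possible.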

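Question: why can J(w) express the difference between the predicted and actual values, and why does the largest J(w) mean the smallest deviation?

Go back to where J(w) came from:
P(y=1|x,w)=hw(x) and P(y=0|x,w)=1−hw(x).
Then clearly:
When z=wTx>0, the prediction is y=1 and 1/2<hw(x)<1, so P(y=1|x,w)=hw(x)>1/2.
When z=wTx<0, the prediction is y=0 and 0<hw(x)<1/2, so P(y=0|x,w)=1−hw(x)>1/2.
So whichever class is predicted, its probability P(y|x,w) is at least 1/2. The larger this probability (the closer to 1), the more likely the sample belongs to that class, the more accurate the classification, and the smaller the difference between the predicted and actual values.
P(y|x,w) can therefore express the difference between the predicted and actual values, with a larger P(y|x,w) meaning a smaller difference. Hence the larger the likelihood function J(w), the more accurate the prediction.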
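
To make the link between J(w) and prediction quality concrete, the sketch below evaluates the log-likelihood for two coefficient vectors. It assumes J(w) is the log-likelihood sum over samples; the data and both weight vectors are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y):
    """J(w) = sum_i [ y_i*log(h_i) + (1 - y_i)*log(1 - h_i) ], with h_i = sigmoid(w . x_i)."""
    h = sigmoid(X @ w)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Toy data (made up): the first column is the constant term x0 = 1
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])

w_good = np.array([0.0, 2.0])    # assigns every sample to its true class
w_bad = np.array([0.0, -2.0])    # assigns every sample to the wrong class

J_good = log_likelihood(w_good, X, y)
J_bad = log_likelihood(w_bad, X, y)
print(J_good, J_bad)
```

Both values are negative, since they are sums of log-probabilities, but the coefficients that classify correctly give a J(w) much closer to zero, which is the sense in which a larger J(w) means a smaller deviation.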

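The next task, then, is to find the w that maximizes J(w). The method is gradient ascent.

1.3 Finding the maximum of J(w) by gradient ascent

The core idea of gradient ascent is: to find the maximum of a function, search along the direction of its gradient. Writing the gradient as ∇, the gradient of a function f(x,y) is:

$$\nabla f(x, y) = \begin{pmatrix} \dfrac{\partial f(x, y)}{\partial x} \\[2mm] \dfrac{\partial f(x, y)}{\partial y} \end{pmatrix}$$

In gradient ascent, each step moves in the direction in which the function increases fastest (the direction of the gradient). If the size of the move is α (the step size), the iteration formula of gradient ascent is:

$$w := w + \alpha \nabla_w f(w)$$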
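
The iteration can be tried on a simple function before applying it to J(w). This is a sketch on a hypothetical concave function f(x, y) = -(x-1)^2 - (y+2)^2, whose maximum is at (1, -2); the step size and starting point are arbitrary choices.

```python
import numpy as np

def grad_f(p):
    """Gradient of the illustrative function f(x, y) = -(x - 1)**2 - (y + 2)**2."""
    x, y = p
    return np.array([-2.0 * (x - 1.0), -2.0 * (y + 2.0)])

alpha = 0.1                 # step size
p = np.array([0.0, 0.0])    # starting point
for _ in range(100):
    p = p + alpha * grad_f(p)   # move along the gradient, i.e. uphill

print(p)   # approaches the maximizer (1, -2)
```

Each step shrinks the distance to the maximum by a constant factor here, because the gradient is proportional to that distance.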

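The problem becomes:

$$w := w + \alpha \nabla_w J(w)$$

First, take the partial derivative of J(w) with respect to w_j, writing z_i = w^T x_i so that h_w(x_i) = g(z_i):

$$
\begin{aligned}
\frac{\partial J(w)}{\partial w_j}
&= \frac{\partial}{\partial w_j} \sum_{i=1}^{m} \Bigl[ y_i \log h_w(x_i) + (1-y_i) \log\bigl(1 - h_w(x_i)\bigr) \Bigr] \\
&= \sum_{i=1}^{m} \Bigl[ \frac{y_i}{h_w(x_i)} \frac{\partial h_w(x_i)}{\partial w_j} - \frac{1-y_i}{1-h_w(x_i)} \frac{\partial h_w(x_i)}{\partial w_j} \Bigr] \\
&= \sum_{i=1}^{m} \Bigl( \frac{y_i}{h_w(x_i)} - \frac{1-y_i}{1-h_w(x_i)} \Bigr) \frac{\partial h_w(x_i)}{\partial w_j} \\
&= \sum_{i=1}^{m} \Bigl( \frac{y_i}{g(z_i)} - \frac{1-y_i}{1-g(z_i)} \Bigr) \frac{\partial g(z_i)}{\partial w_j} \\
&= \sum_{i=1}^{m} \Bigl( \frac{y_i}{g(z_i)} - \frac{1-y_i}{1-g(z_i)} \Bigr) g(z_i)\bigl(1-g(z_i)\bigr) x_{ij} \\
&= \sum_{i=1}^{m} \Bigl( y_i \bigl(1-g(z_i)\bigr) - (1-y_i)\, g(z_i) \Bigr) x_{ij} \\
&= \sum_{i=1}^{m} \bigl( y_i - h_w(x_i) \bigr) x_{ij}
\end{aligned}
$$

The step from the fourth line to the fifth uses the formulas:

$$g'(z) = g(z)\bigl(1 - g(z)\bigr), \qquad \frac{\partial z_i}{\partial w_j} = x_{ij}$$

Substituting the partial derivative into the gradient ascent iteration formula:

$$w_j := w_j + \alpha \sum_{i=1}^{m} \bigl( y_i - h_w(x_i) \bigr) x_{ij}$$

Every function and input value in this expression is now known, so logistic regression can be implemented in Python.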
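
One way to gain confidence in the gradient before coding the update is to compare it with a finite-difference estimate. The sketch below assumes J(w) is the log-likelihood and that its gradient in vector form is X^T (y - h); the random data, seed and step eps are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def J(w, X, y):
    """Log-likelihood of the labels y under coefficients w."""
    h = sigmoid(X @ w)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def grad_J(w, X, y):
    """Analytic gradient in vector form: X^T (y - h)."""
    return X.T @ (y - sigmoid(X @ w))

# Random illustrative data: 20 samples, a constant column plus 2 features
rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])
y = (rng.random(20) < 0.5).astype(float)
w = rng.normal(size=3) * 0.1

# Central finite differences, one coordinate direction at a time
eps = 1e-6
numeric = np.array([
    (J(w + eps * e, X, y) - J(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
analytic = grad_J(w, X, y)
print(np.max(np.abs(numeric - analytic)))
```

If the two gradients agree to within rounding error, the update weights + alpha * X^T (y - h) is moving in the direction of steepest ascent of J(w).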

2 Python Implementation

2.1 Finding the best regression coefficients by gradient ascent

The data comes from the book "Machine Learning in Action". Part of it is shown below:

-0.017612   14.053064   0
-1.395634   4.662541    1
-0.752157   6.538620    0
-1.322371   7.152853    0
0.423363    11.054677   0
0.406704    7.067335    1

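First define a function to load the data, then define the Sigmoid classification function, and finally solve for the regression coefficients w by gradient ascent.
Create a file logRegres.py and enter the following code: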

from numpy import *

# Load the data set
def loadDataSet():
    dataMat=[];labelMat=[]
    with open('machinelearninginaction/Ch05/testSet.txt') as fr: # the file is closed on exit
        for line in fr:
            lineArr=line.strip().split()
            dataMat.append([1.0,float(lineArr[0]),float(lineArr[1])]) # features; the leading 1.0 is the constant term x0
            labelMat.append(int(lineArr[-1])) # class labels
    return dataMat,labelMat

def sigmoid(inX):
    return 1/(1+exp(-inX))

def gradAscent(dataMatIn,classLabels):
    dataMatrix=asmatrix(dataMatIn) # (m,n); asmatrix is used because NumPy 2.0 removed the mat alias
    labelMat=asmatrix(classLabels).transpose() # (m,1) after transposing
    m,n=shape(dataMatrix)
    weights=ones((n,1)) # initial regression coefficients, (n,1)
    alpha=0.001 # step size
    maxCycles=500 # maximum number of iterations
    for i in range(maxCycles):
        h=sigmoid(dataMatrix * weights) # matrix product, then element-wise sigmoid, (m,1)
        error=labelMat - h # y-h, (m,1)
        weights=weights + alpha * dataMatrix.transpose() * error # gradient ascent update
    return weights

Test the functions at the Python prompt:
In [8]: import logRegres
   ...:
In [9]: dataArr,labelMat=logRegres.loadDataSet()
   ...:
In [10]: logRegres.gradAscent(dataArr,labelMat)
    ...:
Out[10]:
matrix([[ 4.12414349],
        [ 0.48007329],
        [-0.6168482 ]])

This gives the regression coefficients. Next, use them to draw the decision boundary wTx=0.
Define the plotting function:

def plotBestFit(weights):
    import matplotlib.pyplot as plt
    dataMat,labelMat=loadDataSet()
    n=shape(dataMat)[0]
    xcord1=[];ycord1=[]
    xcord2=[];ycord2=[]
    for i in range(n):
        if labelMat[i]==1:
            xcord1.append(dataMat[i][1])
            ycord1.append(dataMat[i][2])
        else:
            xcord2.append(dataMat[i][1])
            ycord2.append(dataMat[i][2])
    fig=plt.figure()
    ax=fig.add_subplot(111)
    ax.scatter(xcord1,ycord1,s=30,c='red',marker='s')
    ax.scatter(xcord2,ycord2,s=30,c='green')
    x=arange(-3,3,0.1)
    y=(-weights[0,0]-weights[1,0]*x)/weights[2,0] # weights is an (n,1) matrix; solves w0+w1*x1+w2*x2=0 for x2
    ax.plot(x,y)
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.show()

Test the function in the Python shell:

In [11]: weights=logRegres.gradAscent(dataArr,labelMat)

In [12]: logRegres.plotBestFit(weights)
...:
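
The decision boundary wTx=0 can also be checked numerically without plotting. The sketch below reuses the coefficients printed by gradAscent above and the six sample rows listed earlier; using only these six rows is an illustrative shortcut, not an evaluation on the full data set.

```python
import numpy as np

# Coefficients copied from the gradAscent output shown earlier
w = np.array([4.12414349, 0.48007329, -0.6168482])

# The six sample rows listed earlier: x1, x2, label
rows = np.array([
    [-0.017612, 14.053064, 0],
    [-1.395634, 4.662541, 1],
    [-0.752157, 6.538620, 0],
    [-1.322371, 7.152853, 0],
    [0.423363, 11.054677, 0],
    [0.406704, 7.067335, 1],
])
X = np.column_stack([np.ones(len(rows)), rows[:, :2]])  # prepend x0 = 1
y = rows[:, 2].astype(int)

z = X @ w
pred = (z > 0).astype(int)   # z > 0 is equivalent to sigmoid(z) > 0.5
# x2 value of the boundary line at each sample's x1
boundary_x2 = (-w[0] - w[1] * rows[:, 0]) / w[2]
print(pred, y)
print(boundary_x2)
```

A point whose x2 lies below the boundary value at its x1 has z > 0 and is assigned to class 1, because the coefficient on x2 is negative.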

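2.2 Algorithm improvements

(1) Stochastic gradient ascent
The algorithm above runs maxCycles iterations, and each iteration performs on the order of m*n multiplications in the matrix products, so the time complexity (cost) is maxCycles*m*n. With a large data set this becomes expensive, so stochastic gradient ascent can be used to improve the algorithm.

The idea of stochastic gradient ascent is to update the regression coefficients using only one sample point at a time, which greatly reduces the computational cost.
The code is as follows: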

def stocGradAscent(dataMatrix,classLabels):
    m,n=shape(dataMatrix)
    alpha=0.01
    weights=ones(n)
    for i in range(m):
        h=sigmoid(sum(dataMatrix[i] * weights)) # scalar: element-wise product, then sum
        error = classLabels[i]-h
        weights=weights + alpha * error * dataMatrix[i] # element-wise on arrays, unlike the matrix product in gradAscent
    return weights

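Note: h and weights are computed differently in gradAscent and in stocGradAscent, because
the former works on matrices: the operands are NumPy matrix objects, so * follows the rules of matrix multiplication.
The latter works on numeric values: the operands are NumPy arrays (the caller passes array(dataMat)), so * is applied element by element.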
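
The difference between the two kinds of * can be seen on a tiny example. This is a sketch with made-up values; it uses np.asmatrix to build the matrix objects.

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])   # ndarray
m = np.asmatrix(a)                        # NumPy matrix holding the same data
v = np.array([10.0, 20.0])

elementwise = a[0] * v           # ndarray * ndarray: element by element
dot = np.sum(a[0] * v)           # summing the products gives the inner product
matprod = m * np.asmatrix(v).T   # matrix * matrix: true matrix product, shape (2, 1)

print(elementwise)   # [10. 40.]
print(dot)           # 50.0
print(matprod)       # a column with entries 50 and 110
```

This is why stocGradAscent wraps the product in sum(), while gradAscent can write dataMatrix * weights directly.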

Test the stochastic gradient ascent algorithm:

In [37]: dataMat,labelMat=logRegres.loadDataSet()
    ...:
In [38]: weights=logRegres.stocGradAscent(array(dataMat),labelMat)
    ...:
In [39]: logRegres.plotBestFit(mat(weights).transpose())
    ...:

(Figure: the sample points and the decision boundary produced by stochastic gradient ascent.)

(2) Improved stochastic gradient ascent

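def stocGradAscent1(dataMatrix,classLabels,numIter=150):
    m,n=shape(dataMatrix)
    weights=ones(n)
    for j in range(numIter):
        dataIndex=list(range(m)) # samples not yet used in this pass
        for i in range(m):
            alpha=4/(1+i+j)+0.01 # step size decays but stays above 0.01, so later samples still have some influence
            randIndex=int(random.uniform(0,len(dataIndex))) # random position, to reduce periodic fluctuations
            sampleIndex=dataIndex[randIndex] # look up the actual sample so each one is used once per pass
            h=sigmoid(sum(dataMatrix[sampleIndex] * weights))
            error=classLabels[sampleIndex]-h
            weights=weights + alpha*dataMatrix[sampleIndex]*error
            del(dataIndex[randIndex])
    return weights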
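
Two ingredients of stocGradAscent1 can be looked at in isolation: the decaying step size and the random visiting order. The sketch below evaluates the step size formula at a few points and shows a permutation-based way to visit every sample once per pass. The permutation approach is a different technique from the index-list deletion used above, shown only as an alternative; the helper name step_size and the seed are hypothetical.

```python
import numpy as np

def step_size(j, i):
    """Step size formula from stocGradAscent1: decays with i + j, never below 0.01."""
    return 4.0 / (1.0 + i + j) + 0.01

# Start of training, end of the first pass over 100 samples, end of pass 150
print(step_size(0, 0), step_size(0, 99), step_size(149, 99))

# Alternative to deleting from an index list: shuffle the indices once per pass
rng = np.random.default_rng(0)   # seed chosen arbitrarily for reproducibility
m = 10
order = rng.permutation(m)       # every index 0..m-1 appears exactly once
print(order)
```

The constant 0.01 keeps the step size from reaching zero, so samples seen late in training still move the coefficients a little.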

Test the function at the Python prompt and plot the classification boundary:

In [188]: weights=logRegres.stocGradAscent1(array(dataMat),labelMat)
...:
In [189]: logRegres.plotBestFit(mat(weights).transpose())
...:

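(3) Fluctuation of the regression coefficients under the three methods
Ordinary gradient ascent:

Stochastic gradient ascent:

Improved stochastic gradient ascent:

An algorithm is judged by whether it converges, that is, whether the coefficients settle at stable values. The faster it converges, the better the algorithm.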
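
Coefficient fluctuation can be measured by recording the weights after every update. The following is a minimal sketch for the batch method, assuming synthetic two-class data rather than the book's testSet.txt; the function name grad_ascent_history is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_ascent_history(X, y, alpha=0.001, max_cycles=500):
    """Batch gradient ascent that also returns the weights after every iteration."""
    w = np.ones(X.shape[1])
    history = np.empty((max_cycles, X.shape[1]))
    for k in range(max_cycles):
        w = w + alpha * X.T @ (y - sigmoid(X @ w))
        history[k] = w
    return w, history

# Synthetic two-class data (an assumption, not the book's data set)
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[-1.0, -1.0], size=(50, 2))
X1 = rng.normal(loc=[1.0, 1.0], size=(50, 2))
X = np.hstack([np.ones((100, 1)), np.vstack([X0, X1])])
y = np.concatenate([np.zeros(50), np.ones(50)])

w, history = grad_ascent_history(X, y)

# Largest per-iteration change in any coefficient, early vs. late in training
early = np.abs(np.diff(history[:50], axis=0)).max()
late = np.abs(np.diff(history[-50:], axis=0)).max()
print(early, late)
```

Plotting each column of history against the iteration number gives the kind of coefficient curve this subsection compares. The same recording idea carries over to the stochastic variants by storing the weights after each sample update.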

3 Example

3.1 Predicting horse mortality from colic symptoms with logistic regression

The data is the horse colic data set from the book "Machine Learning in Action". Part of it is shown below:

2.000000    1.000000    38.500000   66.000000   28.000000   3.000000    3.000000    0.000000    2.000000    5.000000    4.000000    4.000000    0.000000    0.000000    0.000000    3.000000    5.000000    45.000000   8.400000    0.000000    0.000000    0.000000
1.000000    1.000000    39.200000   88.000000   20.000000   0.000000    0.000000    4.000000    1.000000    3.000000    4.000000    2.000000    0.000000    0.000000    0.000000    4.000000    2.000000    50.000000   85.000000   2.000000    2.000000    0.000000
2.000000    1.000000    38.300000   40.000000   24.000000   1.000000    1.000000    3.000000    1.000000    3.000000    3.000000    1.000000    0.000000    0.000000    0.000000    1.000000    1.000000    33.000000   6.700000    0.000000    0.000000    1.000000

The 21 features are used to classify and predict the outcome.

# Classifier: return 1 if prob>0.5, otherwise 0
def classifyVector(inX,trainWeights):
    prob=sigmoid(sum(inX*trainWeights))
    if prob>0.5:return 1
    else : return 0

def colicTest():
    trainSet=[];trainLabels=[]
    with open('machinelearninginaction/Ch05/horseColicTraining.txt') as frTrain: # training data
        for line in frTrain:
            currLine=line.strip().split('\t')
            trainSet.append([float(v) for v in currLine[:21]]) # the 21 features
            trainLabels.append(float(currLine[21])) # class label
    trainWeights=stocGradAscent1(array(trainSet),trainLabels,500) # improved stochastic gradient ascent
    errorCount=0;numTestVec=0
    with open('machinelearninginaction/Ch05/horseColicTest.txt') as frTest: # test data
        for line in frTest:
            numTestVec+=1
            currLine=line.strip().split('\t')
            lineArr=[float(v) for v in currLine[:21]]
            # int(float(...)) also accepts labels written as 1.000000
            if classifyVector(array(lineArr),trainWeights)!=int(float(currLine[21])):
                errorCount+=1
    errorRate=(float(errorCount)/numTestVec)
    print('the error rate of this test is :%f'%errorRate)
    return errorRate

def multiTest(): # run the test several times and average the error rate
    numTests=10;errorSum=0
    for k in range(numTests):
        errorSum+=colicTest()
    print('after %d iterations the average error rate is:%f'%(numTests,errorSum/float(numTests)))

Test the function at the console prompt:

In [3]: logRegres.multiTest()
G:\Workspaces\MachineLearning\logRegres.py:19: RuntimeWarning: overflow encountered in exp
return 1/(1+exp(-inX))
the error rate of this test is :0.313433
the error rate of this test is :0.268657
the error rate of this test is :0.358209
the error rate of this test is :0.447761
the error rate of this test is :0.298507
the error rate of this test is :0.373134
the error rate of this test is :0.358209
the error rate of this test is :0.417910
the error rate of this test is :0.432836
the error rate of this test is :0.417910
after 10 iterations the average error rate is:0.368657

The average classification error rate is 36.9%.
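
The RuntimeWarning in the output comes from exp(-inX) overflowing when inX is a large negative number. One common way to avoid it is to evaluate the Sigmoid with a different but algebraically equal expression for negative inputs. The sketch below is one such variant; the name stable_sigmoid is hypothetical, and the function is written for array inputs.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid for array inputs that avoids overflow in exp() for large |z|."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so 1 / (1 + exp(-z)) cannot overflow
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use the equal form exp(z) / (1 + exp(z)), where exp(z) < 1
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))
```

The classification results are unchanged, since the two expressions are mathematically identical; only the overflow in the intermediate exp() call is avoided.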
