
Linear Regression Models in scikit-learn

2016-05-05


Feature selection methods

As supervised learning tasks, classification predicts a categorical outcome, whereas regression predicts a continuous outcome.

1. Reading the data with pandas

pandas is a Python library for data exploration, data processing, and data analysis.

In [1]:

import pandas as pd

In [2]:

# read the CSV file directly from a URL and save the results
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)

# display the first 5 rows
data.head()

Out[2]:

	TV	Radio	Newspaper	Sales
1	230.1	37.8	69.2	22.1
2	44.5	39.3	45.1	10.4
3	17.2	45.9	69.3	9.3
4	151.5	41.3	58.5	18.5
5	180.8	10.8	58.4	12.9

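The result above resembles a spreadsheet; this structure is called a pandas DataFrame.

pandas has two main data structures, Series and DataFrame:

A Series is similar to a one-dimensional array. It consists of a set of data together with an associated set of data labels (the index).

A DataFrame is a tabular data structure that contains an ordered collection of columns, each of which may hold a different value type. A DataFrame has both a row index and a column index, and it can be thought of as a dictionary of Series.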
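The relationship between the two structures can be sketched with a few values copied from the table above. The variable names `sales`, `df`, and `tv` are illustrative only and do not appear in the original notebook.

```python
import pandas as pd

# A Series: one-dimensional data plus an index of labels.
# The values are the first three Sales figures shown above.
sales = pd.Series([22.1, 10.4, 9.3], index=[1, 2, 3], name="Sales")

# A DataFrame: an ordered set of columns that share one row index.
df = pd.DataFrame(
    {"TV": [230.1, 44.5, 17.2], "Radio": [37.8, 39.3, 45.9]},
    index=[1, 2, 3],
)

# Selecting one column of a DataFrame returns a Series,
# which is why a DataFrame can be viewed as a dictionary of Series.
tv = df["TV"]
print(type(tv).__name__)   # Series
print(list(df.columns))    # ['TV', 'Radio']
print(sales.loc[2])        # 10.4
```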

In [3]:

# display the last 5 rows
data.tail()

Out[3]:

	TV	Radio	Newspaper	Sales
196	38.2	3.7	13.8	7.6
197	94.2	4.9	8.1	9.7
198	177.0	9.3	6.4	12.8
199	283.6	42.0	66.2	25.5
200	232.1	8.6	8.7	13.4

In [4]:

# check the shape of the DataFrame (rows, columns)
data.shape

Out[4]:

(200, 4)

Features:

TV: advertising spending on television for a single product in a given market (in thousands of dollars)

Radio: advertising spending on radio

Newspaper: advertising spending on newspapers

Response:

Sales: sales of the corresponding product

In this case we predict product sales from the different advertising expenditures. Because the response variable is continuous, this is a regression problem. The dataset contains 200 observations, and each observation corresponds to one market.

In [5]:

import seaborn as sns
%matplotlib inline

In [6]:

# visualize the relationship between the features and the response using scatterplots
# (the `height` parameter was called `size` in seaborn versions before 0.9)
sns.pairplot(data, x_vars=['TV', 'Radio', 'Newspaper'], y_vars='Sales', height=7, aspect=0.8)

Out[6]:

<seaborn.axisgrid.PairGrid at 0x82dd890>

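seaborn's pairplot function draws a scatterplot of each dimension of X against the corresponding Y. The size and aspect parameters control the size and aspect ratio of each plot (size was renamed height in seaborn 0.9). The plots show that TV has a fairly strong linear relationship with Sales, Radio a weaker one, and Newspaper a weaker one still. Passing the argument kind='reg' makes seaborn add a line of best fit and a 95% confidence band.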

In [7]:

# (the `height` parameter was called `size` in seaborn versions before 0.9)
sns.pairplot(data, x_vars=['TV', 'Radio', 'Newspaper'], y_vars='Sales', height=7, aspect=0.8, kind='reg')

Out[7]:

<seaborn.axisgrid.PairGrid at 0x83b76f0>
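The visual impression of stronger and weaker linear relationships can also be expressed as a number, for example with the Pearson correlation coefficient. The sketch below is an addition, not part of the original notebook, and it uses only the ten rows printed by data.head() and data.tail() above. That is far too few rows for a reliable estimate, so it illustrates the call rather than reproducing a result. On the full dataset the equivalent call would be data.corr().

```python
import pandas as pd

# The ten rows printed by data.head() and data.tail() above.
# Ten rows are too few for reliable estimates; this only illustrates the call.
sample = pd.DataFrame(
    {
        "TV": [230.1, 44.5, 17.2, 151.5, 180.8, 38.2, 94.2, 177.0, 283.6, 232.1],
        "Radio": [37.8, 39.3, 45.9, 41.3, 10.8, 3.7, 4.9, 9.3, 42.0, 8.6],
        "Newspaper": [69.2, 45.1, 69.3, 58.5, 58.4, 13.8, 8.1, 6.4, 66.2, 8.7],
        "Sales": [22.1, 10.4, 9.3, 18.5, 12.9, 7.6, 9.7, 12.8, 25.5, 13.4],
    }
)

# Pearson correlation of every feature with the response
corr_with_sales = sample.corr()["Sales"].drop("Sales")
print(corr_with_sales)
```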

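2. The linear regression model

Advantages: fast; no tuning parameters; easy to interpret; easy to understand

Disadvantages: compared with more complex models, its predictive accuracy is usually lower, because it assumes a linear relationship between the features and the response. When the true relationship is nonlinear, a linear regression model cannot model the data well.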

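The linear model is written as $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$, where

$y$ is the response

$\beta_0$ is the intercept

$\beta_1$ is the coefficient of $x_1$, and so on

In this case: $y = \beta_0 + \beta_1 \times TV + \beta_2 \times Radio + \beta_3 \times Newspaper$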

(1) Building X and y with pandas

scikit-learn expects X to be a feature matrix and y to be a NumPy vector.

pandas is built on top of NumPy.

Therefore X can be a pandas DataFrame and y can be a pandas Series; scikit-learn understands both structures.
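A small sketch of what "built on top of NumPy" means in practice: a DataFrame and a Series each expose their contents as a NumPy array with the shape scikit-learn expects. The two rows are copied from the table above, and the names `df`, `X_array`, and `y_array` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"TV": [230.1, 44.5], "Radio": [37.8, 39.3], "Sales": [22.1, 10.4]})

X = df[["TV", "Radio"]]   # DataFrame: two-dimensional feature matrix
y = df["Sales"]           # Series: one-dimensional response vector

X_array = X.to_numpy()    # NumPy array of shape (n_samples, n_features)
y_array = y.to_numpy()    # NumPy array of shape (n_samples,)

print(type(X_array).__name__, X_array.shape)   # ndarray (2, 2)
print(type(y_array).__name__, y_array.shape)   # ndarray (2,)
```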

In [8]:

# create a Python list of feature names
feature_cols = ['TV', 'Radio', 'Newspaper']

# use the list to select a subset of the original DataFrame
X = data[feature_cols]

# equivalent command to do this in one line
X = data[['TV', 'Radio', 'Newspaper']]

# print the first 5 rows
X.head()

Out[8]:

	TV	Radio	Newspaper
1	230.1	37.8	69.2
2	44.5	39.3	45.1
3	17.2	45.9	69.3
4	151.5	41.3	58.5
5	180.8	10.8	58.4

In [9]:

# check the type and shape of X
print(type(X))
print(X.shape)

<class 'pandas.core.frame.DataFrame'>
(200, 3)

In [10]:

# select a Series from the DataFrame
y = data['Sales']

# equivalent command that works if there are no spaces in the column name
y = data.Sales

# print the first 5 values
y.head()

Out[10]:

1    22.1
2    10.4
3     9.3
4    18.5
5    12.9
Name: Sales, dtype: float64

In [11]:

print(type(y))
print(y.shape)

<class 'pandas.core.series.Series'>
(200,)

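(2) Building the training and test sets

In [12]:

# train_test_split lives in sklearn.model_selection since scikit-learn 0.18;
# the older sklearn.cross_validation module was removed in 0.20
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [14]:

# default split is 75% for training and 25% for testing
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(150, 3)
(150,)
(50, 3)
(50,)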
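The split proportion can be changed with the test_size argument. The sketch below uses a synthetic array of the same shape as the Advertising data (200 rows, 3 features), not the data itself, and the names `X_demo` and `y_demo` are illustrative only. It assumes a scikit-learn version that provides sklearn.model_selection.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the Advertising data:
# row i of X_demo is [3*i, 3*i + 1, 3*i + 2] and y_demo[i] is i.
X_demo = np.arange(600).reshape(200, 3)
y_demo = np.arange(200)

# Default: 25% of the rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)
print(X_tr.shape, X_te.shape)     # (150, 3) (50, 3)

# Explicit 80/20 split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)
print(X_tr2.shape, X_te2.shape)   # (160, 3) (40, 3)

# Rows of X and entries of y are shuffled together, so they stay aligned
print(bool((X_tr[:, 0] // 3 == y_tr).all()))   # True
```

Fixing random_state makes the shuffle reproducible, which is why the notebook obtains the same split every time it is run.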

(3) Linear regression in scikit-learn

In [15]:

from sklearn.linear_model import LinearRegression

linreg = LinearRegression()
linreg.fit(X_train, y_train)

Out[15]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [16]:

print(linreg.intercept_)
print(linreg.coef_)

2.87696662232
[ 0.04656457 0.17915812 0.00345046]

In [17]:

# pair the feature names with the coefficients
list(zip(feature_cols, linreg.coef_))

Out[17]:

[('TV', 0.046564567874150253), ('Radio', 0.17915812245088836), ('Newspaper', 0.0034504647111804482)]

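$y = 2.88 + 0.0466 \times TV + 0.179 \times Radio + 0.00345 \times Newspaper$

How should the coefficient of each feature be interpreted?

With the Radio and Newspaper spending held fixed, each additional unit spent on TV advertising is associated with an increase of 0.0466 units in sales.

More concretely, if spending on the other two media is held fixed, each additional 1,000 dollars spent on TV advertising (the spending unit is 1,000 dollars) is associated with an increase in sales of 46.6 items (the sales unit is 1,000 items).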
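This interpretation can be checked by evaluating the fitted equation directly. The sketch below copies the intercept and coefficients printed above and applies them to the first row of the dataset; the helper name `predict_sales` is hypothetical and not part of scikit-learn.

```python
import numpy as np

# Intercept and coefficients as printed by linreg.intercept_ and linreg.coef_ above
intercept = 2.87696662232
coef = np.array([0.04656457, 0.17915812, 0.00345046])  # TV, Radio, Newspaper

def predict_sales(tv, radio, newspaper):
    """Evaluate the fitted linear model for one market."""
    return intercept + coef @ np.array([tv, radio, newspaper])

# First row of the dataset: TV=230.1, Radio=37.8, Newspaper=69.2
base = predict_sales(230.1, 37.8, 69.2)

# Same market with one more unit (1,000 dollars) of TV spending
bumped = predict_sales(231.1, 37.8, 69.2)

print(round(base, 2))
print(round(bumped - base, 4))   # 0.0466, the TV coefficient
```

The difference between the two predictions equals the TV coefficient regardless of the Radio and Newspaper values chosen, which is what "holding the other features fixed" means for a linear model.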

(4) Prediction

In [18]:

y_pred = linreg.predict(X_test)

3. Evaluation metrics for regression problems

For classification problems the usual evaluation metric is accuracy, but accuracy does not apply to regression problems. Instead we use evaluation metrics designed for continuous values.

Three common evaluation metrics for regression problems are introduced below.

In [21]:

# define true and predicted response values
true = [100, 50, 30, 20]
pred = [90, 50, 50, 30]

(1) Mean Absolute Error (MAE)

$\frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$

(2) Mean Squared Error (MSE)

$\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$

(3) Root Mean Squared Error (RMSE)

$\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$

In [24]:

from sklearn import metrics
import numpy as np

# calculate MAE by hand
print("MAE by hand:", (10 + 0 + 20 + 10) / 4.)

# calculate MAE using scikit-learn
print("MAE:", metrics.mean_absolute_error(true, pred))

# calculate MSE by hand
print("MSE by hand:", (10**2 + 0**2 + 20**2 + 10**2) / 4.)

# calculate MSE using scikit-learn
print("MSE:", metrics.mean_squared_error(true, pred))

# calculate RMSE by hand
print("RMSE by hand:", np.sqrt((10**2 + 0**2 + 20**2 + 10**2) / 4.))

# calculate RMSE using scikit-learn
print("RMSE:", np.sqrt(metrics.mean_squared_error(true, pred)))

MAE by hand: 10.0
MAE: 10.0
MSE by hand: 150.0
MSE: 150.0
RMSE by hand: 12.2474487139
RMSE: 12.2474487139
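The three formulas can also be written as short vectorized NumPy functions, which works for inputs of any length. This is a sketch for illustration; the function names `mae`, `mse`, and `rmse` are not scikit-learn APIs.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average of |y_i - y_hat_i|."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def mse(y_true, y_pred):
    """Mean squared error: average of (y_i - y_hat_i)^2."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the MSE."""
    return np.sqrt(mse(y_true, y_pred))

true = [100, 50, 30, 20]
pred = [90, 50, 50, 30]

print(mae(true, pred))    # 10.0
print(mse(true, pred))    # 150.0
print(rmse(true, pred))   # approximately 12.2474
```

Because the errors are squared before averaging, MSE and RMSE weight the single error of 20 more heavily than MAE does. RMSE is expressed in the same units as the response, which makes it easier to interpret than MSE.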

Computing the RMSE of the Sales predictions

In [26]:

print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

1.40465142303

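4. Feature selection

In the plots shown earlier, the linear relationship between Newspaper and Sales was fairly weak. We now remove this feature and check how the RMSE of the linear regression predictions changes.

In [27]:

feature_cols = ['TV', 'Radio']
X = data[feature_cols]
y = data.Sales
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
linreg.fit(X_train, y_train)
y_pred = linreg.predict(X_test)
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

1.38790346994

After removing the Newspaper feature, the RMSE decreased from 1.405 to 1.388, which suggests that Newspaper is not a useful feature for predicting sales, so we adopt the new model. Other feature combinations can be evaluated in the same way to see how the final error changes.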
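Trying other feature combinations can be automated with a loop over all subsets. The sketch below is an illustration under stated assumptions: it runs on synthetic data generated here (not the Advertising dataset), in which Sales depends on TV and Radio but not on Newspaper by construction, and the helper name `holdout_rmse` is hypothetical.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the Advertising dataset): the response depends
# on TV and Radio plus noise, while Newspaper is unrelated by construction.
rng = np.random.default_rng(0)
n = 200
features = {
    "TV": rng.uniform(0, 300, n),
    "Radio": rng.uniform(0, 50, n),
    "Newspaper": rng.uniform(0, 100, n),
}
y = 3.0 + 0.05 * features["TV"] + 0.2 * features["Radio"] + rng.normal(0, 1.5, n)

def holdout_rmse(cols):
    """Fit on the training split using only `cols`; return the test-set RMSE."""
    X = np.column_stack([features[c] for c in cols])
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    model = LinearRegression().fit(X_train, y_train)
    return np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

# Evaluate every non-empty subset of the three features
results = {}
for k in range(1, len(features) + 1):
    for cols in combinations(features, k):
        results[cols] = holdout_rmse(cols)

# Print the subsets from lowest to highest test RMSE
for cols, score in sorted(results.items(), key=lambda item: item[1]):
    print(cols, round(score, 3))
```

A single train/test split gives a noisy estimate, so small differences in RMSE between subsets should be read with caution.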
