Data Analysis Example: Classifying Spam Email with R
2017-07-07

Structure of a Data Analysis

The steps of a data analysis

- Define the question
- Define the ideal data set
- Determine what data you can access
- Obtain the data
- Clean the data
- Exploratory data analysis
- Statistical prediction/model
- Interpret results
- Challenge results
- Synthesize/write up results
- Create reproducible code

2   A sample

1)    The question

Can I automatically detect emails that are SPAM or not?

2)    Making the question concrete

Can I use quantitative characteristics of the emails to classify them as SPAM/HAM?

3)    Obtaining the data

http://search.r-project.org/library/kernlab/html/spam.html

4)    Subsampling

# If kernlab isn't installed, install it first with install.packages("kernlab").

library(kernlab)

data(spam)


# Perform the subsampling

set.seed(3435)

trainIndicator = rbinom(4601, size = 1, prob = 0.5)

table(trainIndicator)


trainSpam = spam[trainIndicator == 1, ]

testSpam = spam[trainIndicator == 0, ]
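
The split above assigns each row to the training set by an independent coin flip, so the two subsets are only roughly equal in size. The following is a minimal sketch on a hypothetical toy data frame (`toy`, not the spam data) that checks such a split is disjoint and covers every row:

```r
# Toy stand-in for the spam data (hypothetical; 4601 rows to match the size used above)
set.seed(3435)
n <- 4601
toy <- data.frame(id = seq_len(n), x = runif(n))

# Coin-flip assignment: 1 = training, 0 = test
trainIndicator <- rbinom(n, size = 1, prob = 0.5)
trainToy <- toy[trainIndicator == 1, ]
testToy  <- toy[trainIndicator == 0, ]

# Every row lands in exactly one subset
nrow(trainToy) + nrow(testToy) == n          # TRUE
length(intersect(trainToy$id, testToy$id))   # 0
```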

5)    Exploratory analysis

a)      Names: inspect the column names

names(trainSpam)

b)      Head: view the first six rows

head(trainSpam)

c)       Summaries: tabulate the class counts

table(trainSpam$type)

d)      Plots: plot the distribution for spam and non-spam email

plot(trainSpam$capitalAve ~ trainSpam$type)

The two distributions are hard to tell apart in the plot above, so we take logs and look again.

plot(log10(trainSpam$capitalAve + 1) ~ trainSpam$type)
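
The `+ 1` inside `log10()` matters when a variable can be zero, and the log itself compresses a long right tail. A small sketch on simulated values (an assumption for illustration, not the spam data) shows both effects:

```r
# Simulated right-skewed, non-negative values (an assumption, not the spam data)
set.seed(1)
x <- c(0, rlnorm(999, meanlog = 1, sdlog = 1.5))

# log10(0) is -Inf, so the +1 shift keeps zeros finite (they map to 0)
log10(0)        # -Inf
log10(0 + 1)    # 0

# The transform compresses the long right tail
range(x)
range(log10(x + 1))
```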

e)      Look for relationships among the predictors

plot(log10(trainSpam[, 1:4] + 1))

f)        Try hierarchical clustering

hCluster = hclust(dist(t(trainSpam[, 1:57])))

plot(hCluster)

The dendrogram is too cluttered to reveal anything. As before, try a log transform.

hClusterUpdated = hclust(dist(t(log10(trainSpam[, 1:55] + 1))))

plot(hClusterUpdated)
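
Because the data frame is transposed with `t()` before `dist()`, the clustering above groups variables (columns), not emails (rows). A minimal sketch on simulated columns (all names and values here are hypothetical) makes that visible:

```r
# Simulated data: v1 and v2 are built to be close, v3 and v4 sit far from them
set.seed(42)
base <- rnorm(200)
toy <- data.frame(
  v1 = base,
  v2 = base + rnorm(200, sd = 0.1),
  v3 = rnorm(200, mean = 5),
  v4 = rnorm(200, mean = 5)
)

# t() puts variables in rows, so dist() measures distances between columns of toy
hc <- hclust(dist(t(toy)))

# Cut the tree into two groups of variables
groups <- cutree(hc, k = 2)
groups
```

Calling `plot(hc)` on this object would draw the same kind of dendrogram as in the steps above, with one leaf per variable.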



6)    Statistical prediction and modeling

trainSpam$numType = as.numeric(trainSpam$type) - 1

costFunction = function(x, y) sum(x != (y > 0.5))

cvError = rep(NA, 55)

library(boot)

## Fit a one-predictor logistic regression for each of the first 55 columns

for (i in 1:55) {

    lmFormula = reformulate(names(trainSpam)[i], response = "numType")

    glmFit = glm(lmFormula, family = "binomial", data = trainSpam)

    cvError[i] = cv.glm(trainSpam, glmFit, costFunction, 2)$delta[2]

}

## Which predictor has minimum cross-validated error?

names(trainSpam)[which.min(cvError)]
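
The cost function passed to `cv.glm` counts misclassifications: `cv.glm` calls it with the observed responses first and the predicted values second, and the function thresholds the predictions at 0.5. A tiny sketch with made-up labels and probabilities (hypothetical values) shows what it returns:

```r
# Same cost function as in the loop above
costFunction <- function(x, y) sum(x != (y > 0.5))

# Hypothetical observed labels (0/1) and predicted probabilities
observed  <- c(1, 0, 1, 0)
predicted <- c(0.9, 0.2, 0.4, 0.7)

# predicted > 0.5 gives TRUE, FALSE, FALSE, TRUE; the last two disagree with observed
costFunction(observed, predicted)   # 2
```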

7)     Testing on the held-out set

## Use the best model from the group

predictionModel = glm(numType ~ charDollar, family = "binomial", data = trainSpam)

## Get predictions on the test set

predictionTest = predict(predictionModel, testSpam, type = "response")

predictedSpam = rep("nonspam", dim(testSpam)[1])

## Classify as `spam' for those with prob > 0.5
## (uses the test-set probabilities; predictionModel$fitted holds training-set fits)

predictedSpam[predictionTest > 0.5] = "spam"

## Classification table: inspect the classification results

table(predictedSpam, testSpam$type)
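
When thresholding at 0.5 it helps to know which scale `predict()` returns. For a binomial `glm`, the default is the linear predictor (log-odds), and `type = "response"` gives probabilities. A sketch on simulated data (an assumption, not the spam data) shows how the two relate:

```r
# Simulated data (assumption): one predictor, binary outcome
set.seed(7)
x <- runif(300)
y <- rbinom(300, size = 1, prob = plogis(-2 + 5 * x))
fit <- glm(y ~ x, family = "binomial")

newData <- data.frame(x = c(0.1, 0.5, 0.9))
linkScale <- predict(fit, newData)                     # log-odds (default)
probScale <- predict(fit, newData, type = "response")  # probabilities in (0, 1)

# The two scales are related by the inverse logit
all.equal(plogis(linkScale), probScale)   # TRUE
```

A probability of 0.5 corresponds to a log-odds value of 0, so a 0.5 cutoff only means "more likely spam than not" on the response scale.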

Reported classification error rate: (61 + 458)/(1346 + 458 + 61 + 449) = 0.2243
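
The error rate can be recomputed from a confusion matrix as the off-diagonal share of all cases. The sketch below uses the four counts from the formula above; which of 61 and 458 sits in which off-diagonal cell is an assumption, and it does not affect the result:

```r
# Counts taken from the error-rate formula above; the row/column layout is an assumption
confusion <- matrix(
  c(1346, 61, 458, 449), nrow = 2,
  dimnames = list(predicted = c("nonspam", "spam"),
                  actual    = c("nonspam", "spam"))
)

# Error rate = off-diagonal counts / total
errorRate <- 1 - sum(diag(confusion)) / sum(confusion)
round(errorRate, 4)   # 0.2243
```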

8)    Interpret results

The fraction of characters that are dollar signs can be used to predict if an email is Spam

Anything with more than 6.6% dollar signs is classified as Spam

More dollar signs always means more Spam under our prediction

Our test set error rate was 22.4%
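
A cutoff like the one quoted above presumably corresponds to the predictor value at which the fitted probability crosses 0.5. For a one-predictor logistic regression with intercept b0 and slope b1, that point is -b0/b1. A minimal sketch on simulated data (the coefficients and data here are hypothetical, not those of the spam model):

```r
# Simulated data (assumption): a single predictor x and a binary label
set.seed(11)
x <- runif(500, 0, 0.3)
y <- rbinom(500, size = 1, prob = plogis(-1.5 + 20 * x))
fit <- glm(y ~ x, family = "binomial")

# P(y = 1) = 0.5 exactly where the linear predictor b0 + b1 * x equals 0
b <- coef(fit)
threshold <- unname(-b[1] / b[2])

# Check: the predicted probability at the threshold is 0.5
pAtThreshold <- predict(fit, data.frame(x = threshold), type = "response")
pAtThreshold
```

With a positive slope the fitted probability rises monotonically in the predictor, which is why a single cutoff separates the two predicted classes.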

9)    Challenge results

10)  Synthesize/write up results

11)   Create reproducible code

