Web Scraping with R: Batch-Collecting Mobile Phone Information from JD.com
2017-06-27

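In the mobile internet era, when nearly everyone carries a smartphone, the device has become as important to many people as an organ of their own body. If you cannot talk for a while about the "selling points" and "target audience" of the major brands, people may well suspect you are not from Earth.
Today we explore how to scrape phone information for the major brands from JD.com (京東商城).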

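1. Prerequisites
The skills needed for web scraping with R include:
    Basic web knowledge, such as parsing HTML and XML files
    Analyzing XPath expressions
    Using the browser's web developer tools
    Exception handling
    String handling
    Regular expressions
    Basic database operations
Don't worry, though: for now you only need the first three to start practicing.
For the first three skills, see the book Automated Data Collection with R. Under normal circumstances they can be picked up in roughly a day.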
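
For readers new to XPath, the first two skills can be tried out on a tiny inline page before touching a real site. The sketch below assumes the XML package is installed. The HTML fragment, the `brands` id, and the link values are made up for illustration and are not JD.com's actual markup.

```r
library(XML)

# A made-up HTML fragment standing in for a search results page
html <- '<html><body>
  <ul id="brands">
    <li><a href="/search?ev=brand_a">Brand A</a></li>
    <li><a href="/search?ev=brand_b">Brand B</a></li>
  </ul>
</body></html>'

# Parse the string as HTML (asText = TRUE: the input is content, not a file name)
doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")

# Text of every <a> in the list: pass xmlValue as the extractor function
brand_names <- xpathSApply(doc, '//ul[@id="brands"]/li/a', xmlValue)

# Attribute values: end the XPath with /@href; as.character() drops the names
brand_hrefs <- as.character(xpathSApply(doc, '//ul[@id="brands"]/li/a/@href'))

print(brand_names)
print(brand_hrefs)
```

The same two calls, `htmlParse` and `xpathSApply`, carry most of the work in the sections that follow; only the XPath strings change.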
2. Page analysis
(To be completed)
3. Extracting the links for each brand
#### packages we need ####
## ----------------------------------------------------------------------- ##
# library() stops with an error if a package is missing; require() only warns
library(stringr)
library(XML)
library(RCurl)
library(Rwebdriver)

setwd("JDDownload")

BaseUrl<-"http://search.jd.com"

quit_session()
start_session(root = "http://localhost:4444/wd/hub/",browser = "firefox")

# post Base Url
post.url(url = BaseUrl)

SearchField<-element_xpath_find(value = '//*[@id="keyword"]')
SearchButton<-element_xpath_find(value = '//*[@id="gwd_360buy"]/body/div[2]/form/input[3]')
# keyword for search (Chinese for "mobile phone")
keywords<-'手機'

element_click(SearchField)
keys(keywords)
element_click(SearchButton)
Sys.sleep(1)
#test
get.url()

pageSource<-page_source()
parsedSourcePage<-htmlParse(pageSource, encoding = 'UTF-8')
## Download Search Results
fname <- paste0(keywords, " SearchPage 1.html")
writeLines(pageSource, fname)

#get all the brand url
Brand<-'//*[@id="J_selector"]/div[1]/div/div[2]/div[3]/ul/li/a/@href'
BrandLinks<-xpathSApply(doc = parsedSourcePage, path = Brand)

View(data.frame(BrandLinks))

BrandLinks<-sapply(BrandLinks,function(x){
  paste0(BaseUrl,"/",x)
  })

save(BrandLinks,file = 'BrandLinks.rda')

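4. Visiting each brand's page and collecting its product links

############## Function 1 ##############

### Scrape the phone listing page of each brand ###

getBrandPage<-function(BrandUrl,foreDownload = TRUE){
  # load the search page for one brand (foreDownload is currently unused)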
  post.url(BrandUrl)
  Brand_pageSource<-page_source()
  #parse
  parsedSourcePage<-htmlParse(Brand_pageSource, encoding = 'UTF-8')
 
  #get brand name
  BrandNamePath<-'//*[@id="J_crumbsBar"]/div[2]/div/a/em'
  BrandName<-xpathSApply(doc = parsedSourcePage, path = BrandNamePath, fun = xmlValue)
 
  #Save the page
  BrandPageName<-paste0(BrandName,'_PageSource.html')
  #Create a directory for this brand if it does not exist yet
  if(!dir.exists(BrandName)) dir.create(BrandName)
  # save
  writeLines(text = Brand_pageSource, con = paste0(BrandName,'/',BrandPageName))
 
  # get the product page url
    #path
    Brand_AllProductPath<-'//*[@id="J_goodsList"]/ul/li/div/div[4]/a/@href'
   #url
    Brand_AllProductLinks<-xpathSApply(doc = parsedSourcePage, path = Brand_AllProductPath)
 
#     #remove some false url
#     FalseLink<-grep(x = Brand_AllProductLinks,pattern = 'https',fixed = TRUE)
#     Brand_AllProductLinks<-Brand_AllProductLinks[-FalseLink]
    
    # add a head
    Brand_AllProductLinks<-str_c('http:',Brand_AllProductLinks)
  #save and return the url
    save(Brand_AllProductLinks,file = paste0(BrandName,'_AllProductLinks.rda'))
    return(Brand_AllProductLinks)
}

# test
BrandUrl<-BrandLinks[1]

getBrandPage(BrandUrl)

#get all the links (pause 10 seconds between brands to go easy on the server)
Brand_ProductLink<-list()
for(i in seq_along(BrandLinks)){
  Sys.sleep(10)
  Brand_ProductLink[[i]]<-getBrandPage(BrandUrl = BrandLinks[i])
}

#clean the links
All_ProductLink<-lapply(Brand_ProductLink,function(x){
   TrueLink<-grep(x = x,pattern = 'http://item.jd.com/',fixed = TRUE,value = FALSE)
   return(x[TrueLink])
})
# save the links
save(All_ProductLink,file = 'All_ProductLink.rda')
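
The cleaning step above keeps only links that contain `http://item.jd.com/`. The following minimal sketch applies the same filter to made-up links. It also shows why a logical mask built with `grepl` is a safer variant than the negative indexing used in the commented-out lines of `getBrandPage`: when nothing matches, `x[-integer(0)]` returns an empty vector rather than all of `x`.

```r
# Made-up links for illustration; real search results will differ
links <- c(
  "http://item.jd.com/1000001.html",
  "http://ads.example.com/click?id=42",
  "http://item.jd.com/1000002.html"
)

# Same filter as in the article: keep links containing the product-page prefix
keep_idx <- grep("http://item.jd.com/", links, fixed = TRUE)
product_links <- links[keep_idx]

# Pitfall: when nothing matches, negative indexing drops everything
no_hit <- grep("https", links, fixed = TRUE)   # integer(0)
dropped_all <- links[-no_hit]                  # character(0), not the full vector

# A logical mask behaves correctly whether or not there is a match
safe <- links[!grepl("https", links, fixed = TRUE)]

print(product_links)
print(length(dropped_all))
print(safe)
```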

5. Visiting each product page and extracting the useful information

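As a first pass we extract the following fields: title (Title), selling points (KeyCount), price (Price), number of comments (commentCount), screen size (Size), rear camera pixels (BackBit), front camera pixels (ForwardBit), number of cores (Core), resolution (Resolution), brand (Brand), and release date (onSaleTime).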
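
One detail worth keeping in mind for these fields: an XPath that matches nothing gives back an empty result rather than `NA`, and an empty result does not fit into a one-row data frame. The helper below, `first_or_na`, is a hypothetical name that is not part of the original code. It sketches one way to keep a row with `NA` in that case, assuming the XML package is installed. The HTML fragment is made up.

```r
library(XML)

# Hypothetical helper: text of the first matched node, or NA when nothing matches
first_or_na <- function(doc, path) {
  res <- xpathSApply(doc, path, xmlValue)
  if (length(res) == 0) NA_character_ else as.character(res[[1]])
}

# Made-up product page fragment that has a title but no price element
html <- '<html><body>
  <div id="name"><h1>Example Phone X</h1></div>
  <ul id="parameter2"><li>Name: Example Phone X</li></ul>
</body></html>'
doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")

row <- data.frame(
  Title = first_or_na(doc, '//*[@id="name"]/h1'),
  Price = first_or_na(doc, '//*[@id="jd-price"]'),  # absent in this fragment
  stringsAsFactors = FALSE
)
print(row)
```

With a helper of this kind, a page that lacks one field still yields a usable row, and only pages that fail to load at all need to go through error handling.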

#################################################
######## Function 2: visit each product page and extract the useful information ########

Product<-function(ProductLink){
  post.url(ProductLink)
  Sys.sleep(4)
 
  # get the page
  Product_pageSource<-page_source()
 
  #parse
  Parsed_product_Page<-htmlParse(Product_pageSource, encoding = 'UTF-8')
 
  # get title,,key count,price,CommentCount and so on
 
  #PATH
  TitlePath<-'//*[@id="name"]/h1'
  KeyCountPath<-'//*[@id="p-ad"]'
  PricePath<-'//*[@id="jd-price"]'
  commentCountPath<-'//*[@id="comment-count"]/a'
  SizePath<-'//*[@id="parameter1"]/li[1]/div/p[1]'
  BackBitPath<-'//*[@id="parameter1"]/li[2]/div/p[1]'
  ForwardBitPath<-'//*[@id="parameter1"]/li[2]/div/p[2]'
  CorePath<-'//*[@id="parameter1"]/li[3]/div/p[1]'
  NamePath<-'//*[@id="parameter2"]/li[1]'
  CodePath<-'//*[@id="parameter2"]/li[2]'
  BrandPath<-'//*[@id="parameter2"]/li[3]'
  onSaleTimePath<-'//*[@id="parameter2"]/li[4]'
  ResolutionPath<-'//*[@id="parameter1"]/li[1]/div/p[2]'
 
  Title<-xpathSApply(doc = Parsed_product_Page,path = TitlePath,xmlValue)
  KeyCount<-xpathSApply(doc = Parsed_product_Page,path = KeyCountPath,xmlValue)
  Price<-xpathSApply(doc = Parsed_product_Page,path = PricePath,xmlValue)
  commentCount<-xpathSApply(doc = Parsed_product_Page,path = commentCountPath,xmlValue)
  Size<-xpathSApply(doc = Parsed_product_Page,path = SizePath,xmlValue)
  BackBit<-xpathSApply(doc = Parsed_product_Page,path = BackBitPath,xmlValue)
  ForwardBit<-xpathSApply(doc = Parsed_product_Page,path = ForwardBitPath,xmlValue)
  Core<-xpathSApply(doc = Parsed_product_Page,path = CorePath,xmlValue)
  Name<-xpathSApply(doc = Parsed_product_Page,path = NamePath,xmlValue)
  Code<-xpathSApply(doc = Parsed_product_Page,path = CodePath,xmlValue)
  Resolution<-xpathSApply(doc = Parsed_product_Page,path = ResolutionPath,xmlValue)
  Brand<-xpathSApply(doc = Parsed_product_Page,path = BrandPath,xmlValue)
  onSaleTime<-xpathSApply(doc = Parsed_product_Page,path = onSaleTimePath,xmlValue)
 
  # assemble the fields into a data frame
  mydata<-data.frame(Title = Title,KeyCount = KeyCount, Price = Price,
                     commentCount = commentCount, Size = Size, BackBit = BackBit,
                     ForwardBit = ForwardBit, Core = Core, Name = Name,Code = Code,
                     Resolution = Resolution,
                     Brand = Brand, onSaleTime = onSaleTime)
  #save the page (create the output directory on first use)
  if(!dir.exists('Product')) dir.create('Product')
  FileName<-paste0('Product/',Brand,Code,'_pageSource.html')
  writeLines(text = Product_pageSource,con = FileName)
 #return the data
  return(mydata)
 
}

# test
quit_session()
start_session(root = "http://localhost:4444/wd/hub/",browser = "firefox")

load(file = 'All_ProductLink.rda')

ProductLink1<-All_ProductLink[[40]][1]

testData<-Product(ProductLink = ProductLink1)

# define a tryCatch wrapper so that one failed page does not stop the whole run

mySpider<-function(ProductLink){
  out<-tryCatch(
    {
      message('This is the try part:')
     Product(ProductLink = ProductLink)
    },
    error=function(e){
      message(conditionMessage(e))
      return(NA)
    },
    finally = {
      message("The end!")
    }
  )
  return(out)
}

## loop

# get all data
ProductInformation<-list()
k <-0

# seq_along() skips brands whose link vector is empty (1:length(x) would yield 1, 0)
for(i in seq_along(All_ProductLink)){
  for(j in seq_along(All_ProductLink[[i]])){
    k<-k+1
    ProductInformation[[k]]<-mySpider(ProductLink = All_ProductLink[[i]][j])
  }
}

# save my data
MobilePhoneInformation<-do.call(rbind,ProductInformation)
View(MobilePhoneInformation)
save(MobilePhoneInformation,file = 'MobilePhoneInformation.rda')

nrow(na.omit(MobilePhoneInformation))
View(MobilePhoneInformation)
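
The `mySpider` wrapper above returns `NA` for a failed page so that the loop keeps going. To see this behaviour without starting a browser session, the same pattern can be exercised with a stand-in function. In the sketch below, `fake_product` and `safe_call` are hypothetical names, and `fake_product` fails on purpose for one input so that the error branch is reached.

```r
# Stand-in for Product(): fails on purpose for links containing "bad"
fake_product <- function(link) {
  if (grepl("bad", link, fixed = TRUE)) stop("page could not be parsed: ", link)
  data.frame(Link = link, Title = "ok", stringsAsFactors = FALSE)
}

# Same shape as mySpider(): return NA instead of aborting the loop
safe_call <- function(link) {
  tryCatch(
    fake_product(link),
    error = function(e) {
      message("skipped: ", conditionMessage(e))
      NA
    }
  )
}

links <- c("http://item.jd.com/1.html", "bad-link", "http://item.jd.com/2.html")
results <- lapply(links, safe_call)

# Keep only the successful results before binding the rows together
ok <- vapply(results, is.data.frame, logical(1))
combined <- do.call(rbind, results[ok])
print(combined)
```

Filtering on `is.data.frame` before `rbind` is an alternative to calling `na.omit` afterwards; it removes failed pages without also discarding rows where only a single field is missing.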

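In the end we obtained a little over 800 rows of information; after removing rows with missing values, a little over 600 rows remain, which is not bad. The final data set is available here.

However, the data still needs further cleaning before it can be analyzed.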
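
As one example of such cleaning, the scraped price and comment-count fields are text and need to be turned into numbers. The sketch below uses base R only. The sample values are hypothetical, since the exact format of the scraped strings is not shown in this article, and `extract_number` is an illustrative helper name.

```r
# Hypothetical raw values; the real scraped strings may look different
raw <- data.frame(
  Price        = c("\uffe51999.00", "\uffe5899.00", NA),
  commentCount = c("12000+", "356", "(48)"),
  stringsAsFactors = FALSE
)

# Pull the first number (optionally with decimals) out of each string;
# strings without a number, and NA inputs, become NA
extract_number <- function(x) {
  out <- rep(NA_real_, length(x))
  hit <- regexpr("[0-9]+(\\.[0-9]+)?", x)
  ok  <- !is.na(hit) & hit > -1L
  out[ok] <- as.numeric(regmatches(x, hit))
  out
}

clean <- data.frame(
  Price        = extract_number(raw$Price),
  commentCount = extract_number(raw$commentCount)
)
print(clean)
```

Fields such as screen size or camera pixels could be handled the same way once their actual formats have been inspected.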
