Practical Guide | A Hands-On Data Analysis Case Study: User Behavior Prediction

2021-12-22

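Produced by CDA Data Analyst

Author: CDA Teaching and Research Group

Editor: Mika

Case overview

Background: Using user behavior data from a large e-commerce platform as the dataset, this case study applies big-data processing techniques to analyze user behavior at scale, then builds logistic regression and random forest models to predict user behavior.

Approach:

Read the full dataset with big-data processing tools
Preprocess the full dataset
Sample part of the data to debug the model
Build the model on the full dataset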

# Display the result of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

Data dictionary:

U_Id: the serialized ID that represents a user

T_Id: the serialized ID that represents an item

C_Id: the serialized ID that represents the category which the corresponding item belongs to

Ts: the timestamp of the behavior

Be_type: enum-type from ('pv', 'buy', 'cart', 'fav')

pv: Page view of an item's detail page, equivalent to an item click

buy: Purchase an item

cart: Add an item to shopping cart

fav: Favor an item

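Reading the data

The key here is using the dask library to handle data at this scale. On data this large, most of its operations can run roughly ten times faster than plain pandas.

pandas is popular and powerful for analyzing structured data, but its biggest limitation is that it was not designed with scalability in mind. It is well suited to small structured datasets and is highly optimized for fast, efficient operations on data held in memory. As data volume grows, however, a single machine can no longer load everything, and processing across a cluster becomes the better option. This is where the Dask DataFrame API comes in: it wraps pandas, splits a huge DataFrame into smaller pieces, distributes them across multiple workers, and keeps them on disk rather than in RAM.

A Dask DataFrame is split into multiple parts, each called a partition. Every partition is a relatively small DataFrame that can be assigned to any worker, and it keeps its full data when it needs to be replicated. Operations run on each partition in parallel or one at a time (across machines too, if several are available), and the partial results are then merged, which is the approach one would intuitively expect.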
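The split, per-partition work, and merge steps described above can be sketched in plain Python. This is only an illustration of the idea, not Dask's implementation: the helper names and the toy `behaviors` list are hypothetical, and the partitions here are processed one after another rather than in parallel.

```python
from collections import Counter


def split_into_partitions(rows, partition_size):
    """Cut a list of rows into consecutive chunks of at most partition_size rows."""
    return [rows[i:i + partition_size] for i in range(0, len(rows), partition_size)]


def count_partition(partition):
    """Per-partition step: count behavior types within one chunk only."""
    return Counter(partition)


def merge_counts(partial_counts):
    """Combine step: add up the per-partition counters."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


# Hypothetical toy data standing in for the Be_type column
behaviors = ["pv", "pv", "cart", "pv", "buy", "fav", "pv", "cart"]
partitions = split_into_partitions(behaviors, partition_size=3)
totals = merge_counts(count_partition(p) for p in partitions)
print(len(partitions), dict(totals))
```

Because each partition is counted independently, the per-partition step could be handed to separate workers; only the small counters need to be brought together at the end.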

# Install the library (Tsinghua mirror)
# pip install dask -i https://pypi.tuna.tsinghua.edu.cn/simple
import os
import gc  # garbage collection interface
import sys  # access to external arguments
import time
from tqdm import tqdm  # progress bar library
import dask  # parallel computing interface
from dask.diagnostics import ProgressBar
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import dask.dataframe as dd  # dask's DataFrame module

When working with data at this scale, you can add a gc.collect() line after each block of code finishes to reclaim memory. Dask DataFrames expose largely the same API as pandas DataFrames.

gc.collect()

# Load the data
# If needed, pass blocksize= to control how the file is split into partitions.
# The default is 64MB (it should be a multiple of the bus width, otherwise reads slow down).
data = dd.read_csv('UserBehavior_all.csv')
data.head()

data
Dask DataFrame Structure:
Dask Name: read-csv, 58 tasks

Unlike pandas, here we only get the structure of the DataFrame, not the actual data. Dask has split the data into chunks that live on disk rather than in RAM. To print the DataFrame, all chunks must first be loaded into RAM, stitched together, and then displayed. Calling .compute() forces this to happen; without it, nothing is computed. Dask relies on lazy loading, similar to Python's iterators: data is only actually loaded when it is needed.
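The comparison with Python iterators can be made concrete with a generator. The sketch below is an analogy only and involves no Dask code: `lazy_chunks` is a hypothetical helper that builds each chunk only when a consumer asks for it, and consuming the generator plays the role that `.compute()` plays for a Dask DataFrame.

```python
def lazy_chunks(n_chunks, chunk_size, log):
    """Yield chunks one at a time; nothing is built until a consumer asks."""
    for chunk_id in range(n_chunks):
        log.append(chunk_id)  # record that this chunk was really materialized
        start = chunk_id * chunk_size
        yield list(range(start, start + chunk_size))


loaded = []
plan = lazy_chunks(n_chunks=4, chunk_size=3, log=loaded)
loaded_before = list(loaded)  # still empty: only a "plan" exists so far

# Consuming the generator is what finally triggers the work
result = [value for chunk in plan for value in chunk]
print(loaded_before, loaded, len(result))
```

Creating `plan` costs almost nothing, which mirrors why displaying a Dask DataFrame shows only its structure until a computation is requested.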

# Actually load the data
data.compute()

# Visualize the task graph (58 partition tasks)
data.visualize()

Data preprocessing

Data compression

# Check the current data types
data.dtypes

U_Id int64
T_Id int64
C_Id int64
Be_type object
Ts int64
dtype: object

# Downcast the ID columns to 32-bit unsigned integers, since transaction data has no negative values
dtypes = {
    'U_Id': 'uint32',
    'T_Id': 'uint32',
    'C_Id': 'uint32',
    'Be_type': 'object',
    'Ts': 'int64'
}
data = data.astype(dtypes)

data.dtypes

U_Id uint32
T_Id uint32
C_Id uint32
Be_type object
Ts int64
dtype: object
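To get a feel for what the downcast buys, the arithmetic can be sketched with NumPy's dtype metadata. The row count is an assumption taken from the value_counts() output in the pie chart section of this article, and the estimate covers only the raw values of the three ID columns, not index or object overhead.

```python
import numpy as np

# Row count assumed from the value_counts() output shown in the pie chart section
n_rows = 89716264 + 5530446 + 2888258 + 2015839
id_columns = ["U_Id", "T_Id", "C_Id"]

bytes_before = len(id_columns) * n_rows * np.dtype("int64").itemsize
bytes_after = len(id_columns) * n_rows * np.dtype("uint32").itemsize
saved_gb = (bytes_before - bytes_after) / 1e9

# Largest ID that still fits after the downcast
uint32_max = np.iinfo(np.uint32).max
print(f"saved about {saved_gb:.2f} GB; IDs must be <= {uint32_max}")
```

The downcast is only safe if every ID is non-negative and no larger than `uint32_max`; values outside that range would wrap around silently.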

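Missing values

# On data read through dask, pandas-style calls such as .isnull() are lazy:
# they return only the structure, so they cannot be used directly to inspect missing values
data.isnull()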

Dask DataFrame Structure:

columns1 = [ 'U_Id', 'T_Id', 'C_Id', 'Be_type', 'Ts']
tmpDf1 = pd.DataFrame(columns=columns1)
tmpDf1

s = data["U_Id"].isna()
s.loc[s == True]

Dask Series Structure:
npartitions=58
bool ...
... ...
...
Name: U_Id, dtype: bool
Dask Name: loc-series, 348 tasks

Missing values in column U_Id: 0
Missing values in column T_Id: 0
Missing values in column C_Id: 0
Missing values in column Be_type: 0
Missing values in column Ts: 0

No missing values.
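The code that produced the per-column counts is not shown in the article. One common way to obtain such counts is sketched below on a small pandas DataFrame; the `sample` data is hypothetical, with one missing value planted on purpose. On a Dask DataFrame the same expression would need a trailing `.compute()`, as in `data.isnull().sum().compute()`.

```python
import pandas as pd

# Hypothetical miniature of the dataset, with one missing value planted in Ts
sample = pd.DataFrame({
    "U_Id": [1, 1, 2],
    "T_Id": [10, 11, 12],
    "C_Id": [100, 100, 101],
    "Be_type": ["pv", "buy", "cart"],
    "Ts": [1511544070, None, 1511561733],
})

# Count missing entries for every column in one pass
missing_per_column = sample.isnull().sum()

for column, n_missing in missing_per_column.items():
    print(f"Missing values in column {column}: {n_missing}")
```

Summing the boolean mask keeps the result small (one number per column), so only the aggregate needs to be brought into memory.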

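Data exploration and visualization

Here we use the pyecharts library, a data visualization tool that combines Python with echarts, the charting library open-sourced by Baidu. The code conventions of the newer 1.X releases differ substantially from the older 0.5.X releases; see the official documentation for the new version:
https://gallery.pyecharts.org/#/README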

# pip install pyecharts -i https://pypi.tuna.tsinghua.edu.cn/simple

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: pyecharts in d:\anaconda\lib\site-packages (0.1.9.4)
Requirement already satisfied: jinja2 in d:\anaconda\lib\site-packages (from pyecharts) (3.0.2)
Requirement already satisfied: future in d:\anaconda\lib\site-packages (from pyecharts) (0.18.2)
Requirement already satisfied: pillow in d:\anaconda\lib\site-packages (from pyecharts) (8.3.2)
Requirement already satisfied: MarkupSafe>=2.0 in d:\anaconda\lib\site-packages (from jinja2->pyecharts) (2.0.1)
Note: you may need to restart the kernel to use updated packages.

WARNING: Ignoring invalid distribution -umpy (d:\anaconda\lib\site-packages)
WARNING: Ignoring invalid distribution -ip (d:\anaconda\lib\site-packages)

Pie chart

# For example, suppose we want a pie chart showing the share of each user behavior
data["Be_type"]

# With dask, any supported pandas function needs a trailing .compute() before it actually runs
Be_counts = data["Be_type"].value_counts().compute()
Be_counts

pv 89716264
cart 5530446
fav 2888258
buy 2015839
Name: Be_type, dtype: int64

Be_index = Be_counts.index.tolist()  # extract the labels
Be_index

['pv', 'cart', 'fav', 'buy']

Be_values = Be_counts.values.tolist()  # extract the counts
Be_values

[89716264, 5530446, 2888258, 2015839]
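Before charting, it can help to turn the raw counts into percentage shares. The short sketch below uses the four counts printed above; the `be_counts` dictionary is just a hand-copied stand-in for the `Be_counts` Series.

```python
# Counts copied from the value_counts() output above
be_counts = {"pv": 89716264, "cart": 5530446, "fav": 2888258, "buy": 2015839}

total = sum(be_counts.values())
shares = {name: 100 * count / total for name, count in be_counts.items()}

for name, share in shares.items():
    print(f"{name}: {share:.2f}%")
```

Page views dominate the dataset, so the remaining three behaviors will appear as thin slices in any chart of these shares.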

from pyecharts import options as opts
from pyecharts.charts import Pie

# Pie expects its data as a list of [label, value] pairs
c = Pie()
# zip pairs each label with its count
c.add("", [list(z) for z in zip(Be_index, Be_values)])
c.set_global_opts(title_opts=opts.TitleOpts(title="User behavior"))  # global options (chart title)
c.set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))  # label format: name and count
c.render_notebook()  # render in the current notebook
# c.render("pie_base.html")  # optionally write the chart to a local file

<pyecharts.charts.basic_charts.pie.Pie at 0x1b2da75ae48>

Funnel chart

from pyecharts.charts import Funnel  # older pyecharts releases import without .charts
from IPython.display import Image as IMG
from pyecharts import options as opts
from pyecharts.charts import Pie

<pyecharts.charts.basic_charts.funnel.Funnel at 0x1b2939d50c8>
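The code that builds the funnel is not included in the article. As a hedged sketch of the data preparation only, the snippet below expresses each behavior as a percentage of page views and arranges the result as the same list of [label, value] pairs used for the pie chart. The stage order is an assumption (widest stage first), and the counts are copied from the value_counts() output earlier in the article.

```python
# Counts copied from the value_counts() output earlier in the article
be_counts = {"pv": 89716264, "cart": 5530446, "fav": 2888258, "buy": 2015839}

# Assumed stage order, widest stage first
stages = ["pv", "cart", "fav", "buy"]
top = be_counts[stages[0]]

# [label, value] pairs, each stage as a percentage of page views
funnel_data = [[stage, round(100 * be_counts[stage] / top, 2)] for stage in stages]
print(funnel_data)
```

Normalizing against the first stage makes the drop-off between viewing and buying readable at a glance, whichever charting call the pairs are eventually passed to.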

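Data analysis

Timestamp conversion

dask's support for timestamps is rather unfriendly, so the conversion below applies Python's time module to each value.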

type(data)

dask.dataframe.core.DataFrame

data['Ts1']=data['Ts'].apply(lambda x: time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(x)))
data['Ts2']=data['Ts'].apply(lambda x: time.strftime("%Y-%m-%d", time.localtime(x)))
data['Ts3']=data['Ts'].apply(lambda x: time.strftime("%H:%M:%S", time.localtime(x)))

D:\anaconda\lib\site-packages\dask\dataframe\core.py:3701: UserWarning:
You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
  Before: .apply(func)
  After:  .apply(func, meta=('Ts', 'object'))
  warnings.warn(meta_warning(meta))

data.head(1)

data.dtypes

U_Id uint32
T_Id uint32
C_Id uint32
Be_type object
Ts int64
Ts1 object
Ts2 object
Ts3 object
dtype: object
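The three derived columns all come from the same Unix timestamp, so the conversion can be checked on a single value outside Dask. The sketch below uses a hypothetical helper, `split_timestamp`, and swaps `time.localtime` for `time.gmtime` so the result does not depend on the machine's time zone; the article's own code uses local time.

```python
import time


def split_timestamp(ts):
    """Return (datetime, date, time-of-day) strings for a Unix timestamp in seconds."""
    parts = time.gmtime(ts)  # UTC here; the article's code uses time.localtime instead
    return (
        time.strftime("%Y-%m-%d %H:%M:%S", parts),
        time.strftime("%Y-%m-%d", parts),
        time.strftime("%H:%M:%S", parts),
    )


# One day plus 1 hour, 1 minute, 1 second after the epoch
ts1, ts2, ts3 = split_timestamp(86400 + 3661)
print(ts1, ts2, ts3)
```

When a function like this is passed to `.apply` on a Dask Series, the warning shown above suggests also supplying `meta=('Ts', 'object')` so that Dask does not have to guess the output type.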

Sample part of the data to debug the code

df = data.head(1000000)
df.head(1)

User traffic and purchase timing analysis

User behavior summary table

describe = df.loc[:,["U_Id","Be_type"]]
ids = pd.DataFrame(np.zeros(len(set(list(df["U_Id"])))),index=set(list(df["U_Id"])))
pv_class=describe[describe["Be_type"]=="pv"].groupby("U_Id").count()
pv_class.columns = ["pv"]
buy_class=describe[describe["Be_type"]=="buy"
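The listing above breaks off partway through in the source, so the author's full table-building code is not available. As a hedged sketch of the same goal (one row per user, one column per behavior type), the snippet below uses a pandas groupby on a hypothetical `sample` frame; it is an alternative formulation, not a reconstruction of the missing lines.

```python
import pandas as pd

# Hypothetical miniature of the sampled data
sample = pd.DataFrame({
    "U_Id": [1, 1, 1, 2, 2, 3],
    "Be_type": ["pv", "pv", "buy", "pv", "cart", "fav"],
})

behavior_table = (
    sample.groupby(["U_Id", "Be_type"])
    .size()                    # number of events per (user, behavior) pair
    .unstack(fill_value=0)     # behaviors become columns, absent pairs become 0
    .reindex(columns=["pv", "buy", "cart", "fav"], fill_value=0)
)
print(behavior_table)
```

Counting all four behaviors in a single groupby avoids filtering the frame once per behavior and then joining the pieces, and users with no events of a given type get an explicit zero.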
