Preview: Exploratory Data Analysis (EDA) in Data Science

2021-11-16

Produced by CDA Data Analyst

Author: tukey

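Data science enthusiasts know that raw data needs a great deal of preprocessing before it can be fed into a machine learning model. Preparing the data follows a set of standard steps that depend on the type of problem at hand (regression or classification). A major part of this process is evaluating the dataset in every way possible, to find valuable correlations (dependencies of the features on one another and on the target) and to rule out noise (inconsistencies or outliers, that is, substandard data points). For exploring any dataset, Python is one of the most powerful data analysis tools available, and equally powerful Python libraries exist for visualizing the data.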

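So, to make data more meaningful, or to extract more value from the data available, it must be interpreted and analyzed quickly. This is where Python data visualization libraries excel: they generate graphical representations and let the data speak. In this way we can uncover the trends and patterns hidden behind large volumes of data.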

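Today, data science and machine learning are no longer only for people with a strong computer science background. On the contrary, professionals from many different industries who share the same passion for data are welcome, even if they have only some knowledge of statistics, and this trend is growing. That is why people from a variety of professional and educational backgrounds are inclined to try what data science and artificial intelligence have to offer.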

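But for beginners who are just starting out with machine learning, having so many options for understanding data is challenging and sometimes overwhelming. We all want our data to look good and be presentable so that decisions can be made faster. Overall, EDA can be a time-consuming process, because we look carefully at many plots to work out which features are important and have a significant effect on the outcome. In addition, we look for ways to handle missing values and/or outliers, to fix imbalance in the dataset, and to carry out many other such challenging tasks. Choosing the best library for one's EDA needs is therefore a hard choice.

For anyone beginning a machine learning journey, starting with automated EDA libraries is a good learning experience. These libraries give a good overall view of the data and are easy to use. With just a few lines of simple Python code they save time and let newcomers focus on learning how to use the different plots to understand the data. Beginners do, however, need a basic understanding of the plots these libraries generate.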

In this article we discuss three interesting automated EDA Python libraries for beginners. For this beginner-friendly tutorial we use the built-in "iris" dataset from sklearn.

We will first import the packages and libraries.

# loading the dataset
from sklearn import datasets
import pandas as pd
print("pandas:", pd.__version__)

pandas: 1.3.2

data = datasets.load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = pd.Series(data.target)
df.head()

If we do not use AutoEDA, here is a list of commands commonly used for EDA to print different information about a DataFrame/dataset (not necessarily in this order).

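df.head() – the first five rows
df.tail() – the last five rows
df.describe() – basic statistics about the dataset, such as percentiles, mean, and standard deviation
df.info() – a summary of the dataset
df.shape – the number of observations and variables in the dataset, that is, the dimensions of the data (an attribute, so no parentheses)
df.dtypes – the data types of the variables (int, float, object, datetime); also an attribute
df.nunique() / df['target'].unique() – the number of distinct values in each column / the unique values in the target column
df['target'].value_counts() – the distribution of the target variable in a classification problem
df.isnull().sum() – the count of null values in each column
df.corr() – correlation information
and so on...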
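As a minimal sketch, the commands listed above can be run together as follows. The small DataFrame here is made-up toy data (not the iris dataset), with one missing value added on purpose so that isnull() has something to report; the names toy, summary, and corr are illustrative only.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only (not iris).
toy = pd.DataFrame(
    {
        "length": [5.1, 4.9, 6.3, 5.8, None, 7.1],
        "width": [3.5, 3.0, 3.3, 2.7, 3.0, 3.0],
        "target": [0, 0, 1, 1, 2, 2],
    }
)

print(toy.head())      # first five rows
print(toy.tail())      # last five rows
print(toy.describe())  # count, mean, std, min, quartiles, max
toy.info()             # column types and non-null counts (prints directly)

# shape and dtypes are attributes, so they take no parentheses.
summary = {
    "shape": toy.shape,
    "dtypes": toy.dtypes.astype(str).to_dict(),
    "unique_targets": sorted(toy["target"].unique().tolist()),
    "target_counts": toy["target"].value_counts().to_dict(),
    "missing": toy.isnull().sum().to_dict(),
}
print(summary)

# Pairwise Pearson correlation between the numeric columns.
corr = toy.corr()
print(corr)
```

Collecting the results in a dictionary is just one way to keep them together; in a notebook each command is more often run in its own cell.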

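Look at how many commands we have to use to find insights in the data. AutoEDA libraries can do all of this and more with a few lines of Python code. Before we start, though, let us check the installed Python version, because these libraries require Python >= 3.6. To get the version information, use the following commands in Colab.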

# python version
import sys
sys.version

'3.7.6 (default, Jan 8 2020, 19:59:22) \n[GCC 7.3.0]'
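Reading the version string by eye works, but the check can also be made explicit. The sketch below compares sys.version_info against the 3.6 minimum mentioned above; the helper name python_is_supported and the constant MIN_PYTHON are made up for this example.

```python
import sys

MIN_PYTHON = (3, 6)  # minimum version stated in the article


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the interpreter is at least `minimum` (major, minor)."""
    return tuple(version_info[:2]) >= tuple(minimum)


if python_is_supported():
    print("Python", sys.version.split()[0], "meets the >= 3.6 requirement")
else:
    print("Python", sys.version.split()[0], "is too old for these libraries")
```

Comparing tuples avoids the pitfalls of comparing version strings, where "3.10" would sort before "3.6".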

With a suitable Python version confirmed, we can now automate the exploratory data analysis.

01. Pandas Profiling 3.0.0

import pandas_profiling
print("pandas_profiling:", pandas_profiling.__version__)

pandas_profiling: 3.0.0

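From the report, a beginner can easily see that the iris dataset has 5 variables: 4 numeric variables, plus the outcome variable, which is categorical. In addition, the dataset has 150 samples and no missing values.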
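The figures quoted from the report can be cross-checked by hand with plain pandas. This is only a rough sketch: it rebuilds the same DataFrame as above and treats any column with at most 10 distinct values as categorical, which is an arbitrary threshold chosen for this example and not the rule pandas_profiling itself applies.

```python
import pandas as pd
from sklearn import datasets

# Rebuild the same DataFrame the article uses.
data = datasets.load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = pd.Series(data.target)

n_rows, n_cols = df.shape
n_missing = int(df.isnull().sum().sum())

# Assumption for this sketch: a column with very few distinct values is
# treated as categorical. The threshold of 10 is arbitrary.
categorical = [col for col in df.columns if df[col].nunique() <= 10]
numeric = [col for col in df.columns if col not in categorical]

print("rows:", n_rows, "columns:", n_cols)
print("missing values:", n_missing)
print("numeric:", numeric)
print("categorical:", categorical)
```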

# Generating the Pandas Profiling report
report = pandas_profiling.ProfileReport(df)
report

02. Sweetviz 2.1.3

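This is also an open-source Python library; it performs in-depth EDA with just two lines of code. The report it generates for a dataset is delivered as an .html file that can be opened in any browser. With Sweetviz we can check how the dataset's features are associated with the target value.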

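We can also visualize test and training data and compare them; analyze(), compare(), or compare_intra() evaluate the data and generate the report. The report plots correlations for numerical and categorical variables.

It also summarizes information about missing values, duplicate entries, and frequent entries, together with a numerical analysis, that is, an interpretation of the statistical values.

As in the previous section, we first import pandas to read and process the dataset.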

Next, we simply import sweetviz to explore the data.

import sweetviz as sv
print("sweetviz :", sv.__version__)

sweetviz : 2.1.3

This is what a classic Sweetviz report looks like:

# Generating Sweetviz report
report = sv.analyze(df)
report.show_html("iris_EDA_report.html")  # specify a name for the report

| | [ 0%] 00:00 -> (? left)
Report iris_EDA_report.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop
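The analyze() call above profiles a single DataFrame. For the train/test comparison mentioned earlier, a rough sketch might look like the following. The helper names split_train_test and compare_report, the 80/20 split, and the stand-in demo data are all assumptions made for this example; the Sweetviz import sits inside the function so the split helper can be tried even where Sweetviz is not installed.

```python
import pandas as pd


def split_train_test(df, train_frac=0.8, seed=0):
    """Shuffle-split a DataFrame (with a unique index) into train and test."""
    train = df.sample(frac=train_frac, random_state=seed)
    test = df.drop(train.index)
    return train, test


def compare_report(train, test, path="iris_compare_report.html"):
    """Write a Sweetviz train-vs-test comparison report to `path`."""
    import sweetviz as sv  # imported lazily; needs `pip install sweetviz`

    report = sv.compare([train, "Train"], [test, "Test"])
    report.show_html(path, open_browser=False)
    return path


# Hypothetical stand-in data so the split can be tried without iris.
demo = pd.DataFrame({"x": range(10), "y": [v * 2 for v in range(10)]})
train, test = split_train_test(demo)
print(len(train), len(test))
```

With the iris DataFrame from earlier, calling compare_report(*split_train_test(df)) would be the equivalent step.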

You can find the generated .html reports in the current directory and open them in a browser.
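In Colab or another remote notebook the browser may not open by itself, so it can help to locate the report files explicitly. The sketch below uses only the standard library; find_html_reports and open_report are names invented for this example.

```python
from pathlib import Path
import webbrowser


def find_html_reports(directory="."):
    """Return the .html files in `directory`, sorted by name."""
    return sorted(Path(directory).glob("*.html"))


def open_report(path):
    """Open one report in the default web browser."""
    return webbrowser.open(Path(path).resolve().as_uri())


for report_path in find_html_reports():
    print(report_path.name)

# open_report("iris_EDA_report.html")  # uncomment once the report exists
```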

03. AutoViz 0.0.83

Another open-source Python EDA library that can quickly analyze any data with a single line of code.

# pip install autoviz
# pip install wordcloud

from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()

Imported AutoViz_Class version: 0.0.84. Call using:
    AV = AutoViz_Class()
    AV.AutoViz(filename, sep=',', depVar='', dfte=None, header=0, verbose=0,lowess=False,chart_format='svg',max_rows_analyzed=150000,max_cols
Note: verbose=0 or 1 generates charts and displays them in your local Jupyter notebook.
verbose=2 does not show plot but creates them and saves them in AutoViz_Plots directory

Because we are using a dataset from a library, we pass it through the 'dfte' option instead of giving a filename for the EDA.

# Generating AutoViz Report
# this is the default command when using a file for the dataset
filename = ""
sep = ","
dft = AV.AutoViz(
    filename, sep=",", depVar="", dfte=None, header=0, verbose=0,
    lowess=False, chart_format="svg",
    max_rows_analyzed=150000, max_cols_analyzed=30,
)

Dataname input must be a filename with path to that file or a Dataframe
Not able to read or load file. Please check your inputs and try again...

# Generating AutoViz Report
filename = ""  # empty string ("") as filename since no file is being used for the data
sep = ","
dft = AV.AutoViz(
    "", sep=",", depVar="", dfte=df, header=0, verbose=0,
    lowess=False, chart_format="svg",
    max_rows_analyzed=150000, max_cols_analyzed=30,
)

Shape of your Data Set loaded: (150, 5)
############## C L A S S I F Y I N G V A R I A B L E S ####################
Classifying variables in data set...
Number of Numeric Columns = 4
Number of Integer-Categorical Columns = 1
Number of String-Categorical Columns = 0
Number of Factor-Categorical Columns = 0
Number of String-Boolean Columns = 0
Number of Numeric-Boolean Columns = 0
Number of Discrete String Columns = 0
Number of NLP String Columns = 0
Number of Date Time Columns = 0
Number of ID Columns = 0
Number of Columns to Delete = 0
5 Predictors classified...
This does not include the Target column(s)
No variables removed since no ID or low-information variables found in data set
Number of All Scatter Plots = 10

Time to run AutoViz (in seconds) = 6.979
###################### VISUALIZATION Completed ########################

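The AutoViz report includes information about the shape of the dataset along with all possible charts, including bar charts, violin plots, a correlation matrix (heatmap), pair plots, and so on. Getting all of this from a single line of code is certainly useful for any beginner.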

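So, with three AutoEDA libraries, we have automated the data analysis of a small dataset with minimal code. All of the code above can be accessed through the link to the original article.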

Conclusion

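From a beginner's point of view, Pandas Profiling, Sweetviz, and AutoViz seem to be the simplest tools for generating reports and presenting insights into a dataset. When starting on data exploration, I often use these libraries to discover interesting patterns and trends in the data quickly and with minimal code. I hope you find them useful!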
