R語言作為BI中ETL的工具
R語言作為BI中ETL的工具,增刪改
R語言提供了強(qiáng)大的R_package與各種數(shù)據(jù)庫進(jìn)行數(shù)據(jù)交互。
外加其強(qiáng)大數(shù)據(jù)變換清洗函數(shù),為ETL提供一條方便快捷的道路。
RODBC
ROracal
RMysql
Rmongodb
step1 新建連接con,并查看其信息
library(RODBC)
con<-odbcConnect("LI")
con
RODBC Connection 1
Details:
case=nochange
DSN=LI
UID=
Trusted_Connection=Yes
APP=RStudio
WSID=LIYI-PC
step2 引入數(shù)據(jù)集USArrests
data(USArrests)
head(USArrests)
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
Colorado 7.9 204 78 38.7
step3 將USArrests保存到sqlserver
sqlSave(con, USArrests)
sqlSave(channel, dat, tablename = NULL, append = FALSE,
rownames = TRUE, colnames = FALSE, verbose = FALSE,
safer = TRUE, addPK = FALSE, typeInfo, varTypes,
fast = TRUE, test = FALSE, nastring = NULL)
channel 數(shù)據(jù)庫連接通道
dat data frame.要存入的數(shù)據(jù)集
tablename character 數(shù)據(jù)庫中表名
index character 索引列的名字
append logical邏輯變量 ,是否數(shù)據(jù)集添加/寫入已存在的表
rownames logical 邏輯變量 or character字符串,如果是logical,表示是否把rowname這個(gè)字符串作為數(shù)據(jù)庫表首列列名,如果是字符串,則將新字符串作為表首列列名
colnames logical 邏輯變量 是否將數(shù)據(jù)集的列名保留作為表的首行 (謹(jǐn)慎更改,可能導(dǎo)致,列名變成數(shù)據(jù)第一行,各列的
數(shù)據(jù)類型全部變?yōu)関archar(255)
verbose display statements as they are sent to the server?
safer logical邏輯變量.如果真,生成一個(gè)新表,不在已存在的表后添加。如果假,強(qiáng)制刪除已存在的同名表并新建,或者刪除表中已存在的數(shù)據(jù),覆蓋寫入
addPK logical邏輯變量,是否將首列作為主鍵
typeInfo list 數(shù)據(jù)框中數(shù)據(jù)類型。包括character ,double ,integer
varTypes an optional named character vector giving the DBMSs datatypes to be used for some (or all) of the columns if a table is to be created.可選項(xiàng)涉及各列數(shù)據(jù)類型轉(zhuǎn)換,因?yàn)閿?shù)據(jù)庫中的數(shù)據(jù)類型比
R語言中的要多很多。
fast logical. If false, write data a row at a time. If true, use a parametrized INSERT INTO or UPDATE query to write all the data in one operation. 邏輯變量,如果F,數(shù)據(jù)將一次一行地寫入,如果為Ture,將用到變量插入INSERT INTO,或者UPDATE 將數(shù)據(jù)一次性寫入
nastring optional character string to be used for writing NAs to the database. 選擇哪種字符串,將缺失項(xiàng)在數(shù)據(jù)庫中填充
getSqlTypeInfo("Microsoft
SQL Server")
$double
[1] "float"
$integer
[1] "int"
$character
[1] "varchar(255)"
$logical
[1] "varchar(5)"
sqlSave(con, USArrests,rownames = "city", addPK = T) # 原沒有列名的rownames改名為city,并設(shè)置首列為主鍵key

sqlSave(con, USArrests,'USA2',rownames = "city", addPK = T,fast=T,test=T) #注意此操作可能在數(shù)據(jù)庫中create名為USA2的空表
Binding: 'city' DataType 12, ColSize 255
Binding: 'Murder' DataType 6, ColSize 53
Binding: 'Assault' DataType 4, ColSize 10
Binding: 'UrbanPop' DataType 4, ColSize 10
Binding: 'Rape' DataType 6, ColSize 53
Parameters:
no: 1: city Alabama/***/no: 2: Murder 13.2/***/no: 3: Assault 236/***/no: 4: UrbanPop 58/***/no: 5: Rape 21.2/***/
no: 1: city Alaska/***/no: 2: Murder 10/***/no: 3: Assault 263/***/no: 4: UrbanPop 48/***/no: 5: Rape 44.5/***/
no: 1: city Arizona/***/no: 2: Murder 8.1/***/no: 3: Assault 294/***/no: 4: UrbanPop 80/***/no: 5: Rape 31/***/
no: 1: city Arkansas/***/no: 2: Murder 8.8/***/no: 3: Assault 190/***/no: 4: UrbanPop 50/***/no: 5: Rape 19.5/***/
no: 1: city California/***/no: 2: Murder 9/***/no: 3: Assault 276/***/no: 4: UrbanPop 91/***/no: 5: Rape 40.6/***/
no: 1: city Colorado/***/no: 2: Murder 7.9/***/no: 3: Assault 204/***/no: 4: UrbanPop 78/***/no: 5: Rape 38.7/***/
no: 1: city Connecticut/***/no: 2: Murder 3.3/***/no: 3: Assault 110/***/no: 4: UrbanPop 77/***/no: 5: Rape 11.1/***/
no: 1: city Delaware/***/no: 2: Murder 5.9/***/no: 3: Assault 238/***/no: 4: UrbanPop 72/***/no: 5: Rape 15.8/***/
no: 1: city Florida/***/no: 2: Murder 15.4/***/no: 3: Assault 335/***/no: 4: UrbanPop 80/***/no: 5: Rape 31.9/***/
no: 1: city Georgia/***/no: 2: Murder 17.4/***/no: 3: Assault 211/***/no: 4: UrbanPop 60/***/no: 5: Rape 25.8/***/
no: 1: city Hawaii/***/no: 2: Murder 5.3/***/no: 3: Assault 46/***/no: 4: UrbanPop 83/***/no: 5: Rape 20.2/***/
no: 1: city Idaho/***/no: 2: Murder 2.6/***/no: 3: Assault 120/***/no: 4: UrbanPop 54/***/no: 5: Rape 14.2/***/
no: 1: city Illinois/***/no: 2: Murder 10.4/***/no: 3: Assault 249/***/no: 4: UrbanPop 83/***/no: 5: Rape 24/***/
no: 1: city Indiana/***/no: 2: Murder 7.2/***/no: 3: Assault 113/***/no: 4: UrbanPop 65/***/no: 5: Rape 21/***/
no: 1: city Iowa/***/no: 2: Murder 2.2/***/no: 3: Assault 56/***/no: 4: UrbanPop 57/***/no: 5: Rape 11.3/***/
no: 1: city Kansas/***/no: 2: Murder 6/***/no: 3: Assault 115/***/no: 4: UrbanPop 66/***/no: 5: Rape 18/***/
# 此處省略10000字
sqlColumns Enquire about the column structure of tables on an ODBC database connection. 訪問數(shù)據(jù)庫表的結(jié)構(gòu)
columnsenquire<-sqlColumns(con,'USA2')
str(columnsenquire)
str(columnsenquire)
'data.frame': 5 obs. of 29 variables:
$ TABLE_CAT : chr "master" "master" "master" "master" ...
$ TABLE_SCHEM : chr "dbo" "dbo" "dbo" "dbo" ...
$ TABLE_NAME : chr "USA2" "USA2" "USA2" "USA2" ...
$ COLUMN_NAME : chr "city" "Murder" "Assault" "UrbanPop" ...
$ DATA_TYPE : int 12 6 4 4 6
$ TYPE_NAME : chr "varchar" "float" "int" "int" ...
$ COLUMN_SIZE : int 255 53 10 10 53
$ BUFFER_LENGTH : int 255 8 4 4 8
$ DECIMAL_DIGITS : int NA NA 0 0 NA
$ NUM_PREC_RADIX : int NA 2 10 10 2
$ NULLABLE : int 0 1 1 1 1
$ REMARKS : chr NA NA NA NA ...
$ COLUMN_DEF : chr NA NA NA NA ...
$
SQL_DATA_TYPE : int 12 6 4 4 6
$
SQL_DATETIME_SUB : int NA NA NA NA NA
$ CHAR_OCTET_LENGTH : int 255 NA NA NA NA
$ ORDINAL_POSITION : int 1 2 3 4 5
$ IS_NULLABLE : chr "NO" "YES" "YES" "YES" ...
$ SS_IS_SPARSE : int 0 0 0 0 0
$ SS_IS_COLUMN_SET : int 0 0 0 0 0
$ SS_IS_COMPUTED : int 0 0 0 0 0
$ SS_IS_IDENTITY : int 0 0 0 0 0
$ SS_UDT_CATALOG_NAME : chr NA NA NA NA ...
$ SS_UDT_SCHEMA_NAME : chr NA NA NA NA ...
$ SS_UDT_ASSEMBLY_TYPE_NAME : chr NA NA NA NA ...
$ SS_XML_SCHEMACOLLECTION_CATALOG_NAME: chr NA NA NA NA ...
$ SS_XML_SCHEMACOLLECTION_SCHEMA_NAME : chr NA NA NA NA ...
$ SS_XML_SCHEMACOLLECTION_NAME : chr NA NA NA NA ...
$ SS_DATA_TYPE : chr "39" "109" "38" "38" ...
step4 查詢數(shù)據(jù)
sqlQuery(con,'select * from USArrests')
# 注意此時(shí)第一列的名字已經(jīng)為city了
city Murder Assault UrbanPop Rape
1 Alabama 13.2 236 58 21.2
2 Alaska 10.0 263 48 44.5
3 Arizona 8.1 294 80 31.0
4 Arkansas 8.8 190 50 19.5
5 California 9.0 276 91 40.6
6 Colorado 7.9 204 78 38.7
7 Connecticut 3.3 110 77 11.1
8 Delaware 5.9 238 72 15.8
9 Florida 15.4 335 80 31.9
10 Georgia 17.4 211 60 25.8
11 Hawaii 5.3 46 83 20.2
12 Idaho 2.6 120 54 14.2
13 Illinois 10.4 249 83 24.0
14 Indiana 7.2 113 65 21.0
...
對于sql語句可能是以‘XX’ 結(jié)尾則需要用如下形式來進(jìn)行查詢
sqlQuery(con,paste('select * from USArrests',
"where city = 'Alabama'"))
city Murder Assault UrbanPop Rape
1 Alabama 13.2 236 58 21.2
但是對于Update,以下卻是失效的
a<-paste("update [master].[dbo].[USArrests]",
"set Murder =13.2","where city ='Alabama'")
sqlQuery(con,a) # 失效
sqlUpdate(con,a) # 失效
sqlUpdate() sqlUpdate(channel, dat, tablename = NULL, index = NULL,
verbose = FALSE, test = FALSE, nastring = NULL,
fast = TRUE)
不能進(jìn)行腳本語句直接更新,但是可以進(jìn)行如下操作
foo <- cbind(city=row.names(USArrests), USArrests)[1:3, c(1,3)]
foo
city Assault
Alabama Alabama 236
Alaska Alaska 263
Arizona Arizona 294
foo[1,2] <- 200
foo
city Assault
Alabama Alabama 200
Alaska Alaska 263
Arizona Arizona 294
流程是先選定要更新的行列,將值更新,然后再將值update入庫
實(shí)例如下
temp<-sqlQuery(con,paste('select * from USArrests',
"where city = 'Alabama'"))
temp
city Murder Assault UrbanPop Rape
1 Alabama 13.2 300 58 21.2
str(temp)
'data.frame': 1 obs. of 5 variables:
$ city : Factor w/ 1 level "Alabama": 1
$ Murder : num 13.2
$ Assault : num 300
$ UrbanPop: int 58
$ Rape : num 21.2
temp[1,]
[1] 300
temp[1,3]<-200
sqlUpdate(con, temp, "USArrests")
sqlQuery(con,paste('select * from USArrests',
"where city = 'Alabama'"))
city Murder Assault UrbanPop Rape
1 Alabama 13.2 200 58 21.2
sqlFetch(con, "USArrests", rownames = "city", max = 20,rows_at_time = 10)
實(shí)踐后發(fā)現(xiàn),單單對于簡單的
ETL,sqlQuery,sqlUpdate是足夠了,
寫一些for循環(huán)+list.files/liset.dir+reshape/ddplyr/tidyr(進(jìn)行數(shù)據(jù)篩選,清洗,變換),
對于腳本是否執(zhí)行的問題,可以寫日志文件,對
ETL過程進(jìn)行檢測。
step5 構(gòu)建日志文件
監(jiān)督機(jī)制,對于每次要寫入/更新的數(shù)據(jù)進(jìn)行count(originaldata),保存數(shù)據(jù),然后與每次執(zhí)行寫入/變更的行數(shù)Rows進(jìn)行比對,比對結(jié)果result為邏輯變量。將三者連同 執(zhí)行時(shí)間,寫成一個(gè)
數(shù)據(jù)框,并在數(shù)據(jù)中新建對應(yīng)的日志表,對執(zhí)行結(jié)果進(jìn)行監(jiān)督。(對于追蹤Rows沒有找到很好的解決方法,send a letter ...)
暴力美學(xué),對于sqlSave / sqlQuery 的執(zhí)行結(jié)果進(jìn)行追蹤,成功為“1”,失敗則"ON Errors"
step6 快速操作數(shù)據(jù)庫
sqlClear deletes all the rows of the table sqtable. #清楚表中數(shù)據(jù)
sqlDrop removes the table sqtable (if permitted). #刪除表
sqlClear(channel, sqtable, errors = TRUE)
sqlDrop(channel, sqtable, errors = TRUE)
CDA數(shù)據(jù)分析師考試相關(guān)入口一覽(建議收藏):
? 想報(bào)名CDA認(rèn)證考試,點(diǎn)擊>>>
“CDA報(bào)名”
了解CDA考試詳情;
? 想學(xué)習(xí)CDA考試教材,點(diǎn)擊>>> “CDA教材” 了解CDA考試詳情;
? 想加入CDA考試題庫,點(diǎn)擊>>> “CDA題庫” 了解CDA考試詳情;
? 想了解CDA考試含金量,點(diǎn)擊>>> “CDA含金量” 了解CDA考試詳情;