2018-10-30
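LDA Topic Model Analysis: Learning Notes
I have recently been studying the LDA model in order to do some fine-grained opinion mining. Along the way I found that R has a dedicated LDA package.
I used LDA to build a topic model from two text documents, A and B. Document A is strongly related to computer science, and document B is strongly related to earth science. I then trained the LDA model with the following commands.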
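library(tm)           # corpus handling and text cleaning
library(topicmodels)  # provides LDA() and posterior()

text <- c(A, B)  # the two documents introduced above
r <- Corpus(VectorSource(text))               # create corpus object
r <- tm_map(r, content_transformer(tolower))  # convert all text to lower case
r <- tm_map(r, removePunctuation)
r <- tm_map(r, removeNumbers)
r <- tm_map(r, removeWords, stopwords("english"))
# LDA() expects documents in rows, so build a DocumentTermMatrix;
# wordLengths = c(3, Inf) keeps terms of at least three characters
r.dtm <- DocumentTermMatrix(r, control = list(wordLengths = c(3, Inf)))
my_lda <- LDA(r.dtm, 2)  # fit a two-topic model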
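Before fitting, it can help to check the shape of the matrix the model receives. LDA() in topicmodels expects a document-term matrix, with one row per document and one column per term. The base-R sketch below builds such a count matrix by hand for two made-up sentences, just to show the layout. The sentences are toy stand-ins, not the actual documents A and B, and the helper name count_terms is hypothetical.

```r
# Toy stand-ins for documents A and B (made-up text, for illustration only)
docs <- c(A = "algorithm compiler algorithm network",
          B = "volcano sediment network volcano")

# Shared vocabulary across both documents
vocab <- sort(unique(unlist(strsplit(docs, "\\s+"))))

# Hypothetical helper: split a string on whitespace and count each vocabulary term
count_terms <- function(doc, vocab) {
  tokens <- strsplit(doc, "\\s+")[[1]]
  counts <- table(factor(tokens, levels = vocab))
  setNames(as.integer(counts), vocab)
}

# sapply() returns terms in rows, so transpose to get one row per document
dtm <- t(sapply(docs, count_terms, vocab = vocab))
print(dtm)
```

A term-document matrix is the transpose of this layout, so it is worth printing dim() on the real matrix to confirm which way round it is.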
Now I want to use my_lda to predict the topic of a new document, C, and see whether it is closer to computer science or earth science. As far as I know, the prediction can be run with this code.
x <- C  # a new document (a long string), introduced above, to classify
rp <- Corpus(VectorSource(x))                   # create corpus object
rp <- tm_map(rp, content_transformer(tolower))  # convert all text to lower case
rp <- tm_map(rp, removePunctuation)
rp <- tm_map(rp, removeNumbers)
rp <- tm_map(rp, removeWords, stopwords("english"))
# posterior() needs documents in rows, so use a DocumentTermMatrix
rp.dtm <- DocumentTermMatrix(rp, control = list(wordLengths = c(3, Inf)))
test.topics <- posterior(my_lda, rp.dtm)  # list with $terms and $topics
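You can extract the most likely terms from the fitted LDA topic model and use them to replace those black-box numeric topic names, with as many terms per name as you like.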
> library(topicmodels)
> data(AssociatedPress)
>
> train <- AssociatedPress[1:100]
> test <- AssociatedPress[101:150]
>
> train.lda <- LDA(train,2)
>
> #returns those black box names
> test.topics <- posterior(train.lda,test)$topics
> head(test.topics)
1 2
[1,] 0.57245696 0.427543038
[2,] 0.56281568 0.437184320
[3,] 0.99486888 0.005131122
[4,] 0.45298547 0.547014530
[5,] 0.72006712 0.279932882
[6,] 0.03164725 0.968352746
> #extract top 5 terms for each topic and assign as variable names
> colnames(test.topics) <- apply(terms(train.lda,5),2,paste,collapse=",")
> head(test.topics)
percent,year,i,new,last new,people,i,soviet,states
[1,] 0.57245696 0.427543038
[2,] 0.56281568 0.437184320
[3,] 0.99486888 0.005131122
[4,] 0.45298547 0.547014530
[5,] 0.72006712 0.279932882
[6,] 0.03164725 0.968352746
> #round to one topic if you'd prefer
> test.topics <- apply(test.topics,1,function(x) colnames(test.topics)[which.max(x)])
> head(test.topics)
[1] "percent,year,i,new,last" "percent,year,i,new,last" "percent,year,i,new,last"
[4] "new,people,i,soviet,states" "percent,year,i,new,last" "new,people,i,soviet,states"
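Picking the single most likely topic hides how close some of these calls are: in the output above, documents 1, 2, and 4 are split roughly 55/45 between the two topics. One possible refinement, sketched below, is to keep a topic label only when its probability clears a cutoff. The matrix re-enters the six rows printed above by hand so the snippet runs without topicmodels. The 0.6 cutoff, the column names topic1 and topic2, and the "mixed" label are arbitrary assumptions for illustration.

```r
# The six posterior rows printed earlier, typed in by hand
probs <- matrix(c(0.57245696, 0.427543038,
                  0.56281568, 0.437184320,
                  0.99486888, 0.005131122,
                  0.45298547, 0.547014530,
                  0.72006712, 0.279932882,
                  0.03164725, 0.968352746),
                ncol = 2, byrow = TRUE,
                dimnames = list(NULL, c("topic1", "topic2")))

cutoff <- 0.6  # assumed threshold, adjust as needed

top_prob  <- apply(probs, 1, max)                         # highest probability per document
top_topic <- colnames(probs)[apply(probs, 1, which.max)]  # name of that topic

# Keep the topic name only when the top probability reaches the cutoff
labels <- ifelse(top_prob >= cutoff, top_topic, "mixed")
print(labels)
```

With this cutoff, documents 1, 2, and 4 come out as "mixed" and the other three keep their topic, which may be a more honest summary when only two topics are fitted.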





