Text mining in R - Finding the most frequently occurring word from a column of strings in a data frame
If you use stringr, I would treat this as a two-step process.
The first step is to extract the information from the "combineall" column, like this:
library(magrittr)  # provides the %>% pipe
DF2 <- DF %>% stringr::str_glue_data("{rownames(.)} combineall: {combineall}")
A base R alternative would be:
do.call(sprintf, c(fmt = "combineall: %s", DF))
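The base R line above does not include the row names that the stringr version adds. A minimal sketch of one way to include them, assuming a hypothetical two-row DF (the original DF is not shown in the post) and an illustrative variable name `labelled`:

```r
# Hypothetical sample data; the original DF is not shown in the post
DF <- data.frame(
  combineall = c("some words", "more of the same"),
  stringsAsFactors = FALSE
)

# Build one labelled string per row, prefixed with the row name
labelled <- sprintf("%s combineall: %s", rownames(DF), DF$combineall)
print(labelled)
```

Because `sprintf` is vectorised over its arguments, no explicit loop or `do.call` is needed here.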
Then you can try the following simple function to count the words:
# function to count words in a character vector
countwords <- function(strings) {
  # replace line breaks with spaces
  wn <- gsub(pattern = "\n", replacement = " ", x = strings)
  # remove punctuation
  ws <- gsub(pattern = "[[:punct:]]", replacement = "", x = wn)
  # collapse runs of spaces into a single space
  wr <- gsub(pattern = " {2,}", replacement = " ", x = ws)
  # split into words and flatten the resulting list into one vector
  words <- unlist(strsplit(wr, " "))
  # tabulate the words, most frequent first
  wst <- data.frame(sort(table(words, exclude = ""), decreasing = TRUE))
  wst
}
countwords(DF2)
Then add the most frequent words back to your data. I hope this is what you wanted and that it helps.
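Adding the most frequent word back to the data could look roughly like the sketch below. It works per row, which is only one possible reading of the step. The sample DF, the helper `most_frequent_word`, and the column name `top_word` are all hypothetical and not part of the original answer.

```r
# Hypothetical sample data standing in for the original DF
DF <- data.frame(
  combineall = c("some words", "and more and more more"),
  stringsAsFactors = FALSE
)

# Return the most frequent word of a single string
# (on ties, the alphabetically first word wins, because table() sorts names)
most_frequent_word <- function(s) {
  s <- gsub("[[:punct:]]", "", s)        # drop punctuation
  words <- unlist(strsplit(s, "\\s+"))   # split on any whitespace
  words <- words[words != ""]            # drop empty tokens
  if (length(words) == 0) return(NA_character_)
  names(which.max(table(words)))
}

# Add the per-row result as a new character column
DF$top_word <- vapply(DF$combineall, most_frequent_word, character(1),
                      USE.NAMES = FALSE)
print(DF)
```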
Sample data:
df <- data.frame(
combineall = c("some words", "more of the same", "again words", "different items", "and more and more")
)
Build a frequency table:
freqtable <- sort(table(unlist(strsplit(df$combineall, " "))), decreasing = T)
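To inspect the frequency table before building a pattern from it, something like the following sketch may help. It repeats the sample data so that it runs on its own; `top_word` and `top_count` are illustrative names.

```r
df <- data.frame(
  combineall = c("some words", "more of the same", "again words",
                 "different items", "and more and more"),
  stringsAsFactors = FALSE
)

# Split every string on spaces, pool all words, count and sort them
freqtable <- sort(table(unlist(strsplit(df$combineall, " "))), decreasing = TRUE)

# The most frequent word and how often it occurs
top_word <- names(freqtable)[1]
top_count <- as.integer(freqtable[1])

# Show the three most frequent entries
print(head(freqtable, 3))
```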
Select the 3 most frequent words and define them as an alternation pattern:
top3 <- paste0("(", paste0("\\b", names(freqtable)[1:3], "\\b", collapse = "|"), ")")
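To see what an alternation pattern of this kind looks like and how the `\\b` word boundaries behave, here is a small sketch. It assumes the three top words are "more", "and", and "words" (hard-coded here for illustration); `top_words`, `pattern`, and `hits` are illustrative names.

```r
# Assumed top three words for the sample data (hard-coded for illustration)
top_words <- c("more", "and", "words")

# Wrap each word in word boundaries and join the alternatives with "|"
pattern <- paste0("(", paste0("\\b", top_words, "\\b", collapse = "|"), ")")
print(pattern)

# Whole words match; longer words that merely contain one do not
hits <- grepl(pattern, c("more", "moreover", "and", "items"))
print(hits)
```

The word boundaries are what keep "moreover" from being counted as a match for "more".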
Now use lapply and grep (with the argument value = T) to match the 3 most frequent words in the column combineall:
df$top3 <- lapply(strsplit(df$combineall, " "),
function(x) paste0(grep(top3, x, value = T), collapse = ","))
Result: top3 now lists which of the top 3 items occur in each string of combineall:
df
         combineall              top3
1        some words             words
2  more of the same              more
3       again words             words
4   different items
5 and more and more and,more,and,more
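Row 5 repeats its matches ("and,more,and,more"), and lapply stores the result as a list column. If a plain character column without repeats is preferred, one possible variant is sketched below. Note that it swaps the regular expression for a direct `%in%` lookup and hard-codes the assumed top words; `top_words` is an illustrative name.

```r
df <- data.frame(
  combineall = c("some words", "more of the same", "again words",
                 "different items", "and more and more"),
  stringsAsFactors = FALSE
)

# Assumed top three words for the sample data (hard-coded for illustration)
top_words <- c("more", "and", "words")

# For each string: keep the tokens that are top words, drop repeats,
# and join them with commas; vapply guarantees a character vector
df$top3 <- vapply(
  strsplit(df$combineall, " "),
  function(x) paste(unique(x[x %in% top_words]), collapse = ","),
  character(1)
)
print(df)
```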