Text mining in R - finding the most frequent words in a string column of a data frame

Question

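Is there a way to find the most frequently used words in a string column of an R data frame? I have come across many functions that do this for a text corpus, but none that work on a data frame. I need to do this on a data frame in order to create "metadata" for products. Below is an example of the data I have and the result I am trying to reach. Any help is greatly appreciated. Thanks!

[Table: grocery store product data]

Now I want to find the most frequent words in the "combineall" column and list them in a new column next to it. Basically, I am trying to create metadata from the product descriptions. Thanks again!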

Answer 1

If you use stringr, I see this as a two-step process.

The first step is to extract the information from the "combineall" column, like this:

library(magrittr)  # provides the %>% pipe
DF2 <- DF %>% stringr::str_glue_data("{rownames(.)} combineall: {combineall}")

A base R alternative would be

do.call(sprintf, c(fmt = "combineall: %s", DF))

Then you can try the following simple function to count the words

# function to count words in a string
countwords = function(strings){
  
  # remove extra spaces between words
  wr = gsub(pattern = " {2,}", replacement=" ", x=strings)
  
  # remove line breaks
  wn = gsub(pattern = '\n', replacement=" ", x=wr)
  
  # remove punctuations
  ws = gsub(pattern="[[:punct:]]", replacement="", x=wn)
  
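  # split into words and flatten the per-string list into one vector,
  # so that table() counts across all strings
  wsp = unlist(strsplit(ws, " "))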
  
  # sort words in table
  wst = data.frame(sort(table(wsp, exclude=""), decreasing=TRUE))
  wst
}
countwords(DF2)

Then add the most frequent words back to your data. I hope this is what you are after and that it helps.
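That last step is not shown above, so here is a minimal sketch of one way to do it. It assumes the data frame is called DF, that combineall is a character column, and that "most frequent" is meant per row. The helper name top_words and the sample rows are made up for illustration; the real product table is not reproduced here.

```r
# Hypothetical sample data standing in for the product table
DF <- data.frame(
  combineall = c("fresh milk whole milk 1l", "brown bread sliced bread bread"),
  stringsAsFactors = FALSE
)

# Illustrative helper: return the n most frequent words of a single string
top_words <- function(string, n = 2) {
  # lower-case the text and replace punctuation with spaces
  clean <- gsub("[[:punct:]]", " ", tolower(string))
  # split on runs of whitespace and drop empty tokens
  words <- unlist(strsplit(clean, "[[:space:]]+"))
  words <- words[words != ""]
  if (length(words) == 0) return("")
  # count the words and sort by frequency, highest first
  counts <- sort(table(words), decreasing = TRUE)
  paste(names(counts)[seq_len(min(n, length(counts)))], collapse = ",")
}

# Apply the helper to every row and store the result in a new column
DF$topwords <- vapply(DF$combineall, top_words, character(1),
                      n = 1, USE.NAMES = FALSE)
```

If you want the most frequent words across the whole column rather than per row, count once over all strings and match each row against that list instead.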

Answer 2

Sample data:

df <- data.frame(
  combineall = c("some words", "more of the same", "again words", "different items", "and more and more")
)

Build a frequency table:

freqtable <- sort(table(unlist(strsplit(df$combineall, " "))), decreasing = T)

Select the 3 most frequent words and combine them into an alternation pattern:

top3 <- paste0("(", paste0("\\b", names(freqtable)[1:3], "\\b", collapse = "|"), ")")
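To see what an alternation pattern with word boundaries does, here is a small check. The word list is typed in by hand for illustration (in the answer it comes from names(freqtable)[1:3]), and the test strings are made up.

```r
# Illustrative word list; in the answer these come from names(freqtable)[1:3]
terms <- c("more", "and", "words")

# Wrap each term in word boundaries and join the alternatives with "|"
pattern <- paste0("(", paste0("\\b", terms, "\\b", collapse = "|"), ")")

# grepl() is TRUE only where one of the terms matches as a whole word
hits <- grepl(pattern, c("and", "android", "more", "anymore", "items"))
```

The \\b boundaries are what keep "and" from matching inside "android" and "more" from matching inside "anymore".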

Now lapply grep (with the argument value = T) over the column combineall to pick out the 3 most common words:

df$top3 <- lapply(strsplit(df$combineall, " "), 
                  function(x) paste0(grep(top3, x, value = T), collapse = ","))

Result:

top3 now lists which of the top 3 words occur in each string of combineall:

df
         combineall              top3
1        some words             words
2  more of the same              more
3       again words             words
4   different items                  
5 and more and more and,more,and,more
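Row 5 above repeats words because every matching token is kept. If you would rather list each word once, a possible variant (a sketch, not part of the original answer) skips the regex and uses %in% with unique(). The names top_terms and top3_unique are made up here, and vapply() is used so the new column is a plain character vector rather than a list.

```r
# Same sample data as in the answer
df <- data.frame(
  combineall = c("some words", "more of the same", "again words",
                 "different items", "and more and more"),
  stringsAsFactors = FALSE
)

# Frequency table over all words in the column, highest count first
freqtable <- sort(table(unlist(strsplit(df$combineall, " "))), decreasing = TRUE)
top_terms <- names(freqtable)[1:3]

# For each row, keep the words that are in the top 3, each listed once
df$top3_unique <- vapply(
  strsplit(df$combineall, " "),
  function(x) paste(unique(x[x %in% top_terms]), collapse = ","),
  character(1)
)
```

With this sample data the last row becomes "and,more" instead of "and,more,and,more".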

Answer 1
0 2020-07-27 05:32:03

Answer 2
0 2020-07-27 12:51:20
