Counting the number of English words in a string in R

Question

I want to count the number of English words in a string of text.

df.words <- data.frame(ID = 1:2,
              text = c(c("frog friend fresh frink foot"),
                       c("get give gint gobble")))

df.words

  ID                         text
1  1 frog friend fresh frink foot
2  2         get give gint gobble

I would like the final result to look like this:

  ID                         text count
1  1 frog friend fresh frink foot     4
2  2         get give gint gobble     3

I guess I have to split on spaces first and then check each word against a dictionary?

Answer 1

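Building on @r2evans's suggestion to use strsplit(), and using an English word-list .txt file found online as the dictionary, an example is below. Because of the unnest step, which creates one row per word, this solution may not scale well if you have a lot of comparisons to make.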

library(dplyr)
library(tidyr)

# text file with 479k English words ~4MB
dict <- read.table(file = url("https://github.com/dwyl/english-words/raw/master/words_alpha.txt"), col.names = "text2")

df.words <- data.frame(ID = 1:2,
                       text = c(c("frog friend fresh frink foot"),
                                c("get give gint gobble")),
                       stringsAsFactors = FALSE)

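df.words %>% 
  # split each text on whitespace into a list column of words
  mutate(text2 = strsplit(text, split = "\\s+")) %>% 
  # expand to one row per word
  unnest(text2) %>% 
  # keep only the words that appear in the dictionary
  semi_join(dict, by = "text2") %>% 
  group_by(ID, text) %>% 
  # count the remaining words for each original row
  summarise(count = n(), .groups = "drop")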

Output

     ID text                         count
  <int> <chr>                        <int>
1     1 frog friend fresh frink foot     4
2     2 get give gint gobble             3
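
One caveat with semi_join(): a row whose text contains no dictionary words disappears from the result instead of getting a count of 0. The following is a minimal sketch of one way to keep such rows, assuming dplyr >= 1.0 and tidyr >= 1.0. Here dict_small is a hypothetical stand-in for the downloaded dict, and the third row is made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical tiny dictionary standing in for the downloaded word list
dict_small <- data.frame(
  text2 = c("frog", "friend", "fresh", "foot", "get", "give", "gobble"),
  stringsAsFactors = FALSE
)

# Third row is invented here: it contains no dictionary words
df_demo <- data.frame(
  ID = 1:3,
  text = c("frog friend fresh frink foot", "get give gint gobble", "zzz qqq"),
  stringsAsFactors = FALSE
)

res <- df_demo %>%
  mutate(text2 = strsplit(text, "\\s+")) %>%
  unnest(text2) %>%
  # flag hits instead of filtering, so rows with no hits survive
  mutate(hit = text2 %in% dict_small$text2) %>%
  group_by(ID, text) %>%
  summarise(count = sum(hit), .groups = "drop")

res$count
# [1] 4 3 0
```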

Answer 2

A base R alternative, using EJJ's excellent recommendation for dict:

sapply(strsplit(df.words$text, "\\s+"),
       function(z) sum(z %in% dict$text2))
# [1] 4 3
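
To try the same idea without downloading the word list, here is a minimal sketch with a tiny stand-in dictionary. Both dict_small and count_words() are hypothetical names that do not appear in the answers. The sketch also lowercases the text and strips punctuation first, on the assumption that the dictionary holds only lowercase alphabetic words, so that "Frog," would otherwise not match.

```r
# Hypothetical stand-in for dict$text2
dict_small <- c("frog", "friend", "fresh", "foot", "get", "give", "gobble")

# Count dictionary words per string (illustrative helper, not from the answers)
count_words <- function(text, dictionary) {
  # drop everything except letters and whitespace, then lowercase
  cleaned <- tolower(gsub("[^[:alpha:][:space:]]", "", text))
  tokens <- strsplit(trimws(cleaned), "\\s+")
  # one integer per input string
  vapply(tokens, function(z) sum(z %in% dictionary), integer(1))
}

counts <- count_words(c("Frog, friend fresh frink foot!",
                        "get give gint gobble"),
                      dict_small)
counts
# [1] 4 3
```

vapply() is used instead of sapply() so the result is always an integer vector, even for empty input.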

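I thought this would be the clear winner on speed, but apparently running sum(. %in% .) once per row can be a little expensive. (It is slower with this data.)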

Faster, but not necessarily simpler:

words <- strsplit(df.words$text, "\\s+")
words <- sapply(words, `length<-`, max(lengths(words)))
found <- array(words %in% dict$text2, dim = dim(words))
colSums(found)
# [1] 4 3
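
To see why this works, it may help to look at the intermediate values. The sketch below repeats the same steps with a hypothetical small dictionary (dict_small) in place of dict$text2. `length<-` pads the shorter word vectors with NA so that sapply() can return a matrix with one column per text, and a single %in% call then checks every cell at once.

```r
# Hypothetical stand-in for dict$text2
dict_small <- c("frog", "friend", "fresh", "foot", "get", "give", "gobble")

words <- strsplit(c("frog friend fresh frink foot",
                    "get give gint gobble"), "\\s+")

# Pad every vector to the longest length; shorter ones are filled with NA
padded <- sapply(words, `length<-`, max(lengths(words)))
dim(padded)
# [1] 5 2
padded[5, 2]
# [1] NA

# %in% returns a plain vector, so restore the matrix shape before summing;
# the NA padding is not in the dictionary and therefore counts as FALSE
found <- array(padded %in% dict_small, dim = dim(padded))
colSums(found)
# [1] 4 3
```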

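This is faster than EJJ's solution (by about 10-15%), so it is probably only worth using if you need to wring some extra performance out of this step.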

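(Caveat: with this 2-row dataset, EJJ's solution is faster. With data 1000 times larger, my first solution is a little faster and my second solution is about twice as fast. Benchmarks are just benchmarks, though: don't optimize code at the expense of usability if speed/time is not a critical factor.)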
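
A rough way to check this kind of claim on your own data is sketched below, using only base R's system.time(). The names dict_small, per_row() and padded_count() are hypothetical. With such a tiny dictionary the timings will not match the figures above, so no expected timings are shown; the point is only the shape of the comparison.

```r
# Hypothetical stand-in for dict$text2; the real list is far larger
dict_small <- c("frog", "friend", "fresh", "foot", "get", "give", "gobble")

# The 2-row example repeated 1000 times
texts <- rep(c("frog friend fresh frink foot", "get give gint gobble"),
             times = 1000)

# First base R approach: one %in% call per row
per_row <- function(txt, dictionary) {
  sapply(strsplit(txt, "\\s+"), function(z) sum(z %in% dictionary))
}

# Second base R approach: pad to a matrix, one %in% call overall
padded_count <- function(txt, dictionary) {
  words <- strsplit(txt, "\\s+")
  words <- sapply(words, `length<-`, max(lengths(words)))
  colSums(array(words %in% dictionary, dim = dim(words)))
}

print(system.time(r1 <- per_row(texts, dict_small)))
print(system.time(r2 <- padded_count(texts, dict_small)))

# Both approaches should agree on the counts
all(r1 == r2)
# [1] TRUE
```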

2 answers

Answer 1
1, accepted, 2021-04-24 19:05:12

Answer 2
1, 2021-04-24 20:52:06