R: Counting the frequency of words from a predefined dictionary

Question

I have a very large dataset that looks like this: one column contains names, and the second column contains their respective (very long) texts. I also have a predefined dictionary that contains at least 20 terms. How can I count the number of times these keywords occur in each row of my data frame? I have tried str_detect, grep(l), and %>% like, and looped over each row, but the problem seems to be that I want to detect too many terms, and these functions stop working when I use 15 or more terms.

Would be sooo happy if anyone could help me out with this!

col1 <- c("Henrik", "Joseph", "Lucy")
col2 <- c("I am going to get groceries", "He called me at six.", "No, he did not")
df <- data.frame(col1, col2)
dict <- c("groceries", "going", "me") # but my actual dictionary is much larger

Answer 1

Create a unique identifier for your rows. Split col2 into words, one per row. Keep only the words that appear in your dict. Then count the matches for each row. Finally, join the counts back to the original df and set NA to zero for rows that don't contain any words from your dict.

library(dplyr)

col1 <- c("A","B","A")
col2 <- c("I am going to get groceries", "He called me at six.", "No, he did not")
df <- data.frame(col1, col2, stringsAsFactors = FALSE)
dict <- c("groceries", "going", "me")

df <- df %>% mutate(row=row_number()) %>% select(row, everything())

counts <- df %>% tidyr::separate_rows(col2) %>% filter(col2 %in% dict) %>% group_by(row) %>% count(name = "counts")

final <- left_join(df, counts, by="row") %>% tidyr::replace_na(list(counts=0L))
final
#>   row col1                        col2 counts
#> 1   1    A I am going to get groceries      2
#> 2   2    B        He called me at six.      1
#> 3   3    A              No, he did not      0

Answer 2

Here is a base R option using gregexpr:

dfout <- within(
  df,
  counts <- sapply(
    gregexpr(paste0(dict, collapse = "|"), col2),
    function(x) sum(x > 0)
  )
)

or

dfout <- within(
  df,
  counts <- sapply(
    regmatches(col2, gregexpr("\\w+", col2)),
    function(v) sum(v %in% dict)
  )
)

which gives

> dfout
  col1                        col2 counts
1    1 I am going to get groceries      2
2    2        He called me at six.      1
3    3              No, he did not      0

Data

structure(list(col1 = 1:3, col2 = c("I am going to get groceries", 
"He called me at six.", "No, he did not")), class = "data.frame", row.names = c(NA, 
-3L))
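
One caveat with the first option: the pattern built with paste0(dict, collapse = "|") also matches inside longer words, so "me" is counted in "comes". Below is a minimal sketch of one way to restrict the match to whole words, assuming the dictionary terms contain no regex metacharacters. The example sentence "He comes to me" and the helper name count_matches are made up for illustration and are not part of the original answer.

```r
# Illustrative data; the first sentence is invented to show the substring issue
texts <- c("He comes to me", "I am going to get groceries", "No, he did not")
dict <- c("groceries", "going", "me")

# Plain alternation: matches a term anywhere, including inside other words
pattern_plain <- paste0(dict, collapse = "|")
# Same alternation wrapped in word boundaries (assumes terms have no regex metacharacters)
pattern_words <- paste0("\\b(?:", pattern_plain, ")\\b")

# Hypothetical helper: number of matches of `pattern` in each element of `text`
count_matches <- function(pattern, text) {
  vapply(gregexpr(pattern, text, perl = TRUE), function(x) sum(x > 0), integer(1))
}

count_matches(pattern_plain, texts)
#> [1] 2 2 0
count_matches(pattern_words, texts)
#> [1] 1 2 0
```

Whether substring matches are wanted depends on the dictionary, so treat the boundary version as an option rather than a fix.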

Answer 3

I think my solution gives you the output you want: for each word in your "dict" vector, you can see how many times it appears in each sentence. Each row is an entry in df$col2, i.e. a sentence. "dict" is the vector of terms that you're looking to match. We can loop over the vector and, for each entry, count how many times that entry appears in each row/sentence using stringr::str_count. Note the syntax for str_count: str_count(string being checked, expression you're trying to match)

str_count returns a vector showing how many times the word appears in each row. I create a data frame from these vectors, which will contain the same number of rows as there are entries in the dict vector. Then you can just cbind "dict" to that data frame and see how many times each word is used in each sentence. I adjust the column names at the very end so you can match the words to the sentence numbers. Note that if you want to calculate row means, you'll need to drop the "dict" column of the final data frame because it is character.

library(stringr)

col1 <- c("Henrik", "Joseph", "Lucy")
col2 <- c("I am going to get groceries", "He called me at six.", "No, he did not")
df <- data.frame(col1, col2)
dict <- c("groceries", "going", "me")

# Build one row per dictionary term, one column per sentence
word_matches <- data.frame()
for (i in dict) {
  # Number of times term i occurs in each sentence
  word_tot <- str_count(df$col2, i)
  word_matches <- rbind(word_matches, word_tot)
}
word_matches
# Label the columns by sentence number, then attach the terms
colnames(word_matches) <- paste("Sentence", 1:ncol(word_matches))
cbind(dict, word_matches)


       dict Sentence 1 Sentence 2 Sentence 3
1 groceries          1          0          0
2     going          1          0          0
3        me          0          1          0
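
The table above has one row per term, while the question asks for a count per row of the data frame. As a rough sketch, the same per-term counts can be laid out the other way round (one row per text, one column per term) and summed across terms. This version uses base R only and matches terms as fixed strings; the helper name count_term is an assumption made for this example, not something from the answer.

```r
col2 <- c("I am going to get groceries", "He called me at six.", "No, he did not")
dict <- c("groceries", "going", "me")

# Hypothetical helper: fixed-string occurrences of one term in each element of `text`
count_term <- function(term, text) {
  lengths(regmatches(text, gregexpr(term, text, fixed = TRUE)))
}

# Matrix with one row per text and one column per dictionary term
term_counts <- vapply(dict, count_term, integer(length(col2)), text = col2)

# Total number of dictionary hits per text
total_counts <- rowSums(term_counts)
total_counts
#> [1] 2 1 0
```

total_counts can then be added to the data frame as a new column, for example with df$counts <- total_counts.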

3 answers

Answer 1
0 Accepted 2020-08-07 11:02:39

Answer 2
0 2020-08-07 11:55:59

Answer 3
0 2020-08-07 18:28:31