Counting specific word frequencies in R

Question

I have a data set in which I have split the text of journal abstracts so that there is one word per row. This has led to over 5 million rows, but I only want the counts of certain words. Below is an example of the data:

1 rna
1 synthesis
1 resembles
1 copy
1 choice
1 rna
1 recombination
1 process
1 nascent
1 rna

So in that example, say I want just the rna count: I would get 3 and that's it. I have already run a word count on the whole set:

wordCount <- m3 %>% count(word, sort = TRUE)

But this is not as useful to me, since many of the words aren't relevant to what I am trying to get at.

Any help would be welcome.

Answer 1

You can group_by the word, count the occurrences of each unique word, and then subset the ones you want.

library(tidyverse)

# Toy data: one word per row, as in the question
data <- data.frame(word = c("rna",
                            "synthesis",
                            "resembles",
                            "copy",
                            "choice",
                            "rna",
                            "recombination",
                            "process",
                            "nascent",
                            "rna"))

counts <- data %>% 
  group_by(word) %>% 
  count()

counts[which(counts$word == "rna"),]

# A tibble: 1 x 2
# Groups:   word [1]
  word      n
  <fct> <int>
1 rna       3

Or, using dplyr subsetting:

 counts %>% filter(word == "rna")
# A tibble: 1 x 2
# Groups:   word [1]
  word      n
  <fct> <int>
1 rna       3

Piping it all through at once:

 data %>% 
   group_by(word) %>% 
   count() %>%
   filter(word == "rna")
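
Since the question asks about "certain words" (plural) in a table of over 5 million rows, it may help to filter before counting rather than after. The following is a minimal sketch of that idea, not part of the original answer; the `targets` vector is a hypothetical list of words of interest, and the toy data is repeated so the block runs on its own.

```r
library(dplyr)

# Same toy data as above, repeated so this block is self-contained
data <- data.frame(word = c("rna", "synthesis", "resembles", "copy", "choice",
                            "rna", "recombination", "process", "nascent", "rna"))

# Hypothetical list of words of interest (an assumption for illustration)
targets <- c("rna", "recombination")

target_counts <- data %>%
  filter(word %in% targets) %>%  # keep only rows for the wanted words
  count(word, sort = TRUE)       # then count each remaining word

target_counts
```

Filtering first means only the rows for the wanted words are grouped and counted, which should be lighter on a large table than counting every distinct word and discarding most of the result.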

A one-liner solution with data.table:

library(data.table)
setDT(data)
data[word == "rna", .N, by = word]

   word N
1:  rna 3
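
If loading a package is not desirable, the same counts can probably be obtained with base R alone. This is a sketch under the assumption that the words are available as a plain character vector (here called `words`, a name chosen for illustration); `targets` is again a hypothetical list of wanted words.

```r
# Toy data as a plain character vector (hypothetical name `words`)
words <- c("rna", "synthesis", "resembles", "copy", "choice",
           "rna", "recombination", "process", "nascent", "rna")

# Count of a single word: sum over a logical comparison
rna_count <- sum(words == "rna")

# Counts for several words at once. Restricting the factor levels to the
# targets turns every other word into NA, which table() ignores by default;
# a target that never occurs is reported with a count of 0.
targets <- c("rna", "recombination")
base_counts <- table(factor(words, levels = targets))

rna_count
base_counts
```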

Answer 1 (accepted), 2020-05-21 20:39:50
