
Counting Specific Word Frequency In R

I have a data set in which I have split the text of journal abstracts so that each row holds one word. This has led to over 5 million rows, but I only want the counts for certain words. Below is an example of the data:

  • 1 rna
  • 1 synthesis
  • 1 resembles
  • 1 copy
  • 1 choice
  • 1 rna
  • 1 recombination
  • 1 process
  • 1 nascent
  • 1 rna

In that example, if I only want the count for rna, I should get 3 and nothing else. I have already run a word count on the whole set, but that is not as useful to me:

wordCount <- m3 %>% count(word, sort = TRUE)

Many of the words in that result aren't helpful for what I am trying to get at.

Any help would be welcome.

You can group_by the word column, count the occurrences of each unique word, and then subset the result to the words you want.

library(tidyverse)
data <- data.frame(word = c("rna",
                            "synthesis",
                            "resembles",
                            "copy",
                            "choice",
                            "rna",
                            "recombination",
                            "process",
                            "nascent",
                            "rna"))

counts <- data %>% 
  group_by(word) %>% 
  count()

counts[which(counts$word == "rna"),]

# A tibble: 1 x 2
# Groups:   word [1]
  word      n
  <fct> <int>
1 rna       3
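If only one or two counts are needed, base R can do the same without any packages. The following is a minimal sketch, assuming the words are available as a plain character vector; the names words, rna_count, and word_table are illustrative and not from the original question.

```r
# Same toy data as above, held as a plain character vector (illustrative name)
words <- c("rna", "synthesis", "resembles", "copy", "choice",
           "rna", "recombination", "process", "nascent", "rna")

# Count one word directly: the comparison yields TRUE/FALSE values,
# and sum() counts the TRUEs
rna_count <- sum(words == "rna")

# Or tabulate every word once and look up individual words by name
word_table <- table(words)
rna_from_table <- word_table[["rna"]]
```

The sum() form scans the vector once per word, so it suits a handful of lookups; table() builds all counts in one pass and may be the better choice when many different words will be looked up.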

Or, using dplyr's filter() to subset:

counts %>% filter(word == "rna")
# A tibble: 1 x 2
# Groups:   word [1]
  word      n
  <fct> <int>
1 rna       3

Piping it all through at once:

 data %>% 
   group_by(word) %>% 
   count() %>%
   filter(word == "rna")
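Since the question asks about several words of interest, the same pipeline can be extended with %in%. This is a sketch under the assumption that the wanted words are kept in a character vector; targets and target_counts are hypothetical names. Filtering before counting also means only the matching rows are grouped, which may help on a table with millions of rows.

```r
library(dplyr)

# Same toy data as in the answer above
data <- data.frame(word = c("rna", "synthesis", "resembles", "copy", "choice",
                            "rna", "recombination", "process", "nascent", "rna"))

# Hypothetical list of words of interest
targets <- c("rna", "recombination")

target_counts <- data %>%
  filter(word %in% targets) %>%  # keep only rows whose word is wanted
  count(word, sort = TRUE)       # count what is left, most frequent first
```

Note that a target word that never occurs in the data simply has no row in the result, rather than a row with a count of 0.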

A one-liner solution with data.table:

library(data.table)
setDT(data)
data[word == "rna", .N, by = word]

   word N
1:  rna 3
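The data.table form extends to several words in the same way. A minimal sketch, assuming the words of interest sit in a character vector (dt, targets, and target_counts are illustrative names, not part of the original answer):

```r
library(data.table)

# Same toy data, built directly as a data.table
dt <- data.table(word = c("rna", "synthesis", "resembles", "copy", "choice",
                          "rna", "recombination", "process", "nascent", "rna"))

# Hypothetical list of words of interest
targets <- c("rna", "recombination")

# i: keep matching rows; j: .N is the row count per group; by: group by word
target_counts <- dt[word %in% targets, .N, by = word]
```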
