
Counting words in “lines” tokens

I'm completely new to R, so this question may seem obvious. However, I didn't manage to solve it myself and didn't find a solution.

How can I count the number of words within my tokens when the tokens are lines (reviews, actually)? There is a dataset with reviews (reviewText) linked to product IDs (asin).
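
For illustration, a minimal stand-in for such a dataset might look like this (the column names asin and reviewText are the ones from the question; the values are invented):

library(dplyr)

# hypothetical stand-in for the reviews dataset described above
amazonr_tidy_sent <- tibble(
    asin = c("B001", "B001", "B002"),
    reviewText = c("Great product, works well.",
                   "Battery died after a week.",
                   "Exactly as described.")
)

This is how I tokenize the reviews into lines and remove stopwords: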

amazonr_tidy_sent = amazonr_tidy_sent %>%
    unnest_tokens(word, reviewText, token = "lines")
amazonr_tidy_sent = amazonr_tidy_sent %>%
    anti_join(stop_words) %>%
    ungroup()

I tried to do it in the following way:

wordcounts <- amazonr_tidy_sent %>% group_by(word, asin) %>% summarize(word = n())

but it was not appropriate. I assume that there is no way to count, because a line as a token cannot be "separated".

Thanks a lot.

You can use unnest_tokens() more than once, if it is appropriate to your analysis.

First, you can use unnest_tokens() to get the lines that you want. Notice that I am adding a column to keep track of the id of each line; you could call that whatever you want, but the important thing is to have a column that notes which line you are on.

library(tidytext)
library(dplyr)
library(janeaustenr)


d <- tibble(txt = prideprejudice)

d_lines <- d %>%
    unnest_tokens(line, txt, token = "lines") %>%
    mutate(id = row_number())

d_lines

#> # A tibble: 10,721 × 2
#>                                                                        line
#>                                                                       <chr>
#>  1                                                      pride and prejudice
#>  2                                                           by jane austen
#>  3                                                                chapter 1
#>  4  it is a truth universally acknowledged, that a single man in possession
#>  5                            of a good fortune, must be in want of a wife.
#>  6   however little known the feelings or views of such a man may be on his
#>  7 first entering a neighbourhood, this truth is so well fixed in the minds
#>  8 of the surrounding families, that he is considered the rightful property
#>  9                                 of some one or other of their daughters.
#> 10 "my dear mr. bennet," said his lady to him one day, "have you heard that
#> # ... with 10,711 more rows, and 1 more variables: id <int>

Now you can use unnest_tokens() again, but this time with words, so that you will get a row for each word. Notice that you still know which line each word came from.

d_words <- d_lines %>%
    unnest_tokens(word, line, token = "words")

d_words
#> # A tibble: 122,204 × 2
#>       id      word
#>    <int>     <chr>
#>  1     1     pride
#>  2     1       and
#>  3     1 prejudice
#>  4     2        by
#>  5     2      jane
#>  6     2    austen
#>  7     3   chapter
#>  8     3         1
#>  9     4        it
#> 10     4        is
#> # ... with 122,194 more rows

Now you can do any kind of counting you want. For example, maybe you want to know how many words each line had in it?

d_words %>%
    count(id)

#> # A tibble: 10,715 × 2
#>       id     n
#>    <int> <int>
#>  1     1     3
#>  2     2     3
#>  3     3     2
#>  4     4    12
#>  5     5    11
#>  6     6    15
#>  7     7    13
#>  8     8    11
#>  9     9     8
#> 10    10    15
#> # ... with 10,705 more rows
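
Applied to the review data from the question, the same two-step approach would look roughly like this. This is only a sketch: it reuses the packages loaded above and assumes amazonr_tidy_sent has the asin and reviewText columns described in the question; review_words and line_id are names chosen here.

# sketch: reviews -> lines -> words, keeping asin for per-product counts
review_words <- amazonr_tidy_sent %>%
    unnest_tokens(line, reviewText, token = "lines") %>%
    mutate(line_id = row_number()) %>%
    unnest_tokens(word, line, token = "words") %>%
    anti_join(stop_words)    # tokens are single words now, so stopword removal works

review_words %>% count(line_id)    # how many words in each line
review_words %>% count(asin)       # how many words for each product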

By splitting each line with strsplit() we can count the number of words per line.

Some example data (containing newlines and stopwords):

library(dplyr)
library(tidytext)

d <- tibble(reviewText = c('1 2 3 4 5 able', '1 2\n3 4 5\n6\n7\n8\n9 10 above', '1!2', '1',
                           '!', '', '\n', '1', 'able able', 'above above', 'able', 'above'),
            asin = rep(letters, each = 2, length.out = length(reviewText)))

# tokenize the reviews into lines; by_line is used by the counting code below
by_line <- d %>%
    unnest_tokens(word, reviewText, token = "lines")

Counting the number of words:

by_line %>%
    group_by(asin) %>%
    summarize(word = sum(sapply(strsplit(word, '\\s'), length)))

   asin  word
  <chr> <int>
1     a    17
2     b     2
3     c     1
4     d     1
5     e     4

Note: in your original code most stopwords will not be removed, because you split the data by line. Only lines that consist of exactly a single stopword will be removed.
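
You can see this directly on the example data: anti_join() compares whole tokens, and with token = "lines" each token is an entire line. A quick sketch:

by_line %>%
    anti_join(stop_words)
# only the reviews that consist of exactly "able" or "above" are dropped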

To exclude stopwords from the word count, use this:

by_line %>%
    group_by(asin) %>%
    summarize(word = word %>% strsplit('\\s') %>%
                  lapply(setdiff, y = stop_words$word) %>% sapply(length) %>% sum)

   asin  word
  <chr> <int>
1     a    15
2     b     2
3     c     1
4     d     1
5     e     0
6     f     0
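
An alternative is a second unnest_tokens() pass into words, as in the first answer, followed by anti_join(); a sketch using the same example data d. Note that the two approaches are not guaranteed to agree: unnest_tokens() also splits on punctuation (so '1!2' becomes two word tokens), while the setdiff() call above additionally drops repeated words within a single line.

# sketch: word-level tokenization instead of strsplit()
d %>%
    unnest_tokens(line, reviewText, token = "lines") %>%
    unnest_tokens(word, line, token = "words") %>%
    anti_join(stop_words) %>%
    count(asin)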
