
How to Combine Multiple Rows Into One Using TidyText

I am looking at a novel and want to search for the appearance of characters' names throughout the book. Some characters go by different names. For example, the character "Sissy Jupe" goes by both "Sissy" and "Jupe". I want to combine the two rows of word counts into one so I can see the tally for "Sissy Jupe".

I've tried sum, rbind, merge, and other approaches suggested on the message boards, but nothing seems to work. There are lots of great examples, but none has worked for me.

library(tidyverse) 
library(gutenbergr)
library(tidytext)

ht <- gutenberg_download(786)

ht_chap <- ht %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE))))

tidy_ht <- ht_chap %>%
  unnest_tokens(word, text) %>%
  mutate(word = str_extract(word, "[a-z']+")) # keeps only the first run of letters and apostrophes; strips e.g. underscores

ht_count <- tidy_ht %>%
  group_by(chapter) %>%
  count(word, sort = TRUE) %>%
  ungroup() %>%
  complete(chapter, word,
           fill = list(n = 0)) 

gradgrind <- filter(ht_count, word == "gradgrind")
bounderby <- filter(ht_count, word == "bounderby")
sissy <- filter(ht_count, word == "sissy")

## TEST
sissy_jupe <- ht_count %>% 
  filter(word %in% c("sissy", "jupe"))

I want a single "word" item called "sissy_jupe" that tallies the n by chapter. This is close, but not it.

# A tibble: 76 x 3
   chapter word      n
     <int> <chr> <dbl>
 1       0 jupe      0
 2       0 sissy     1
 3       1 jupe      0
 4       1 sissy     0
 5       2 jupe      5
 6       2 sissy     9
 7       3 jupe      3
 8       3 sissy     1
 9       4 jupe      1
10       4 sissy     0
# … with 66 more rows

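The code below should give you the needed output, where df is the filtered tibble from the question (sissy_jupe).

library(tidyverse)
df <- sissy_jupe  # filtered tibble from the question
df %>% group_by(chapter) %>% 
  mutate(n = sum(n),                              # total of both names per chapter
         word = paste(word, collapse = "_")) %>%  # label built in row order, i.e. "jupe_sissy"
  distinct(chapter, .keep_all = TRUE) %>%         # keep one row per chapter
  ungroup()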
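
The same per-chapter collapse can also be written with summarise(), which returns one row per group directly. The following is only a minimal sketch on toy data: the tibble copies the first six rows of the output printed in the question, and the name sissy_jupe_total is introduced here for illustration. It assumes dplyr 1.0.0 or later for the .groups argument.

```r
library(dplyr)

# Toy stand-in for the filtered tibble from the question
# (values copied from the first rows of the printed output).
sissy_jupe <- tibble(
  chapter = c(0L, 0L, 1L, 1L, 2L, 2L),
  word    = c("jupe", "sissy", "jupe", "sissy", "jupe", "sissy"),
  n       = c(0, 1, 0, 0, 5, 9)
)

# Collapse the two name rows in each chapter into a single combined row.
sissy_jupe_total <- sissy_jupe %>%
  group_by(chapter) %>%
  summarise(word = "sissy_jupe", n = sum(n), .groups = "drop")

print(sissy_jupe_total)
```

Because summarise() already reduces each group to one row, no distinct() step is needed, and the label can be set to any fixed string rather than being pasted together from the group's words.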

Welcome to Stack Overflow, Tom. Here's an idea:

Basically: (1) find "sissy" or "jupe" in the tidied tibble and replace each with "sissy_jupe", (2) create ht_count as you did, and (3) filter for the combined word and print the results:

library(tidyverse) 
library(gutenbergr)
library(tidytext)

ht <- gutenberg_download(786)

ht_chap <- ht %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE))))

tidy_ht <- ht_chap %>%
  unnest_tokens(word, text) %>%
  mutate(word = str_extract(word, "[a-z']+")) # keeps only the first run of letters and apostrophes; strips e.g. underscores

# NEW CODE START
tidy_ht <- tidy_ht %>%
  mutate(word = str_replace_all(word, "sissy|jupe", replacement = "sissy_jupe"))
# END NEW CODE

ht_count <- tidy_ht %>%
  group_by(chapter) %>%
  count(word, sort = TRUE) %>%
  ungroup() %>%
  complete(chapter, word,
           fill = list(n = 0))

# NEW CODE
sissy_jupe <- ht_count %>% 
  filter(str_detect(word, "sissy_jupe"))
# END

... produces ...

# A tibble: 38 x 3
   chapter word           n
     <int> <chr>      <dbl>
 1       0 sissy_jupe     1
 2       1 sissy_jupe     0
 3       2 sissy_jupe    14
 4       3 sissy_jupe     4
 5       4 sissy_jupe     1
 6       5 sissy_jupe     5
 7       6 sissy_jupe    20
 8       7 sissy_jupe     7
 9       8 sissy_jupe     2
10       9 sissy_jupe    38
# ... with 28 more rows
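
One thing to watch with a pattern like "sissy|jupe": it matches anywhere inside a token, so a token that merely contains either name (a possessive such as "sissy's", for instance) would be rewritten as well. If only exact tokens should be merged, a lookup on the whole word is one option. The sketch below runs on toy data rather than the novel; the aliases vector, the toy tibble, and the names recoded and alias_counts are all hypothetical and introduced here for illustration.

```r
library(dplyr)

# Hypothetical alias table: names are tokens as they appear in the text,
# values are the combined label to count them under.
aliases <- c(sissy = "sissy_jupe", jupe = "sissy_jupe")

# Toy tokens standing in for the chapter and word columns of tidy_ht.
toy <- tibble(
  chapter = c(1L, 1L, 1L, 2L, 2L),
  word    = c("sissy", "jupe", "sissy's", "jupe", "gradgrind")
)

# Recode only tokens that match an alias exactly; leave all others untouched.
recoded <- toy %>%
  mutate(word = if_else(word %in% names(aliases),
                        unname(aliases[word]),
                        word))

# Tally the combined label per chapter.
alias_counts <- recoded %>%
  filter(word == "sissy_jupe") %>%
  count(chapter, word)

print(alias_counts)
```

The same alias table could be extended with further entries for other characters who go by several names. Whether possessives should count toward a character's tally is a judgment call; under this sketch they are left as separate tokens.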

Don't forget to upvote / click on the checkmark if any of our solutions helped you (feedback = better coders).


 