
Merging R dataframes with word counts (of unequal length) - Text mining

For my text mining assignment, I am trying to create a matrix with the word counts of three separate texts (which I have already filtered and tokenized). I now have a dataframe like this for each text:

word          count
film             82
camera           18
director         10
action            5
character         2

I also created a list of all the words in the three texts combined, with their combined word counts, but what I am trying to get is something like this:

word           text1        text2        text3
film              82           16            8
camera            18           76            3
director          10            2           91
character          2           20            0
screen             0            4           10
movie             12            0            0
action             5           23           54
dance              0            1           16

What code should I use for this? As shown in the example above, I would like to fill in the number "0" for every word that does not occur in a given text. I have about 4459 words in total, with the texts having 1804, 1522 and 1133 words respectively.

Thanks a lot in advance!

If you already have the three count tables, you just need to do a full merge of these tables and then replace the resulting NAs with 0, like this:

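library(dplyr)

set.seed(1)  # make the random example data reproducible

first <- data.frame(word = sample(letters, 10),
                    count = sample(1:100, 10))

second <- data.frame(word = sample(letters, 10),
                     count = sample(1:100, 10))

third <- data.frame(word = sample(letters, 10),
                    count = sample(1:100, 10))

# all = TRUE gives a full (outer) merge: every word from either table is kept
combined <- merge(first, second, by = "word", all = TRUE,
                  suffixes = c(".text1", ".text2"))
combined <- merge(combined, third, by = "word", all = TRUE)
names(combined)[names(combined) == "count"] <- "count.text3"

# Replace NA with 0 in the count columns only (leave the word column alone)
combined <- combined %>%
  mutate(across(starts_with("count"), ~ ifelse(is.na(.x), 0, .x)))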
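
If there are more than three texts, the same full-merge idea can be folded over a list instead of being written out one merge at a time. The following is only a sketch in base R: the list `counts`, its names `text1` to `text3`, and the small example values are assumptions made up for illustration, not the asker's data.

```r
# Hypothetical input: a named list of word-count data frames, one per text
counts <- list(
  text1 = data.frame(word = c("film", "camera", "movie"),   count = c(82, 18, 12)),
  text2 = data.frame(word = c("film", "screen"),            count = c(16, 4)),
  text3 = data.frame(word = c("camera", "screen", "dance"), count = c(3, 10, 16))
)

# Rename each "count" column after its text so the merged columns stay distinct
renamed <- Map(function(df, nm) setNames(df, c("word", nm)),
               counts, names(counts))

# Fold the list with a full outer merge on "word"
combined <- Reduce(function(a, b) merge(a, b, by = "word", all = TRUE),
                   renamed)

# Words missing from a text come back as NA; turn those into 0
combined[is.na(combined)] <- 0

combined
```

Because each count column is named before merging, no suffix handling is needed, and adding a fourth text only means adding one more element to the list.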

A solution using dplyr and tidyr (df1, df2 and df3 are the mock data frames defined at the end of this answer):

library(dplyr)
library(tidyr)

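# Full joins keep every word that appears in at least one data frame
full_join(df1, df2, by = "word", suffix = c(".text1", ".text2")) %>%
   full_join(df3, by = "word") %>%
   rename(count.text3 = count) %>%
   # Words missing from a text are NA after the join; replace them with 0
   mutate(across(count.text1:count.text3, ~ replace_na(.x, 0)))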
#>        word count.text1 count.text2 count.text3
#> 1      film          82          16           8
#> 2    camera          18          76           3
#> 3  director          10           2          91
#> 4    action           5          23          54
#> 5 character           2          20           0
#> 6    screen           0           4          10
#> 7     dance           0           1          16

Mock-up of your example data:

df1 <- data.frame(
   word = c("film", "camera", "director", "action", "character"),       
   count = c(82, 18, 10, 5, 2)
)

df2 <- data.frame(
   word = c("film", "camera", "director", "character", "screen", "action", "dance"),       
   count = c(16, 76, 2, 20, 4, 23, 1)
)

df3 <- data.frame(
   word = c("film", "camera", "director", "screen", "action", "dance"),       
   count = c(8, 3, 91, 10, 54, 16)
)
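
As an alternative to chaining joins, the tables can also be stacked into one long data frame and then spread to one column per text. This is a sketch rather than part of the answer above: it uses `dplyr::bind_rows()` and `tidyr::pivot_wider()`, and the data frames `text1` to `text3` with their values are small stand-ins made up for illustration.

```r
library(dplyr)
library(tidyr)

# Small stand-in tables (hypothetical values)
text1 <- data.frame(word = c("film", "camera"), count = c(82, 18))
text2 <- data.frame(word = c("film", "screen"), count = c(16, 4))
text3 <- data.frame(word = c("camera", "dance"), count = c(3, 16))

# Stack the tables in long format; .id records which text each row came from
long <- bind_rows(list(text1 = text1, text2 = text2, text3 = text3),
                  .id = "text")

# Spread to one column per text; missing word/text pairs are filled with 0
wide <- pivot_wider(long,
                    names_from  = text,
                    values_from = count,
                    values_fill = list(count = 0))

wide
```

Here the zero-filling happens inside `pivot_wider()`, so no separate NA replacement step is needed.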

