For my text mining assignment, I am trying to create a matrix with the word counts of three separate texts (which I have already filtered and tokenized). I now have a data frame like this for each text:
word count
film 82
camera 18
director 10
action 5
character 2
I also created a list of all the words in the three texts combined, with their combined word counts, but I am trying to get something like this:
word       text1  text2  text3
film          82     16      8
camera        18     76      3
director      10      2     91
character      2     20      0
screen         0      4     10
movie         12      0      0
action         5     23     54
dance          0      1     16
What code should I use for this? As shown in the example above, wherever a word does not occur in a text I would like to fill in the number "0". I have about 4459 words in total, with the texts having 1804, 1522 and 1133 words respectively.
Thanks a lot in advance!
If you have already counted the words in three separate tables, you just need to do a full merge of those tables and replace the NAs with 0 afterwards, like this:
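library(dplyr)

# Mock data: three word-count tables with random words and counts
first <- data.frame(word = sample(letters, 10),
                    count = sample(1:100, 10))
second <- data.frame(word = sample(letters, 10),
                     count = sample(1:100, 10))
third <- data.frame(word = sample(letters, 10),
                    count = sample(1:100, 10))

# Full outer merge keeps every word; suffixes label the first two count columns
combined <- merge(first, second, by = "word", all = TRUE,
                  suffixes = c(".text1", ".text2"))
combined <- merge(combined, third, by = "word", all = TRUE)

# Words missing from a text come back as NA; replace those with 0.
# Only numeric columns are touched, so the word column is left alone.
combined %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.x), 0, .x)))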
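If you end up with more than three texts, the same full merge can be folded over a list. This is only a minimal base R sketch: the list `tables`, its names and the tiny example counts are made up for illustration and are not your real data.

```r
# Hypothetical named list of count tables; the names become the column labels
tables <- list(
  text1 = data.frame(word = c("film", "camera"), count = c(82, 18)),
  text2 = data.frame(word = c("film", "screen"), count = c(16, 4)),
  text3 = data.frame(word = c("camera", "dance"), count = c(3, 16))
)

# Rename each count column after its text so the merged columns stay distinct
renamed <- Map(function(df, nm) {
  names(df)[names(df) == "count"] <- nm
  df
}, tables, names(tables))

# Fold the list with a full outer merge on "word"
combined <- Reduce(function(x, y) merge(x, y, by = "word", all = TRUE), renamed)

# Words absent from a text show up as NA; set those counts to 0
combined[is.na(combined)] <- 0
combined
```

Renaming before merging avoids the `count.x` / `count.y` suffixes that `merge()` would otherwise generate.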
A solution using dplyr and tidyr:
library(dplyr)
library(tidyr)
full_join(df1, df2, by = "word", suffix = c(".text1", ".text2")) %>%
  full_join(., df3, by = "word") %>%
  rename(count.text3 = count) %>%
  mutate_at(vars(count.text1:count.text3), tidyr::replace_na, 0)
#>        word count.text1 count.text2 count.text3
#> 1      film          82          16           8
#> 2    camera          18          76           3
#> 3  director          10           2          91
#> 4    action           5          23          54
#> 5 character           2          20           0
#> 6    screen           0           4          10
#> 7     dance           0           1          16
Mocking up your example data:
df1 <- data.frame(
  word = c("film", "camera", "director", "action", "character"),
  count = c(82, 18, 10, 5, 2)
)
df2 <- data.frame(
  word = c("film", "camera", "director", "character", "screen", "action", "dance"),
  count = c(16, 76, 2, 20, 4, 23, 1)
)
df3 <- data.frame(
  word = c("film", "camera", "director", "screen", "action", "dance"),
  count = c(8, 3, 91, 10, 54, 16)
)
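Since the question asks for a matrix, here is an alternative sketch in base R that stacks the tables in long format and cross-tabulates them with `xtabs()`. It reuses the mocked-up counts from above; the column name `text` and the labels `text1` to `text3` are my own choice, not something taken from your data.

```r
# Same mock data as above, repeated so the snippet runs on its own
df1 <- data.frame(
  word = c("film", "camera", "director", "action", "character"),
  count = c(82, 18, 10, 5, 2)
)
df2 <- data.frame(
  word = c("film", "camera", "director", "character", "screen", "action", "dance"),
  count = c(16, 76, 2, 20, 4, 23, 1)
)
df3 <- data.frame(
  word = c("film", "camera", "director", "screen", "action", "dance"),
  count = c(8, 3, 91, 10, 54, 16)
)

# Stack the tables and tag each row with the text it came from
long <- rbind(
  cbind(df1, text = "text1"),
  cbind(df2, text = "text2"),
  cbind(df3, text = "text3")
)

# Sum count for every word/text pair; pairs that never occur get 0
counts <- xtabs(count ~ word + text, data = long)
counts

# Individual cells can be looked up by word and text
counts["film", "text1"]
counts["character", "text3"]
```

The result is a contingency table with one row per word and one column per text, so no separate NA replacement step is needed. If a plain data frame is preferred, `as.data.frame.matrix(counts)` should convert it, with the words as row names.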