Merging R dataframes with word counts (of unequal length) - Text mining
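For my text mining task, I am trying to create a matrix containing the word counts of three separate texts (which I have already filtered and tokenized). For each text I now have a data frame like this: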
word count
film 82
camera 18
director 10
action 5
character 2
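I have also created a list that combines all the words of the three texts together with their combined counts, but what I am trying to get is this: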
word text1 text2 text3
film 82 16 8
camera 18 76 3
director 10 2 91
character 2 20 0
screen 0 4 10
movie 12 0 0
action 5 23 54
dance 0 1 16
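What code should I use for this? As the example above shows, I want to fill in 0 for every word that does not occur in a given text. I have about 4459 words in total; the three texts contain 1804, 1522 and 1133 words respectively.
Many thanks!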
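If you already have the three count tables, you only need to do a full (outer) merge of those tables and then replace the resulting NAs with 0. Something like: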
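library(dplyr)

# Three example tables of word counts
first <- data.frame(word = sample(letters, 10),
                    count = sample(1:100, 10))
second <- data.frame(word = sample(letters, 10),
                     count = sample(1:100, 10))
third <- data.frame(word = sample(letters, 10),
                    count = sample(1:100, 10))

# Full outer merges keep every word; the count columns end up as count.x, count.y and count
combined <- merge(first, second, by = "word", all = TRUE)
combined <- merge(combined, third, by = "word", all = TRUE)

# Replace NA (word absent from a text) with 0, touching only the numeric columns
combined %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.x), 0, .x)))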
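The two merge() calls above leave the count columns named count.x, count.y and count, which gets hard to read once there are more texts. A minimal base R sketch of the same full-merge idea, assuming the per-text tables are kept in a named list: the list names text1 to text3 and the helper rename_count are illustrative and not from the original post, and the data is a small subset of the example values.

```r
# Per-text word counts in a named list (illustrative subset of the example data).
texts <- list(
  text1 = data.frame(word = c("film", "camera", "director", "action", "character"),
                     count = c(82, 18, 10, 5, 2)),
  text2 = data.frame(word = c("film", "camera", "screen"),
                     count = c(16, 76, 4)),
  text3 = data.frame(word = c("film", "dance"),
                     count = c(8, 16))
)

# Hypothetical helper: rename the "count" column to the name of its text.
rename_count <- function(df, name) {
  names(df)[names(df) == "count"] <- name
  df
}
named <- Map(rename_count, texts, names(texts))

# Fold the list with a full outer merge on "word".
combined <- Reduce(function(x, y) merge(x, y, by = "word", all = TRUE), named)

# Words missing from a text come back as NA; replace them with 0.
combined[is.na(combined)] <- 0
combined
```

Because the list is folded with Reduce(), the same code should work unchanged for any number of texts.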
A solution using dplyr and tidyr:
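library(dplyr)
library(tidyr)

# df1, df2 and df3 are the per-text tables defined in the sample data further down
full_join(df1, df2, by = "word", suffix = c(".text1", ".text2")) %>%
  full_join(df3, by = "word") %>%
  rename(count.text3 = count) %>%
  # words missing from a text are NA after the joins; set them to 0
  mutate(across(starts_with("count"), ~ replace_na(.x, 0)))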
#> word count.text1 count.text2 count.text3
#> 1 film 82 16 8
#> 2 camera 18 76 3
#> 3 director 10 2 91
#> 4 action 5 23 54
#> 5 character 2 20 0
#> 6 screen 0 4 10
#> 7 dance 0 1 16
Sample data simulating your example:
df1 <- data.frame(
  word = c("film", "camera", "director", "action", "character"),
  count = c(82, 18, 10, 5, 2)
)
df2 <- data.frame(
  word = c("film", "camera", "director", "character", "screen", "action", "dance"),
  count = c(16, 76, 2, 20, 4, 23, 1)
)
df3 <- data.frame(
  word = c("film", "camera", "director", "screen", "action", "dance"),
  count = c(8, 3, 91, 10, 54, 16)
)
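As an alternative to chaining joins, the tables can be stacked in long form and then spread into one column per text. This is a different technique from the answer above, shown only as a sketch. The labels text1 to text3 and the object names wide and counts are assumptions for illustration, and the data is a shortened version of the example. Since the question asks for a matrix, the last step converts the result into one.

```r
library(dplyr)
library(tidyr)

# Shortened per-text tables (illustrative subset of the example data).
df1 <- data.frame(word = c("film", "camera", "action"), count = c(82, 18, 5))
df2 <- data.frame(word = c("film", "screen", "action"), count = c(16, 4, 23))
df3 <- data.frame(word = c("film", "screen", "dance"),  count = c(8, 10, 16))

# Stack the tables, recording the source text in a "text" column,
# then spread to one column per text, filling absent words with 0.
wide <- bind_rows(text1 = df1, text2 = df2, text3 = df3, .id = "text") %>%
  pivot_wider(names_from = text, values_from = count,
              values_fill = list(count = 0))

# Optional: a numeric matrix with the words as row names.
counts <- as.matrix(wide[, -1])
rownames(counts) <- wide$word
counts
```

With this shape, adding a fourth text only means passing one more named table to bind_rows(); no extra join or rename step is needed.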