Merging R dataframes with word counts (of unequal length) - Text mining

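For my text mining task, I am trying to build a matrix of word counts for three separate texts (which I have already filtered and tokenized). I currently have a data frame like this for each text: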

word          count
film             82
camera           18
director         10
action            5
character         2

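I have also created a list that combines all the words from the three texts, together with their combined word counts, but what I am trying to get to is this: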

word          text1   text2   text3
film             82      16       8
camera           18      76       3
director         10       2      91
character         2      20       0
screen            0       4      10
movie            12       0       0
action            5      23      54
dance             0       1      16

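What code should I use for this? As in the example above, I want to fill in a 0 for every word that does not occur in a given text. In total I have about 4459 words; the individual texts have 1804, 1522 and 1133 words respectively.

Thanks a lot!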

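If you already have the counts in three tables, you only need to do a full (outer) merge of those tables and then replace the resulting NAs with 0.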

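library(dplyr)

# Three example count tables (random letters stand in for words)
first <- data.frame(word = sample(letters, 10),
                    count = sample(1:100, 10))

second <- data.frame(word = sample(letters, 10),
                     count = sample(1:100, 10))

third <- data.frame(word = sample(letters, 10),
                    count = sample(1:100, 10))

# all = TRUE keeps every word that occurs in at least one table
combined <- merge(first, second, by = "word", all = TRUE)
combined <- merge(combined, third, by = "word", all = TRUE)

# Replace NA with 0 in the count columns only, leaving the word column untouched
combined %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.x), 0, .x)))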
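
The same full-merge idea can be sketched for any number of tables with base R's `Reduce()`. This is only an illustration: the list name `texts`, the column names `text1` to `text3`, and the shortened toy tables (a few values borrowed from the question's example) are assumptions, not part of the original answer.

```r
# Hypothetical list of per-text count tables (toy values for illustration)
texts <- list(
  text1 = data.frame(word = c("film", "camera", "director", "action", "character"),
                     count = c(82, 18, 10, 5, 2)),
  text2 = data.frame(word = c("film", "camera", "screen"),
                     count = c(16, 76, 4)),
  text3 = data.frame(word = c("film", "dance"),
                     count = c(8, 16))
)

# Rename each count column after its text so the merged columns stay distinct
named <- Map(function(df, nm) setNames(df, c("word", nm)), texts, names(texts))

# Full outer merge across the whole list, then zero-fill the missing counts
combined <- Reduce(function(x, y) merge(x, y, by = "word", all = TRUE), named)
combined[is.na(combined)] <- 0
combined
```

Because each count column is renamed before merging, this avoids the `count.x` / `count.y` suffixes that `merge()` would otherwise generate.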

A solution using dplyr and tidyr

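library(dplyr)
library(tidyr)

# df1, df2 and df3 are the per-text count tables defined in the sample data below
full_join(df1, df2, by = "word", suffix = c(".text1", ".text2")) %>%
   full_join(df3, by = "word") %>%
   # the third count column has no name clash, so it needs an explicit rename
   rename(count.text3 = count) %>%
   # words missing from a text come back as NA; turn them into 0
   mutate(across(count.text1:count.text3, ~ replace_na(.x, 0)))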
#>        word count.text1 count.text2 count.text3
#> 1      film          82          16           8
#> 2    camera          18          76           3
#> 3  director          10           2          91
#> 4    action           5          23          54
#> 5 character           2          20           0
#> 6    screen           0           4          10
#> 7     dance           0           1          16
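
Since the question asks for a matrix rather than a data frame, the merged table can be converted in one more step. A minimal sketch, assuming a merged table shaped like the output above; the object names `counts` and `m` and the three-row table are hypothetical and used only for illustration.

```r
# Hypothetical merged table with the same column layout as the output above
counts <- data.frame(
  word = c("film", "camera", "screen"),
  count.text1 = c(82, 18, 0),
  count.text2 = c(16, 76, 4),
  count.text3 = c(8, 3, 10)
)

# Drop the word column, coerce the counts to a numeric matrix,
# and keep the words as row names
m <- as.matrix(counts[, -1])
rownames(m) <- counts$word

# Look up a single count by word and text
m["screen", "count.text2"]
```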

Sample data mimicking your example

df1 <- data.frame(
   word = c("film", "camera", "director", "action", "character"),       
   count = c(82, 18, 10, 5, 2)
)

df2 <- data.frame(
   word = c("film", "camera", "director", "character", "screen", "action", "dance"),       
   count = c(16, 76, 2, 20, 4, 23, 1)
)

df3 <- data.frame(
   word = c("film", "camera", "director", "screen", "action", "dance"),       
   count = c(8, 3, 91, 10, 54, 16)
)
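
With more than a handful of texts, chaining joins gets unwieldy. An alternative sketch, assuming tidyr 1.0 or later for `pivot_wider()`, stacks the tables in long form and then spreads them into one column per text. The labels `text1` to `text3` and the result name `wide` are illustrative choices, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Same sample tables as above, repeated so the snippet runs on its own
df1 <- data.frame(
  word = c("film", "camera", "director", "action", "character"),
  count = c(82, 18, 10, 5, 2)
)
df2 <- data.frame(
  word = c("film", "camera", "director", "character", "screen", "action", "dance"),
  count = c(16, 76, 2, 20, 4, 23, 1)
)
df3 <- data.frame(
  word = c("film", "camera", "director", "screen", "action", "dance"),
  count = c(8, 3, 91, 10, 54, 16)
)

# Stack the tables, recording which text each row came from in a "text" column
long <- bind_rows(list(text1 = df1, text2 = df2, text3 = df3), .id = "text")

# One column per text; word/text combinations that never occur are filled with 0
wide <- pivot_wider(long, names_from = text, values_from = count, values_fill = 0)
wide
```

Adding a fourth text then only means adding one more entry to the list, with no extra join or rename step.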

