Computing a frequency list of utterance-final words in utterances of variable length

Question

I have a large dataframe of utterances of variable size:

df <- structure(list(size = c(2, 2, 3, 3, 4, 4, 3, 3), 
                     w1 = c("come", "why", "er", "well", "she", "well", "er", "well"), 
                     w2 = c("on","that", "i", "not", "'s", "thanks", "super", "she"), 
                     w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"), 
                     w4 = c(NA,NA, NA, NA, "on", "can", NA, NA)), 
                row.names = c(NA, -8L), class = "data.frame")

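I want to compare the utterance-initial words in w1 with the utterance-final words, which are spread across the other w columns, using frequency lists that contain counts and proportions. I can compute the frequency list of the utterance-initial words: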

library(dplyr)
df %>%
  group_by(w1) %>%
  summarise(n = n()) %>%
  mutate(prop = n / sum(n)) %>%
  arrange(desc(prop))
# A tibble: 5 x 3
  w1        n  prop
  <chr> <int> <dbl>
1 well      3 0.375
2 er        2 0.25 
3 come      1 0.125
4 she       1 0.125
5 why       1 0.125

But how can I compute the list of utterance-final words when they sit in different w columns?

Expected:

# A tibble: 5 x 3
  w_last    n  prop
  <chr> <int> <dbl>
1 can       3 0.375
2 on        2 0.25 
3 cool      1 0.125
4 that      1 0.125
5 today     1 0.125

I eventually found another solution:

df %>%
  mutate(w_last = c(apply(., 1, function(x) tail(na.omit(x), 1)))) %>%
  group_by(w_last) %>%
  summarise(n = n()) %>%
  mutate(prop = n / sum(n)) %>%
  arrange(desc(prop))

Answer 1

Three approaches in tidyverse syntax

1. You can extract the final word into a separate column, final_word, and build a prop.table on it. (Only dplyr is needed for the extraction.)

df %>% rowwise() %>%
  mutate(final_word = get(paste0('w', size))) %>%
  janitor::tabyl(final_word)

final_word n percent
        can 3   0.375
       cool 1   0.125
         on 2   0.250
       that 1   0.125
      today 1   0.125
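A vectorized alternative to the rowwise get() lookup, as a minimal sketch: dplyr::coalesce() returns the first non-missing value, so listing the word columns from last to first picks each row's final word. This assumes the word columns are exactly w1 to w4 and that NA only ever appears after the final word; the name final_freq is just illustrative.

```r
library(dplyr)

df <- data.frame(
  size = c(2, 2, 3, 3, 4, 4, 3, 3),
  w1 = c("come", "why", "er", "well", "she", "well", "er", "well"),
  w2 = c("on", "that", "i", "not", "'s", "thanks", "super", "she"),
  w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"),
  w4 = c(NA, NA, NA, NA, "on", "can", NA, NA),
  stringsAsFactors = FALSE
)

final_freq <- df %>%
  # Assumption: w4 is the last possible position; adjust the list for wider data.
  mutate(w_last = coalesce(w4, w3, w2, w1)) %>%
  count(w_last, name = "n") %>%          # one row per final word, with its count
  mutate(prop = n / sum(n)) %>%          # share of all utterances
  arrange(desc(prop))
```

Unlike the get() version, this does not use the size column at all, so it still works if size is missing or unreliable.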

2. Reshape the data a bit.

Pivot to long format.
Keep only the rows where size matches word_number.
Use janitor::tabyl() to generate your prop.table (janitor can format it further in useful ways).

df <- structure(list(size = c(2, 2, 3, 3, 4, 4, 3, 3), 
                     w1 = c("come", "why", "er", "well", "she", "well", "er", "well"), 
                     w2 = c("on","that", "i", "not", "'s", "thanks", "super", "she"), 
                     w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"), 
                     w4 = c(NA,NA, NA, NA, "on", "can", NA, NA)), 
                row.names = c(NA, -8L), class = "data.frame")


df
#>   size   w1     w2    w3   w4
#> 1    2 come     on  <NA> <NA>
#> 2    2  why   that  <NA> <NA>
#> 3    3   er      i   can <NA>
#> 4    3 well    not today <NA>
#> 5    4  she     's going   on
#> 6    4 well thanks  they  can
#> 7    3   er  super  cool <NA>
#> 8    3 well    she   can <NA>
library(tidyverse)
library(janitor)

df %>% pivot_longer(!size, values_drop_na = T) %>%
  filter(as.numeric(substr(name, 2, nchar(name))) == size) %>%
  janitor::tabyl(value)
#>  value n percent
#>    can 3   0.375
#>   cool 1   0.125
#>     on 2   0.250
#>   that 1   0.125
#>  today 1   0.125

Created on 2021-05-06 by the reprex package (v2.0.0)
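If you want the result in exactly the tibble shape shown in the question (w_last, n, prop, sorted by proportion), the pivot steps above can be sketched as follows. This is a variation, not the code from the reprex: it lets pivot_longer() strip the "w" prefix and convert the position to an integer via names_prefix and names_transform, and the column names word_number, word and final_freq are my own choices.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  size = c(2, 2, 3, 3, 4, 4, 3, 3),
  w1 = c("come", "why", "er", "well", "she", "well", "er", "well"),
  w2 = c("on", "that", "i", "not", "'s", "thanks", "super", "she"),
  w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"),
  w4 = c(NA, NA, NA, NA, "on", "can", NA, NA),
  stringsAsFactors = FALSE
)

final_freq <- df %>%
  pivot_longer(
    cols = starts_with("w"),
    names_to = "word_number",
    names_prefix = "w",                               # "w3" becomes "3"
    names_transform = list(word_number = as.integer), # and then 3L
    values_to = "word",
    values_drop_na = TRUE
  ) %>%
  filter(word_number == size) %>%        # keep only the utterance-final word
  count(w_last = word, name = "n") %>%
  mutate(prop = n / sum(n)) %>%
  arrange(desc(prop))
```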

3. By the way, you can also reverse the alignment so that words are counted from the last column, using unite and separate from tidyr

df %>% unite('W', starts_with('w'), sep = '=', na.rm = T, remove = T) %>%
  separate(W, into = paste0('w', seq_len(1 + max(str_count(.$W, '=')))), fill = 'left', sep = '=')

  size   w1     w2    w3    w4
1    2 <NA>   <NA>  come    on
2    2 <NA>   <NA>   why  that
3    3 <NA>     er     i   can
4    3 <NA>   well   not today
5    4  she     's going    on
6    4 well thanks  they   can
7    3 <NA>     er super  cool
8    3 <NA>   well   she   can
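To get from this right-aligned layout to the frequency list, you only need to count the last column. A minimal sketch, split into explicit steps; it assumes no word contains the "=" separator, and united, aligned, n_cols, last_col and final_freq are illustrative names:

```r
library(dplyr)
library(tidyr)
library(stringr)

df <- data.frame(
  size = c(2, 2, 3, 3, 4, 4, 3, 3),
  w1 = c("come", "why", "er", "well", "she", "well", "er", "well"),
  w2 = c("on", "that", "i", "not", "'s", "thanks", "super", "she"),
  w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"),
  w4 = c(NA, NA, NA, NA, "on", "can", NA, NA),
  stringsAsFactors = FALSE
)

# Step 1: glue the non-missing words of each row into one string.
united <- df %>%
  unite("W", starts_with("w"), sep = "=", na.rm = TRUE)

# Step 2: split again, padding with NA on the left,
# so the final word of every row lands in the last column.
n_cols <- 1 + max(str_count(united$W, fixed("=")))
aligned <- united %>%
  separate(W, into = paste0("w", seq_len(n_cols)), sep = "=", fill = "left")

# Step 3: tabulate the last column.
last_col <- paste0("w", n_cols)
final_freq <- aligned %>%
  count(w_last = .data[[last_col]], name = "n") %>%
  mutate(prop = n / sum(n)) %>%
  arrange(desc(prop))
```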

Answer 2

You can subset df using the row numbers (seq_len(nrow(df))) and the values in df$size, make a table, and calculate the proportions.

tt <- table(df[-1][cbind(seq_len(nrow(df)), df$size)])
cbind(tt, proportions(tt))
#      tt      
#can    3 0.375
#cool   1 0.125
#on     2 0.250
#that   1 0.125
#today  1 0.125
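In case the indexing is not obvious: a two-column matrix used as an index is read as (row, column) pairs, so each row contributes exactly one cell. A small sketch that spells this out in base R, with an added check that size really equals the number of non-missing words (the method silently depends on that). The names words, idx and w_last are illustrative.

```r
df <- data.frame(
  size = c(2, 2, 3, 3, 4, 4, 3, 3),
  w1 = c("come", "why", "er", "well", "she", "well", "er", "well"),
  w2 = c("on", "that", "i", "not", "'s", "thanks", "super", "she"),
  w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"),
  w4 = c(NA, NA, NA, NA, "on", "can", NA, NA),
  stringsAsFactors = FALSE
)

words <- as.matrix(df[-1])                 # character matrix holding w1..w4
idx <- cbind(seq_len(nrow(df)), df$size)   # one (row, column) pair per utterance

# Sanity check: size must match the number of words actually present.
stopifnot(all(rowSums(!is.na(words)) == df$size))

w_last <- words[idx]                       # picks words[i, size[i]] for every i
w_last
# "on" "that" "can" "today" "on" "can" "cool" "can"
```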

Answer 3

A base R option

out <- rev(
  stack(
    prop.table(
      table(apply(df, 1, function(x) tail(na.omit(x), 1)))
    )
  )
)

gives

    ind values
1   can  0.375
2  cool  0.125
3    on  0.250
4  that  0.125
5 today  0.125

If you want to sort the rows in descending order, you can run

> out[order(-out$values), ]
    ind values
1   can  0.375
3    on  0.250
2  cool  0.125
4  that  0.125
5 today  0.125
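Since the question asks for counts as well as proportions, here is a sketch of how the same base R idea could return both in one data frame. It applies the function to the word columns only (df[-1]), so the numeric size column is not coerced to character along the way; out2 and the column names are illustrative.

```r
df <- data.frame(
  size = c(2, 2, 3, 3, 4, 4, 3, 3),
  w1 = c("come", "why", "er", "well", "she", "well", "er", "well"),
  w2 = c("on", "that", "i", "not", "'s", "thanks", "super", "she"),
  w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"),
  w4 = c(NA, NA, NA, NA, "on", "can", NA, NA),
  stringsAsFactors = FALSE
)

# Last non-missing word of each row.
w_last <- unname(apply(df[-1], 1, function(x) tail(na.omit(x), 1)))

tab <- table(w_last)
out2 <- data.frame(
  w_last = names(tab),
  n = as.vector(tab),                   # counts
  prop = as.vector(prop.table(tab)),    # proportions
  stringsAsFactors = FALSE
)

# Sort by descending count; order() is stable, so ties stay alphabetical.
out2 <- out2[order(-out2$n), ]
```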

3 answers

Answer 1
3 accepted 2021-05-06 11:39:03

Answer 2
2 2021-05-06 11:37:20

Answer 3
1 2021-05-06 11:50:41
