
Compute frequency list of final words in utterances of variable length

I have a large dataframe with utterances of variable `size`:

df <- structure(list(size = c(2, 2, 3, 3, 4, 4, 3, 3), 
                     w1 = c("come", "why", "er", "well", "she", "well", "er", "well"), 
                     w2 = c("on","that", "i", "not", "'s", "thanks", "super", "she"), 
                     w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"), 
                     w4 = c(NA,NA, NA, NA, "on", "can", NA, NA)), 
                row.names = c(NA, -8L), class = "data.frame")

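I want to compare the utterance-initial words in `w1` with a frequency list, including counts and proportions, of all utterance-final words, which sit in different `w` columns. I can compute the frequency list of the utterance-initial words: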

library(dplyr)
df %>%
  group_by(w1) %>%
  summarise(n = n()) %>%
  mutate(prop = n / sum(n)) %>%
  arrange(desc(prop))
# A tibble: 5 x 3
  w1        n  prop
  <chr> <int> <dbl>
1 well      3 0.375
2 er        2 0.25 
3 come      1 0.125
4 she       1 0.125
5 why       1 0.125

But how can I compute the list of utterance-final words when they are in different `w` columns?

Expected:

# A tibble: 5 x 3
  w_last    n  prop
  <chr> <int> <dbl>
1 can       3 0.375
2 on        2 0.25 
3 cool      1 0.125
4 that      1 0.125
5 today     1 0.125

I eventually found another solution:

df %>%
  mutate(w_last = c(apply(., 1, function(x) tail(na.omit(x), 1)))) %>%
  group_by(w_last) %>%
  summarise(n = n()) %>%
  mutate(prop = n / sum(n)) %>%
  arrange(desc(prop))
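
One thing to keep in mind with the `apply()` approach: `apply()` converts the whole data frame to a character matrix, so the `size` column is part of every row as well. A minimal base R sketch, assuming you would rather restrict the lookup to the word columns (the names `words` and `res` are only illustrative):

```r
df <- structure(list(size = c(2, 2, 3, 3, 4, 4, 3, 3),
                     w1 = c("come", "why", "er", "well", "she", "well", "er", "well"),
                     w2 = c("on", "that", "i", "not", "'s", "thanks", "super", "she"),
                     w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"),
                     w4 = c(NA, NA, NA, NA, "on", "can", NA, NA)),
                row.names = c(NA, -8L), class = "data.frame")

# Keep only the word columns (w1..w4); `words` is an illustrative name
words <- df[startsWith(names(df), "w")]

# For each row, drop the NAs and take the last remaining word
w_last <- apply(words, 1, function(x) tail(na.omit(x), 1))

# Counts and proportions of the utterance-final words, largest count first
n <- table(w_last)
res <- data.frame(w_last = names(n),
                  n = as.vector(n),
                  prop = as.vector(n) / sum(n))
res <- res[order(-res$n), ]
res
```

With the sample data this should give the same counts as the expected output; on real data, rows whose words are all `NA` would need separate handling.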

Three approaches in tidyverse syntax

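1. You can extract the `final_word` into a separate column and build a prop.table on it (the extraction uses only dplyr; `janitor::tabyl()` builds the table)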

df %>% rowwise() %>%
  mutate(final_word = get(paste0('w', size))) %>%
  janitor::tabyl(final_word)

final_word n percent
        can 3   0.375
       cool 1   0.125
         on 2   0.250
       that 1   0.125
      today 1   0.125

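2. Reshape the data a bit.

  • Pivot the data to long format.
  • Keep only the rows where `size` matches the word number.
  • Use `janitor::tabyl()` to generate your prop.table (it can be formatted further in useful ways within janitor).
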
df <- structure(list(size = c(2, 2, 3, 3, 4, 4, 3, 3), 
                     w1 = c("come", "why", "er", "well", "she", "well", "er", "well"), 
                     w2 = c("on","that", "i", "not", "'s", "thanks", "super", "she"), 
                     w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"), 
                     w4 = c(NA,NA, NA, NA, "on", "can", NA, NA)), 
                row.names = c(NA, -8L), class = "data.frame")


df
#>   size   w1     w2    w3   w4
#> 1    2 come     on  <NA> <NA>
#> 2    2  why   that  <NA> <NA>
#> 3    3   er      i   can <NA>
#> 4    3 well    not today <NA>
#> 5    4  she     's going   on
#> 6    4 well thanks  they  can
#> 7    3   er  super  cool <NA>
#> 8    3 well    she   can <NA>
library(tidyverse)
library(janitor)

df %>% pivot_longer(!size, values_drop_na = T) %>%
  filter(as.numeric(substr(name, 2, nchar(name))) == size) %>%
  janitor::tabyl(value)
#>  value n percent
#>    can 3   0.375
#>   cool 1   0.125
#>     on 2   0.250
#>   that 1   0.125
#>  today 1   0.125

Created on 2021-05-06 by the reprex package (v2.0.0)
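
The `filter()` step relies on turning a column name such as `w3` back into the number 3. A small base R sketch of just that conversion, using made-up column names (the `w10` entry is hypothetical and only there to show that multi-digit suffixes are handled too):

```r
# Illustrative column names; only w1..w4 exist in the question's data
name <- c("w1", "w4", "w10")

# Drop the leading "w" by taking characters 2..end, then convert to numeric
word_number <- as.numeric(substr(name, 2, nchar(name)))
word_number
```

Comparing `word_number` with `size` is what keeps exactly one row per utterance after pivoting.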


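3. By the way, you can also deliberately reverse the alignment so that the final words all end up in the last column and can be counted from there, using `unite` and `separate` from tidyr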

df %>% unite('W', starts_with('w'), sep = '=', na.rm = T, remove = T) %>%
  separate(W, into = paste0('w', seq_len(1 + max(str_count(.$W, '=')))), fill = 'left', sep = '=')

  size   w1     w2    w3    w4
1    2 <NA>   <NA>  come    on
2    2 <NA>   <NA>   why  that
3    3 <NA>     er     i   can
4    3 <NA>   well   not today
5    4  she     's going    on
6    4 well thanks  they   can
7    3 <NA>     er super  cool
8    3 <NA>   well   she   can
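
The output above only right-aligns the words. One possible continuation from there to counts and proportions is to count the last column. This is a sketch, assuming dplyr and tidyr are installed; it uses `max(df$size)` for the number of columns instead of `str_count()`, and `n_col`, `last_col` and `res` are illustrative names:

```r
library(dplyr)
library(tidyr)

df <- structure(list(size = c(2, 2, 3, 3, 4, 4, 3, 3),
                     w1 = c("come", "why", "er", "well", "she", "well", "er", "well"),
                     w2 = c("on", "that", "i", "not", "'s", "thanks", "super", "she"),
                     w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"),
                     w4 = c(NA, NA, NA, NA, "on", "can", NA, NA)),
                row.names = c(NA, -8L), class = "data.frame")

n_col <- max(df$size)             # longest utterance decides the column count
last_col <- paste0("w", n_col)    # after right-aligning, this column holds the final word

res <- df %>%
  # Glue the non-NA words of each row together ...
  unite("W", starts_with("w"), sep = "=", na.rm = TRUE) %>%
  # ... and split them again, padding with NA on the left
  separate(W, into = paste0("w", seq_len(n_col)), sep = "=", fill = "left") %>%
  # Count the final words and add proportions
  count(w_last = .data[[last_col]]) %>%
  mutate(prop = n / sum(n)) %>%
  arrange(desc(n))
res
```

Note that this assumes no word contains the separator `=`; pick a different separator if that could happen in your data.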

You can subset `df` using the row numbers (`seq_len(nrow(df))`) together with the values in `df$size`, then make a `table` and compute the `proportions`

tt <- table(df[-1][cbind(seq_len(nrow(df)), df$size)])
cbind(tt, proportions(tt))
#      tt      
#can    3 0.375
#cool   1 0.125
#on     2 0.250
#that   1 0.125
#today  1 0.125
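
The subsetting works because indexing with a two-column matrix picks one cell per row of that matrix: the first column gives the row, the second gives the column. A minimal sketch on the same data that makes the intermediate steps visible (the names `words`, `idx` and `res` are only illustrative):

```r
df <- structure(list(size = c(2, 2, 3, 3, 4, 4, 3, 3),
                     w1 = c("come", "why", "er", "well", "she", "well", "er", "well"),
                     w2 = c("on", "that", "i", "not", "'s", "thanks", "super", "she"),
                     w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"),
                     w4 = c(NA, NA, NA, NA, "on", "can", NA, NA)),
                row.names = c(NA, -8L), class = "data.frame")

# Word columns only, as a character matrix
words <- as.matrix(df[-1])

# One (row, column) pair per utterance: row i, column size[i]
idx <- cbind(seq_len(nrow(df)), df$size)

# Matrix indexing returns one cell per pair, i.e. the final word of each row
w_last <- words[idx]

# Counts and proportions as a data frame, largest count first
tt <- table(w_last)
res <- data.frame(w_last = names(tt),
                  n = as.vector(tt),
                  prop = as.vector(tt) / sum(tt))
res <- res[order(-res$n), ]
res
```

This relies on `size` being correct for every row; it does not look at the `NA`s at all.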

A base R option

out <- rev(
  stack(
    prop.table(
      table(apply(df, 1, function(x) tail(na.omit(x), 1)))
    )
  )
)

    ind values
1   can  0.375
2  cool  0.125
3    on  0.250
4  that  0.125
5 today  0.125

If you want to sort the rows in descending order, you can run

> out[order(-out$values), ]
    ind values
1   can  0.375
3    on  0.250
2  cool  0.125
4  that  0.125
5 today  0.125
