Compute frequency list of final words in utterances of variable length
I have a large dataframe of utterances of variable size:
df <- structure(list(size = c(2, 2, 3, 3, 4, 4, 3, 3),
w1 = c("come", "why", "er", "well", "she", "well", "er", "well"),
w2 = c("on","that", "i", "not", "'s", "thanks", "super", "she"),
w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"),
w4 = c(NA,NA, NA, NA, "on", "can", NA, NA)),
row.names = c(NA, -8L), class = "data.frame")
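I want to compare the utterance-initial words in w1 with the utterance-final words, which sit in different w columns, using frequency lists that include counts and proportions. I can compute the frequency list for the utterance-initial words: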
library(dplyr)
df %>%
group_by(w1) %>%
summarise(n = n()) %>%
mutate(prop = n / sum(n)) %>%
arrange(desc(prop))
# A tibble: 5 x 3
w1 n prop
<chr> <int> <dbl>
1 well 3 0.375
2 er 2 0.25
3 come 1 0.125
4 she 1 0.125
5 why 1 0.125
But how can I compute the same list for the utterance-final words when they are in different w columns?
Expected:
# A tibble: 5 x 3
w_last n prop
<chr> <int> <dbl>
1 can 3 0.375
2 on 2 0.25
3 cool 1 0.125
4 that 1 0.125
5 today 1 0.125
I eventually arrived at another solution:
df %>%
mutate(w_last = c(apply(., 1, function(x) tail(na.omit(x), 1)))) %>%
group_by(w_last) %>%
summarise(n = n()) %>%
mutate(prop = n / sum(n)) %>%
arrange(desc(prop))
Three approaches in tidyverse syntax
1. You can extract final_word into a separate column and build a prop.table on it (the extraction uses dplyr only).
df %>% rowwise() %>%
mutate(final_word = get(paste0('w', size))) %>%
janitor::tabyl(final_word)
final_word n percent
can 3 0.375
cool 1 0.125
on 2 0.250
that 1 0.125
today 1 0.125
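As a possible alternative to rowwise() and get(), here is a minimal sketch that uses dplyr::coalesce() instead. This is a swapped-in technique, not the method shown above: it does not read the size column and assumes only that the words fill the w columns from left to right. The names final_freq and final_word are illustrative.

```r
library(dplyr)

df <- data.frame(
  size = c(2, 2, 3, 3, 4, 4, 3, 3),
  w1 = c("come", "why", "er", "well", "she", "well", "er", "well"),
  w2 = c("on", "that", "i", "not", "'s", "thanks", "super", "she"),
  w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"),
  w4 = c(NA, NA, NA, NA, "on", "can", NA, NA),
  stringsAsFactors = FALSE
)

final_freq <- df %>%
  # coalesce() returns, per row, the first non-missing value among its
  # arguments, so listing the columns from last to first picks the final word
  mutate(final_word = coalesce(w4, w3, w2, w1)) %>%
  count(final_word, name = "n") %>%
  mutate(prop = n / sum(n)) %>%
  arrange(desc(n))

final_freq
```

The columns are listed by hand here; with many w columns it may be more convenient to build the call programmatically, which this sketch does not attempt.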
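2. Restructure the data a bit: pivot it to long format, keep only the rows where size matches the word number, and let janitor::tabyl() generate your prop.table (which can be formatted further in useful ways with janitor).
df <- structure(list(size = c(2, 2, 3, 3, 4, 4, 3, 3),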
w1 = c("come", "why", "er", "well", "she", "well", "er", "well"),
w2 = c("on","that", "i", "not", "'s", "thanks", "super", "she"),
w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"),
w4 = c(NA,NA, NA, NA, "on", "can", NA, NA)),
row.names = c(NA, -8L), class = "data.frame")
df
#> size w1 w2 w3 w4
#> 1 2 come on <NA> <NA>
#> 2 2 why that <NA> <NA>
#> 3 3 er i can <NA>
#> 4 3 well not today <NA>
#> 5 4 she 's going on
#> 6 4 well thanks they can
#> 7 3 er super cool <NA>
#> 8 3 well she can <NA>
library(tidyverse)
library(janitor)
df %>% pivot_longer(!size, values_drop_na = T) %>%
filter(as.numeric(substr(name, 2, nchar(name))) == size) %>%
janitor::tabyl(value)
#> value n percent
#> can 3 0.375
#> cool 1 0.125
#> on 2 0.250
#> that 1 0.125
#> today 1 0.125
Created on 2021-05-06 by the reprex package (v2.0.0)
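If the size column were missing or unreliable, a variant of the pivoting idea could keep the last non-NA word per utterance instead of matching on size. The following is only a sketch under that assumption; utt_id and final_freq are hypothetical names introduced for illustration.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  size = c(2, 2, 3, 3, 4, 4, 3, 3),
  w1 = c("come", "why", "er", "well", "she", "well", "er", "well"),
  w2 = c("on", "that", "i", "not", "'s", "thanks", "super", "she"),
  w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"),
  w4 = c(NA, NA, NA, NA, "on", "can", NA, NA),
  stringsAsFactors = FALSE
)

final_freq <- df %>%
  # give every utterance an id so its words stay together after pivoting
  mutate(utt_id = row_number()) %>%
  # one row per (utterance, word); NA slots are dropped
  pivot_longer(starts_with("w"), values_drop_na = TRUE) %>%
  group_by(utt_id) %>%
  # pivot_longer() keeps the w1..w4 column order, so the last row is the final word
  slice_tail(n = 1) %>%
  ungroup() %>%
  count(value, name = "n") %>%
  mutate(prop = n / sum(n))

final_freq
```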
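3. As an aside, you can also reverse the alignment so that the words are counted from the last column (the final word then always lands in the last w column), using unite and separate from tidyr.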
df %>% unite('W', starts_with('w'), sep = '=', na.rm = T, remove = T) %>%
separate(W, into = paste0('w', seq_len(1 + max(str_count(.$W, '=')))), fill = 'left', sep = '=')
size w1 w2 w3 w4
1 2 <NA> <NA> come on
2 2 <NA> <NA> why that
3 3 <NA> er i can
4 3 <NA> well not today
5 4 she 's going on
6 4 well thanks they can
7 3 <NA> er super cool
8 3 <NA> well she can
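The output above stops at the right-aligned table. One way to finish the job, sketched here under the assumption that no word contains the separator "=", is to count whatever ends up in the last column. The intermediate names (united, n_cols, aligned, last_col, final_freq) are illustrative, and the pipe is split into steps only to avoid the .$W shortcut.

```r
library(dplyr)
library(tidyr)
library(stringr)

df <- data.frame(
  size = c(2, 2, 3, 3, 4, 4, 3, 3),
  w1 = c("come", "why", "er", "well", "she", "well", "er", "well"),
  w2 = c("on", "that", "i", "not", "'s", "thanks", "super", "she"),
  w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"),
  w4 = c(NA, NA, NA, NA, "on", "can", NA, NA),
  stringsAsFactors = FALSE
)

# glue the words of each utterance together, skipping NA slots
united <- unite(df, "W", starts_with("w"), sep = "=", na.rm = TRUE)

# the longest utterance determines how many columns are needed
n_cols <- 1 + max(str_count(united$W, "="))

# split again, padding short utterances with NA on the left
aligned <- separate(united, W, into = paste0("w", seq_len(n_cols)),
                    sep = "=", fill = "left")

# after right-alignment the final word is always in the last column
last_col <- paste0("w", n_cols)
final_freq <- aligned %>%
  count(final_word = .data[[last_col]], name = "n") %>%
  mutate(prop = n / sum(n))

final_freq
```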
You can subset df with a two-column index matrix built from the row numbers (seq_len(nrow(df))) and the values in df$size, make a table, and compute proportions.
tt <- table(df[-1][cbind(seq_len(nrow(df)), df$size)])
cbind(tt, proportions(tt))
# tt
#can 3 0.375
#cool 1 0.125
#on 2 0.250
#that 1 0.125
#today 1 0.125
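For readers unfamiliar with indexing by a two-column matrix, the following sketch shows the mechanism on a toy matrix and then applies it to the question's data. Each row of the index matrix is a (row, column) pair, and one element is returned per pair. The toy matrix m and the names idx, picked, words and final_word are illustrative only.

```r
# toy example: a 2 x 3 character matrix
m <- matrix(c("a", "b", "c",
              "d", "e", "f"), nrow = 2, byrow = TRUE)

# each row of idx is one (row, column) pair
idx <- cbind(c(1, 2), c(3, 1))
picked <- m[idx]   # elements m[1, 3] and m[2, 1]

# the same idea on the question's data
df <- data.frame(
  size = c(2, 2, 3, 3, 4, 4, 3, 3),
  w1 = c("come", "why", "er", "well", "she", "well", "er", "well"),
  w2 = c("on", "that", "i", "not", "'s", "thanks", "super", "she"),
  w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"),
  w4 = c(NA, NA, NA, NA, "on", "can", NA, NA),
  stringsAsFactors = FALSE
)

# drop the size column so that column k of the matrix is word k
words <- as.matrix(df[-1])

# for row i, take column df$size[i], i.e. the utterance-final word
final_word <- words[cbind(seq_len(nrow(df)), df$size)]
final_word
```

This relies on size being the exact number of words in each row; if size were off by one for some utterance, the wrong word (or NA) would be picked without any warning.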
A base R option
out <- rev(
stack(
prop.table(
table(apply(df, 1, function(x) tail(na.omit(x), 1)))
)
)
)
gives
ind values
1 can 0.375
2 cool 0.125
3 on 0.250
4 that 0.125
5 today 0.125
If you want to sort the rows in descending order, you can run
> out[order(-out$values), ]
ind values
1 can 0.375
3 on 0.250
2 cool 0.125
4 that 0.125
5 today 0.125
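The table above carries proportions only, whereas the expected output in the question also has counts. A minimal base R sketch that keeps both might look like this; the column names w_last, n and prop follow the question, and the word columns are selected with df[-1] as an extra precaution so that size is never considered.

```r
df <- data.frame(
  size = c(2, 2, 3, 3, 4, 4, 3, 3),
  w1 = c("come", "why", "er", "well", "she", "well", "er", "well"),
  w2 = c("on", "that", "i", "not", "'s", "thanks", "super", "she"),
  w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"),
  w4 = c(NA, NA, NA, NA, "on", "can", NA, NA),
  stringsAsFactors = FALSE
)

# last non-NA word in each row of the word columns
final_word <- apply(df[-1], 1, function(x) tail(na.omit(x), 1))

# counts and proportions side by side
tt <- table(final_word)
out <- data.frame(
  w_last = names(tt),
  n = as.vector(tt),
  prop = as.vector(prop.table(tt)),
  stringsAsFactors = FALSE
)

# most frequent final words first
out <- out[order(-out$n), ]
out
```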