How to manipulate (aggregate) the data in R?
I have a dataset that looks like this:
library(tibble)
df <- tribble(
~id, ~price, ~number_of_book,
"1", 10, 3,
"1", 5, 1,
"2", 7, 4,
"2", 6, 2,
"2", 3, 4,
"3", 4, 1,
"4", 5, 1,
"4", 6, 1,
"5", 1, 2,
"5", 9, 3,
)
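As you can see in the dataset, for id "1" there are 3 books priced at $10 each and 1 book priced at $5. Basically, I want to see the share (%) of the number of books in each price range. This is the dataset I want: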
df <- tribble(
~id, ~less_than_three, ~`three-five`, ~`five-six`, ~more_than_six,
"1", "0%", "25%", "0%", "75%",
"2", "0%", "40%", "20%", "40%",
"3", "0%", "100%", "0%", "0%",
"4", "0%", "50%", "50%", "0%",
"5", "40%", "0%", "0%", "60%",
)
So far, I have binned the prices into intervals. To do that, I ran the following code:
out <- cut(df$price, breaks = c(0, 3, 5, 6, 10),
labels = c("<3","3-5","5-6", ">6"))
out = table(out) / sum(table(out))
Unfortunately, because of my limited coding knowledge, I could not get any further. Could you help me get the desired data?
We can use cut to get the intervals, then use tidyr to reshape the data to wide format, and finally use janitor to add the percentages.
library(dplyr)
library(tidyr)
library(janitor)
df %>%
  mutate(interval = cut(price, c(0,3,5,6,Inf))) %>%
  select(-price) %>%
  pivot_wider(names_from = interval, values_from = number_of_book) %>%
  adorn_percentages()
#>  id (6,Inf] (3,5] (5,6] (0,3]
#>   1    0.75  0.25    NA    NA
#>   2    0.40    NA   0.2   0.4
#>   3      NA  1.00    NA    NA
#>   4      NA  0.50   0.5    NA
#>   5    0.60    NA    NA   0.4
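One detail worth checking is how cut treats prices that sit exactly on a break. By default its intervals are closed on the right, so a price of 3 lands in (0,3], whereas the table in the question lists the price-3 books of id "2" under three-five. The base R sketch below only illustrates the two conventions on the boundary prices; the names breaks, prices, right_closed and left_closed are illustrative and not from the original post.

```r
breaks <- c(0, 3, 5, 6, Inf)
prices <- c(3, 5, 6)  # prices that sit exactly on a break

# Default (right = TRUE): intervals are (a, b], so 3 falls in the lowest bin
right_closed <- as.character(cut(prices, breaks))
right_closed
#> [1] "(0,3]" "(3,5]" "(5,6]"

# right = FALSE: intervals are [a, b), so each boundary price moves up one bin
left_closed <- as.character(cut(prices, breaks, right = FALSE))
left_closed
#> [1] "[3,5)"   "[5,6)"   "[6,Inf)"
```

Neither setting reproduces the question's table exactly (it appears to count 3 and 5 both in three-five), so the breaks themselves would need adjusting if that table is the target. Which convention was intended is not stated in the question.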
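With dplyr, you can add a column, cols, that will supply the column names. You can then sum the number of books for each value of cols within each id. Next, compute the percentages by dividing those sums by the total for that id, and apply scales::percent to format them as percentages rather than decimals. All that remains is to call pivot_wider, telling it which variables to take the names and values from, and to rearrange the columns to match the original label order. (This is more involved than the other answer because it handles the case where a given (id, cols/interval) pair has more than one row, and because janitor simplifies things there.)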
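library(dplyr)
library(tidyr)

labels <- c("less_than_three", "three_to_five", "five_to_six", "more_than_six")
df %>%
  # bin each price and total the books per (id, bin) pair
  group_by(id, cols = cut(price, breaks = c(0, 3, 5, 6, 10), labels = labels)) %>%
  summarise(n = sum(number_of_book)) %>%
  # share of each bin within its id, formatted as a whole-number percent
  group_by(id) %>%
  mutate(pct = scales::percent(n / sum(n), 1)) %>%
  pivot_wider(id_cols = id, names_from = cols, values_from = pct) %>%
  # restore the original label order
  select(id, all_of(labels)) %>%
  ungroup()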
# # A tibble: 5 x 5
#   id    less_than_three three_to_five five_to_six more_than_six
#   <chr> <chr>           <chr>         <chr>       <chr>
# 1 1     NA              25%           NA          75%
# 2 2     40%             NA            20%         40%
# 3 3     NA              100%          NA          NA
# 4 4     NA              50%           50%         NA
# 5 5     40%             NA            NA          60%
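The remark about several rows sharing one (id, interval) pair can be illustrated with a small made-up data frame. This is a base R sketch under stated assumptions: dup is hypothetical data (not the question's), and aggregate and ave stand in for the group_by and summarise steps used above.

```r
# Hypothetical data: id "1" has two rows whose prices both fall in (3,5]
dup <- data.frame(
  id = c("1", "1", "1"),
  price = c(4, 5, 10),
  number_of_book = c(1, 2, 3)
)
dup$interval <- cut(dup$price, breaks = c(0, 3, 5, 6, Inf))

# Total the books per (id, interval) pair first, so each pair has exactly one row
totals <- aggregate(number_of_book ~ id + interval, data = dup, FUN = sum)

# Then divide by the per-id total to get each interval's share
totals$share <- totals$number_of_book / ave(totals$number_of_book, totals$id, FUN = sum)
totals
```

Summing before reshaping means each (id, interval) pair maps to a single value, which is what a wide table with one cell per pair needs.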
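If you want to replace the NAs with 0% (which I think makes sense here, and which matches the output shown in the question), you can use the values_fill argument of pivot_wider, as suggested in the comments.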
df %>%
  group_by(id, cols = cut(price, breaks = c(0, 3, 5, 6, 10), labels = labels)) %>%
  summarise(n = sum(number_of_book)) %>%
  group_by(id) %>%
  mutate(pct = scales::percent(n/sum(n), 1)) %>%
  pivot_wider(id_cols = id, names_from = cols, values_from = pct,
              values_fill = list(pct = '0%')) %>%
  select_at(c('id', labels)) %>%
  ungroup
# # A tibble: 5 x 5
#   id    less_than_three three_to_five five_to_six more_than_six
#   <chr> <chr>           <chr>         <chr>       <chr>
# 1 1     0%              25%           0%          75%
# 2 2     40%             0%            20%         40%
# 3 3     0%              100%          0%          0%
# 4 4     0%              50%           50%         0%
# 5 5     40%             0%            0%          60%
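As a cross-check, the same shares can be computed with base R alone. This is a minimal sketch and not part of either answer: it rebuilds the question's data as a plain data.frame, swaps in xtabs and prop.table for the dplyr/tidyr steps, and uses Inf as the top break (an assumption; the code above uses 10).

```r
# The question's data rebuilt as a plain data.frame (no packages needed)
df <- data.frame(
  id = c("1", "1", "2", "2", "2", "3", "4", "4", "5", "5"),
  price = c(10, 5, 7, 6, 3, 4, 5, 6, 1, 9),
  number_of_book = c(3, 1, 4, 2, 4, 1, 1, 1, 2, 3)
)
labels <- c("less_than_three", "three_to_five", "five_to_six", "more_than_six")
df$interval <- cut(df$price, breaks = c(0, 3, 5, 6, Inf), labels = labels)

# Sum the books per (id, interval); empty cells come out as 0 rather than NA
counts <- xtabs(number_of_book ~ id + interval, data = df)

# Row-wise proportions: each id's row sums to 1
shares <- prop.table(counts, margin = 1)
round(100 * shares)
```

Because xtabs fills missing combinations with 0, no separate NA replacement step is needed here. For id "1" this gives 25 under three_to_five and 75 under more_than_six.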