Number of observations as a share of total observations per year
I have the following data frame in R:
Year ID
1 2018 x
2 2018 x
3 2018 y
4 2018 z
5 2019 x
6 2019 x
7 2019 z
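For each year separately, I want to calculate the share of "x" in the "ID" column among all observations of that year.
The result should look like this: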
Year Share of x
2018 50 %
2019 67 %
Is it possible to do this with aggregate, something like this:
aggregate(length(which(df$ID == x)) / length(df$ID), by=Year)
or with any other function?
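Assuming the data shown reproducibly in the Note at the end, use table to compute the counts and then prop.table to express each count as a proportion of its row.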
prop.table(table(dat), 1)
## ID
## Year x y z
## 2018 0.5000000 0.2500000 0.2500000
## 2019 0.6666667 0.0000000 0.3333333
Or, if you want the proportions within each column:
prop.table(table(dat), 2)
## ID
## Year x y z
## 2018 0.5 1.0 0.5
## 2019 0.5 0.0 0.5
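The row-proportion table contains every ID, while the question only asks for the share of x. A minimal sketch of going from one to the other, assuming the same seven rows as in the question; the names `row_props` and `share_x` are made up for illustration:

```r
# Same observations as in the question, built directly
dat <- data.frame(
  Year = c(2018, 2018, 2018, 2018, 2019, 2019, 2019),
  ID   = c("x", "x", "y", "z", "x", "x", "z")
)

# Proportion of each ID within its year (rows sum to 1)
row_props <- prop.table(table(dat), 1)

# Keep only the "x" column and express it as a rounded percentage
share_x <- data.frame(
  Year       = as.integer(rownames(row_props)),
  share_of_x = round(100 * row_props[, "x"]),
  row.names  = NULL
)
share_x
```

This should give 50 for 2018 and 67 for 2019, matching the result requested in the question.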
Regarding the aggregate tag on the question, the first case (row proportions) can be done like this:
aggregate(ID ~ Year, dat,
function(id) sapply(unique(dat$ID), function(x) setNames(mean(id == x), x)))
## Year ID.x ID.y ID.z
## 1 2018 0.5000000 0.2500000 0.2500000
## 2 2019 0.6666667 0.0000000 0.3333333
or using aggregate and table together:
aggregate(ID ~ Year, dat, function(x) table(x) / length(x))
## Year ID.x ID.y ID.z
## 1 2018 0.5000000 0.25 0.2500000
## 2 2019 0.6666667 0.00 0.3333333
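or using dplyr and tidyr:
library(dplyr)
library(tidyr)
dat %>%
  count(Year, ID) %>%                 # observations per Year/ID pair
  group_by(Year) %>%
  mutate(prop = n / sum(n)) %>%       # share within each year
  # id_cols is named explicitly; passing -n positionally is deprecated in recent tidyr
  pivot_wider(id_cols = Year, names_from = ID, values_from = prop, values_fill = list(prop = 0))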
## # A tibble: 2 x 4
## # Groups: Year [2]
## Year x y z
## <int> <dbl> <dbl> <dbl>
## 1 2018 0.5 0.25 0.25
## 2 2019 0.667 0 0.333
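Note
Lines <- " Year ID
1 2018 x
2 2018 x
3 2018 y
4 2018 z
5 2019 x
6 2019 x
7 2019 z "
# stringsAsFactors = TRUE keeps every ID level in each year's table();
# since R 4.0 the default is FALSE, which breaks the aggregate/table variant
dat <- read.table(text = Lines, stringsAsFactors = TRUE)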
Maybe you want something like this:
dfout<- setNames(aggregate(ID~Year,df,function(v) sum(v=="x")/length(v)*100),
c("Year","Share of x"))
which gives
> dfout
Year Share of x
1 2018 50.00000
2 2019 66.66667
Data
df <-structure(list(Year = c(2018L, 2018L, 2018L, 2018L, 2019L, 2019L,
2019L), ID = c("x", "x", "y", "z", "x", "x", "z")), class = "data.frame", row.names = c(NA,
-7L))
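The expression sum(v == "x") / length(v) in the aggregate call is just the mean of a logical vector, so the same shares can also be obtained with tapply. This is only a sketch of an alternative, not part of the answer above; `share` and `dfout2` are hypothetical names:

```r
# Same data as in the answer above
df <- data.frame(
  Year = c(2018L, 2018L, 2018L, 2018L, 2019L, 2019L, 2019L),
  ID   = c("x", "x", "y", "z", "x", "x", "z")
)

# mean() of a logical vector is the fraction of TRUE values,
# so this is the share of "x" per year, scaled to percent
share <- tapply(df$ID == "x", df$Year, mean) * 100

# Turn the named result into a two-column data frame
dfout2 <- data.frame(
  Year       = as.integer(names(share)),
  share_of_x = as.vector(share)
)
dfout2
```

Whether tapply or aggregate reads better is a matter of taste; aggregate returns a data frame directly, while tapply returns a named array that has to be reshaped.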
A tidyverse approach:
library(tidyverse)
data<- tribble(~year,~id,
2018,"x",
2018,"x",
2018,"y",
2018,"z",
2019,"x",
2019,"x",
2019,"z"
)
agg <- data %>%
  group_by(year, id) %>%
  summarise(cnt_id = n()) %>%          # count of each id per year
  group_by(year) %>%
  mutate(cnt_obs = sum(cnt_id),        # total observations per year
         share = cnt_id / cnt_obs) %>% # share of each id within its year
  filter(id == "x") %>%
  select(year, id, share)
head(agg)
year id share
<dbl> <chr> <dbl>
1 2018 x 0.5
2 2019 x 0.667
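If only the share of x is needed, the count-then-divide steps can probably be collapsed into a single summarise, again relying on the mean of a logical vector. A sketch assuming dplyr is installed; `obs` and `share_x` are names made up for this example:

```r
library(dplyr)

# Same observations as above, as a plain data frame
obs <- data.frame(
  year = c(2018, 2018, 2018, 2018, 2019, 2019, 2019),
  id   = c("x", "x", "y", "z", "x", "x", "z")
)

# One row per year; share is the fraction of rows whose id is "x"
share_x <- obs %>%
  group_by(year) %>%
  summarise(share = mean(id == "x"))
share_x
```

Unlike the filter-based version, this also returns a row (with share 0) for a year in which x never occurs.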
I would argue that the 2019 value for y is missing rather than zero, but still:
library(tidyverse)
df<- tribble(~year,~id,
2018,"x",
2018,"x",
2018,"y",
2018,"z",
2019,"x",
2019,"x",
2019,"z"
)
df %>%
group_by(year,id) %>%
tally() %>%
group_by(year) %>%
mutate(prop = n/sum(n)) %>%
ungroup() %>%
select(-n) %>%
pivot_wider(names_from = id,values_from = prop) %>%
mutate_all(~ replace_na(.,replace = 0))