How to count frequency of rows based on values in 3 columns in RStudio
I have thousands of rows of data that look like this:
df <- data.frame(
thing_code = c("X123", "X123", "Y123", "Y123", "Y123", "Y123", "Z123", "Z123", "Z123", "Z123", "A456", "A456", "A456", "A456", "A456"),
year = c("2001", "2001", "2004", "2004", "2004", "2004", "2004", "2004", "2004", "2004", "2007", "2007", "2007", "2007", "2007"),
country = c("Vietnam", "Vietnam", "US", "US", "Singapore", "Vietnam", "Japan", "Vietnam", "Vietnam", "Cambodia", "Vietnam", "Vietnam", "Iran", "China", "Germany"))
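I want to calculate how much each country contributes to each thing (identified by thing_code) in each year. The categories I want to count are Vietnam, SEAsian, and Non-local, the labels used in the desired output below.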
I would like to end up with something like this:
# thing_code  year  location   freq  percentage
# X123        2001  Vietnam    2     1
# Y123        2004  Vietnam    1     0.25
# Y123        2004  Non-local  2     0.5
# Y123        2004  SEAsian    1     0.25
# Z123        2004  Non-local  1     0.25
# Z123        2004  Vietnam    2     0.5
# Z123        2004  SEAsian    1     0.25
# A456        2007  Vietnam    2     0.4
# A456        2007  Non-local  3     0.6
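freq is a count of the rows in each of the categories above, and percentage is each category's share of the rows for that thing_code and year (expressed as a proportion between 0 and 1).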
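To make the definition concrete, here is a minimal sketch for the Y123 / 2004 group only, using base R's table() and prop.table(). The labels vector is written out by hand for illustration, assuming US counts as Non-local and Singapore as SEAsian, as in the desired output above.

```r
# Category labels for the four Y123 / 2004 rows (US, US, Singapore, Vietnam),
# written out by hand for illustration only.
labels <- c("Non-local", "Non-local", "SEAsian", "Vietnam")

freq <- table(labels)           # number of rows per category
percentage <- prop.table(freq)  # each count divided by the group total (4 rows)

print(freq)
print(percentage)
```

The proportions within one thing_code and year always sum to 1, which is a handy sanity check on the full result.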
So far, my code looks like this:
library(dplyr)    # provides %>% and filter()
library(stringr)  # provides str_detect()
Vietnam <- df %>% filter(str_detect(country, "Vietnam"))
thing_code_year <- subset(Vietnam, select=c(thing_code, year))
freq <- table(thing_code_year)
frequency <- as.data.frame(freq)
frequency <- frequency %>% filter(Freq!=0)
But this only gives me the numbers for Vietnam, and repeating it for every other category would take a long time.
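This should give you the output you want. You can use case_when to create a new variable, location, that assigns each row to a category using the logic above. Next, group_by thing_code, year, and the newly created location to count the frequency of each category in location (Vietnam, SEAsian, Non-local). Then group_by thing_code and year to calculate each category's percentage/proportion within that group.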
library(dplyr)

df <- data.frame(
  thing_code = c("X123", "X123", "Y123", "Y123", "Y123", "Y123", "Z123", "Z123", "Z123", "Z123", "A456", "A456", "A456", "A456", "A456"),
  year = c("2001", "2001", "2004", "2004", "2004", "2004", "2004", "2004", "2004", "2004", "2007", "2007", "2007", "2007", "2007"),
  country = c("Vietnam", "Vietnam", "US", "US", "Singapore", "Vietnam", "Japan", "Vietnam", "Vietnam", "Cambodia", "Vietnam", "Vietnam", "Iran", "China", "Germany"))

# Countries treated as Southeast Asian; extend this vector for the full data
SEAsian <- c("Vietnam", "Singapore", "Cambodia")

df %>%
  # Assign each row to one of three categories
  mutate(location = case_when(
    country == "Vietnam" ~ "Vietnam",
    country %in% SEAsian[SEAsian != "Vietnam"] ~ "SEAsian",
    !country %in% SEAsian ~ "Non-local"
  )) %>%
  # Count rows per thing_code, year and category
  group_by(thing_code, year, location) %>%
  summarise(freq = n()) %>%
  # Share of each category within its thing_code and year
  group_by(thing_code, year) %>%
  mutate(percentage = freq/sum(freq))
Output:
  thing_code year  location   freq percentage
  <fct>      <fct> <chr>     <int>      <dbl>
1 A456       2007  Non-local     3       0.6
2 A456       2007  Vietnam       2       0.4
3 X123       2001  Vietnam       2       1
4 Y123       2004  Non-local     2       0.5
5 Y123       2004  SEAsian       1       0.25
6 Y123       2004  Vietnam       1       0.25
7 Z123       2004  Non-local     1       0.25
8 Z123       2004  SEAsian       1       0.25
9 Z123       2004  Vietnam       2       0.5
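As a variation, the group_by/summarise pair can be collapsed into a single count() call. This is only a sketch under a few assumptions: it relies on the name argument of count(), which needs a reasonably recent dplyr, it uses TRUE as a catch-all for Non-local, and the result name is my own choice. The SEAsian vector still only lists the countries that appear in the sample data.

```r
library(dplyr)

# Same sample data as in the question
df <- data.frame(
  thing_code = c("X123", "X123", "Y123", "Y123", "Y123", "Y123", "Z123", "Z123", "Z123", "Z123", "A456", "A456", "A456", "A456", "A456"),
  year = c("2001", "2001", "2004", "2004", "2004", "2004", "2004", "2004", "2004", "2004", "2007", "2007", "2007", "2007", "2007"),
  country = c("Vietnam", "Vietnam", "US", "US", "Singapore", "Vietnam", "Japan", "Vietnam", "Vietnam", "Cambodia", "Vietnam", "Vietnam", "Iran", "China", "Germany"))

# Assumption: extend this vector with the other Southeast Asian countries in the real data
SEAsian <- c("Vietnam", "Singapore", "Cambodia")

result <- df %>%
  mutate(location = case_when(
    country == "Vietnam" ~ "Vietnam",
    country %in% SEAsian ~ "SEAsian",    # Vietnam was already matched by the line above
    TRUE                 ~ "Non-local"   # everything else
  )) %>%
  # count() groups, tallies and ungroups in one step
  count(thing_code, year, location, name = "freq") %>%
  group_by(thing_code, year) %>%
  mutate(percentage = freq / sum(freq)) %>%
  ungroup()

print(result)
```

Because case_when evaluates its conditions in order, the Vietnam rows never reach the SEAsian test, so there is no need to exclude Vietnam from the vector explicitly.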