How to count frequency of rows based on values in 3 columns in RStudio

I have thousands of rows of data that look like this:

df <- data.frame(
thing_code = c("X123", "X123", "Y123", "Y123", "Y123", "Y123", "Z123", "Z123", "Z123", "Z123", "A456", "A456", "A456", "A456", "A456"),
year = c("2001", "2001", "2004", "2004", "2004", "2004", "2004", "2004", "2004", "2004", "2007", "2007", "2007", "2007", "2007"),
country = c("Vietnam", "Vietnam", "US", "US", "Singapore", "Vietnam", "Japan", "Vietnam", "Vietnam", "Cambodia", "Vietnam", "Vietnam", "Iran", "China", "Germany"))

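For each thing (identified by thing_code) and each year, I want to calculate how much each group of countries contributes. The categories I want to count are:

  • Vietnam (the local country in this example)
  • SEAsian (all Southeast Asian countries other than Vietnam)
  • Non-local (countries outside Vietnam and Southeast Asia)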

I would like to end up with something like this:

# thing_code  year    location    freq    percentage
# X123        2001    Vietnam     2       1
# Y123        2004    Vietnam     1       0.25
# Y123        2004    Non-local   2       0.5
# Y123        2004    SEAsian     1       0.25
# Z123        2004    Non-local   1       0.25
# Z123        2004    Vietnam     2       0.5
# Z123        2004    SEAsian     1       0.25
# A456        2007    Vietnam     2       0.4
# A456        2007    Non-local   3       0.6

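freq is the count of rows in each of the categories above, and percentage is each category's share (as a proportion) of the rows for that thing_code and year.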

So far my code looks like this:

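library(dplyr)
library(stringr)

# Keep only the rows for Vietnam
Vietnam <- df %>% filter(str_detect(country, "Vietnam"))

# Count rows per thing_code/year combination
thing_code_year <- subset(Vietnam, select=c(thing_code, year))
freq <- table(thing_code_year)

# Drop combinations that never occur
frequency <- as.data.frame(freq)
frequency <- frequency %>% filter(Freq!=0)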

But this only gives me the numbers for Vietnam, and repeating it for each of the other categories would take a long time.

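This should give you the output you want. You can use case_when to create a new variable, location, following the logic above. Next, group_by thing_code, year, and the newly created location to count the frequency of each category (Vietnam, SEAsian, Non-local). Then group_by thing_code and year only, and divide each count by the group total to get each category's proportion.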

library(dplyr)

df <- data.frame(
  thing_code = c("X123", "X123", "Y123", "Y123", "Y123", "Y123", "Z123", "Z123", "Z123", "Z123", "A456", "A456", "A456", "A456", "A456"),
  year = c("2001", "2001", "2004", "2004", "2004", "2004", "2004", "2004", "2004", "2004", "2007", "2007", "2007", "2007", "2007"),
  country = c("Vietnam", "Vietnam", "US", "US", "Singapore", "Vietnam", "Japan", "Vietnam", "Vietnam", "Cambodia", "Vietnam", "Vietnam", "Iran", "China", "Germany"))

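# Southeast Asian countries present in the sample data (including Vietnam);
# extend this vector to cover the countries in the full data set
SEAsian <- c("Vietnam", "Singapore", "Cambodia")

df %>% 
  # Assign each row to one of the three categories
  mutate(location = case_when(
    country == "Vietnam" ~ "Vietnam",
    country %in% SEAsian[SEAsian != "Vietnam"] ~ "SEAsian",
    !country %in% SEAsian ~ "Non-local"
  )) %>% 
  # Count rows per thing_code, year, and category
  group_by(thing_code, year, location) %>% 
  summarise(freq = n()) %>% 
  # Share of each category within its thing_code/year group
  group_by(thing_code, year) %>% 
  mutate(percentage = freq/sum(freq))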

Output:

  thing_code year  location   freq percentage
  <fct>      <fct> <chr>     <int>      <dbl>
1 A456       2007  Non-local     3       0.6 
2 A456       2007  Vietnam       2       0.4 
3 X123       2001  Vietnam       2       1   
4 Y123       2004  Non-local     2       0.5 
5 Y123       2004  SEAsian       1       0.25
6 Y123       2004  Vietnam       1       0.25
7 Z123       2004  Non-local     1       0.25
8 Z123       2004  SEAsian       1       0.25
9 Z123       2004  Vietnam       2       0.5 
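
As a possible shorter variant of the pipeline above, the group_by/summarise pair can be swapped for dplyr's count(), which does the grouping and counting in one step. This is only a sketch under a few assumptions: the name other_sea is made up for illustration, its two entries cover just the sample data, and the final TRUE branch sends every remaining country (including any missing values) to "Non-local".

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  thing_code = c("X123", "X123", "Y123", "Y123", "Y123", "Y123", "Z123", "Z123",
                 "Z123", "Z123", "A456", "A456", "A456", "A456", "A456"),
  year = c("2001", "2001", "2004", "2004", "2004", "2004", "2004", "2004",
           "2004", "2004", "2007", "2007", "2007", "2007", "2007"),
  country = c("Vietnam", "Vietnam", "US", "US", "Singapore", "Vietnam", "Japan",
              "Vietnam", "Vietnam", "Cambodia", "Vietnam", "Vietnam", "Iran",
              "China", "Germany")
)

# Hypothetical, incomplete list of Southeast Asian countries other than Vietnam;
# extend it to match the real data
other_sea <- c("Singapore", "Cambodia")

result <- df %>%
  # Assign each row to one of the three categories
  mutate(location = case_when(
    country == "Vietnam"   ~ "Vietnam",
    country %in% other_sea ~ "SEAsian",
    TRUE                   ~ "Non-local"  # everything else
  )) %>%
  # count() groups by the listed columns and counts rows in one call
  count(thing_code, year, location, name = "freq") %>%
  # Proportion of each category within its thing_code/year group
  group_by(thing_code, year) %>%
  mutate(percentage = freq / sum(freq)) %>%
  ungroup()

print(result)
```

The trailing ungroup() returns a plain tibble, so later steps such as arrange() or filter() are not affected by leftover grouping.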
