How to calculate the frequency of rows based on the values in 3 columns in RStudio

Question

I have thousands of rows of data that look like this:

df <- data.frame(
thing_code = c("X123", "X123", "Y123", "Y123", "Y123", "Y123", "Z123", "Z123", "Z123", "Z123", "A456", "A456", "A456", "A456", "A456"),
year = c("2001", "2001", "2004", "2004", "2004", "2004", "2004", "2004", "2004", "2004", "2007", "2007", "2007", "2007", "2007"),
country = c("Vietnam", "Vietnam", "US", "US", "Singapore", "Vietnam", "Japan", "Vietnam", "Vietnam", "Cambodia", "Vietnam", "Vietnam", "Iran", "China", "Germany"))

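For each thing (identified by thing_code) and each year, I want to calculate how much each country contributes. The categories I want to count are: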

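Vietnam (the local country in this example)
SEAsian (all Southeast Asian countries other than Vietnam)
Non-local (countries/regions outside Vietnam and Southeast Asia)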

I would like to end up with something like this:

# thing_code  year    location    freq    percentage
# X123        2001    Vietnam     2       1
# Y123        2004    Vietnam     1       0.25
# Y123        2004    Non-local   2       0.5
# Y123        2004    SEAsian     1       0.25
# Z123        2004    Non-local   1       0.25
# Z123        2004    Vietnam     2       0.5
# Z123        2004    SEAsian     1       0.25
# A456        2007    Vietnam     2       0.4
# A456        2007    Non-local   3       0.6

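freq is the number of rows in each of the categories above, and percentage is that category's share of the rows for the same thing_code and year, expressed as a proportion between 0 and 1.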
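As a quick check of the arithmetic behind these two columns, here is a minimal base R sketch for the Y123 / 2004 group only. The vector `y123` and the one-country `other_sea` set are illustrative assumptions built from the sample rows, not part of the original code.

```r
# Countries recorded for thing_code Y123 in 2004 (copied from the sample data)
y123 <- c("US", "US", "Singapore", "Vietnam")

# Assumption: for this group, Singapore is the only non-Vietnam Southeast Asian country
other_sea <- c("Singapore")

# Map each country to one of the three categories
location <- ifelse(y123 == "Vietnam", "Vietnam",
                   ifelse(y123 %in% other_sea, "SEAsian", "Non-local"))

freq <- table(location)          # rows per category
percentage <- prop.table(freq)   # freq divided by the group total (4 rows)

print(freq)
print(percentage)
```

With four rows in the group, one Vietnam row gives 1/4 = 0.25 and two Non-local rows give 2/4 = 0.5, which matches the Y123 lines in the desired output.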

So far my code looks like this:

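library(dplyr)    # %>% and filter()
library(stringr)  # str_detect()

# Keep only the rows whose country is Vietnam
Vietnam <- df %>% filter(str_detect(country, "Vietnam"))

# Cross-tabulate thing_code by year for those rows
thing_code_year <- subset(Vietnam, select=c(thing_code, year))
freq <- table(thing_code_year)

# Convert to a data frame and drop the empty combinations
frequency <- as.data.frame(freq)
frequency <- frequency %>% filter(Freq!=0)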

But this only gives me the numbers for Vietnam, and repeating it by hand for the other categories would take a long time.

Answer 1

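This should give you the output you want. Use case_when to create a new variable, location, that follows the logic above. Next, group_by thing_code, year and the newly created location to count the frequency of each category (Vietnam, SEAsian, Non-local). Then group_by thing_code and year to compute each category's proportion within that group.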

library(dplyr)

df <- data.frame(
  thing_code = c("X123", "X123", "Y123", "Y123", "Y123", "Y123", "Z123", "Z123", "Z123", "Z123", "A456", "A456", "A456", "A456", "A456"),
  year = c("2001", "2001", "2004", "2004", "2004", "2004", "2004", "2004", "2004", "2004", "2007", "2007", "2007", "2007", "2007"),
  country = c("Vietnam", "Vietnam", "US", "US", "Singapore", "Vietnam", "Japan", "Vietnam", "Vietnam", "Cambodia", "Vietnam", "Vietnam", "Iran", "China", "Germany"))

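# Southeast Asian countries that appear in the sample data
SEAsian <- c("Vietnam", "Singapore", "Cambodia")

df %>% 
  # Assign each row to one of the three categories
  mutate(location = case_when(
    country == "Vietnam" ~ "Vietnam",
    country %in% SEAsian[SEAsian != "Vietnam"] ~ "SEAsian",
    !country %in% SEAsian ~ "Non-local"
  )) %>% 
  # Count rows per thing_code, year and location
  group_by(thing_code, year, location) %>% 
  summarise(freq = n()) %>% 
  # Share of each location within its thing_code and year
  group_by(thing_code, year) %>% 
  mutate(percentage = freq/sum(freq))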

Output:

  thing_code year  location   freq percentage
  <fct>      <fct> <chr>     <int>      <dbl>
1 A456       2007  Non-local     3       0.6 
2 A456       2007  Vietnam       2       0.4 
3 X123       2001  Vietnam       2       1   
4 Y123       2004  Non-local     2       0.5 
5 Y123       2004  SEAsian       1       0.25
6 Y123       2004  Vietnam       1       0.25
7 Z123       2004  Non-local     1       0.25
8 Z123       2004  SEAsian       1       0.25
9 Z123       2004  Vietnam       2       0.5
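The same grouping logic can probably be written a little more compactly with `dplyr::count()`. The following is only a sketch of that variant, not a replacement for the answer's code. The names `other_sea` and `result` are hypothetical, and `other_sea` lists only the countries present in the sample, so it would need extending for real data.

```r
library(dplyr)

df <- data.frame(
  thing_code = c("X123", "X123", "Y123", "Y123", "Y123", "Y123", "Z123", "Z123", "Z123", "Z123", "A456", "A456", "A456", "A456", "A456"),
  year = c("2001", "2001", "2004", "2004", "2004", "2004", "2004", "2004", "2004", "2004", "2007", "2007", "2007", "2007", "2007"),
  country = c("Vietnam", "Vietnam", "US", "US", "Singapore", "Vietnam", "Japan", "Vietnam", "Vietnam", "Cambodia", "Vietnam", "Vietnam", "Iran", "China", "Germany"))

# Assumption: non-Vietnam Southeast Asian countries seen in the sample only
other_sea <- c("Singapore", "Cambodia")

result <- df %>%
  # Assign each row to one of the three categories
  mutate(location = case_when(
    country == "Vietnam"   ~ "Vietnam",
    country %in% other_sea ~ "SEAsian",
    TRUE                   ~ "Non-local"
  )) %>%
  # count() groups, tallies and ungroups in one step
  count(thing_code, year, location, name = "freq") %>%
  # Proportion of each location within its thing_code and year
  group_by(thing_code, year) %>%
  mutate(percentage = freq / sum(freq)) %>%
  ungroup()

# Sanity check: the proportions in every thing_code/year group should sum to 1
check <- result %>%
  group_by(thing_code, year) %>%
  summarise(total = sum(percentage), .groups = "drop")

print(result)
print(check)
```

Ending with `ungroup()` avoids carrying the grouping into later steps. The `check` table is a cheap way to confirm that no rows were dropped or double-counted.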

Solution 1
0 Accepted 2020-08-13 13:57:27
