
R conditional lookup and sum

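I have data on college course completions: for each cohort, the estimated number of students completing after 1, 2, 3, ... 7 years. I want to use these estimates to calculate the total number of students output by each college and course in any given year.

The student output for a given year is the sum over the previous 7 cohorts, taking each cohort's output after 1, 2, 3, ... 7 years respectively.

For example, the number of students output in 2014 by College 1, Course A equals: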

Output of 2013 cohort (College 1, Course A) after 1 year +
Output of 2012 cohort (College 1, Course A) after 2 years +
Output of 2011 cohort (College 1, Course A) after 3 years +
Output of 2010 cohort (College 1, Course A) after 4 years +
Output of 2009 cohort (College 1, Course A) after 5 years +
Output of 2008 cohort (College 1, Course A) after 6 years +
Output of 2007 cohort (College 1, Course A) after 7 years
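The sum above can be sketched as a small helper. This is only an illustration: the function name `yearly_output` and the toy table are hypothetical, and it assumes a lookup table with the columns `cohort`, `college`, `course`, `output.year` and `output` used in the sample data.

```r
# Hypothetical helper: total output in `year` for one college/course.
# A row contributes when its cohort started j years before `year`
# and its output.year equals j, for j = 1..7.
yearly_output <- function(lookup, year, college, course) {
  keep <- lookup$college == college &
    lookup$course == course &
    lookup$output.year %in% 1:7 &
    lookup$cohort + lookup$output.year == year
  sum(lookup$output[keep])
}

# Toy lookup table (made-up values): two cohorts, each outputting
# 10 students in every one of the 7 years after intake
toy <- expand.grid(cohort = 2012:2013, output.year = 1:7)
toy$college <- "College 1"
toy$course <- "Course A"
toy$output <- 10

# 2014 picks up the 2013 cohort after 1 year and the 2012 cohort after 2 years
yearly_output(toy, 2014, "College 1", "Course A")
```

Testing `cohort + output.year == year` is the same condition as `cohort == year - j` with `output.year == j`, written without a loop.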

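So there are two data frames: a lookup table containing all the output estimates, and a smaller summary table that I am trying to modify. I want to update dummy.summary$output with the total output for each row, based on the calculation above.

The following code replicates my data pretty well: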

# Lookup table
dummy.lookup <- data.frame(cohort = rep(1998:2014, each = 210),
           college = rep(rep(paste("College", 1:6), each = 35), 17),
           course = rep(rep(paste("Course", LETTERS[1:5]), each = 7),102),
           intake = rep(sample(x = 150:300, size = 510, replace=TRUE), each = 7),
           output.year = rep(1:7, 510),
           output = sample(x = 10:20, size = 3570, replace=TRUE))


# Summary table to be modified
dummy.summary <- aggregate(x = dummy.lookup["intake"], by = list(dummy.lookup$cohort, dummy.lookup$college, dummy.lookup$course), FUN = mean)
names(dummy.summary)[1:3] <- c("year", "college", "course")
dummy.summary <- dummy.summary[order(dummy.summary$year, dummy.summary$college, dummy.summary$course), ]
dummy.summary$output <- 0

The following code doesn't work, but it shows the approach I have been trying.

dummy.summary$output <- sapply(dummy.summary$output, function(x) {

    # empty vector to fill with output values
    vec <- c()

    # Find relevant output for college + course, from each cohort and exit year
    for(j in 1:7){

      append(x = vec,
             values = dummy.lookup[dummy.lookup$college==dummy.summary[x, "college"] &
                                     dummy.lookup$course==dummy.summary[x, "course"] &
                                     dummy.lookup$cohort==dummy.summary[x, "year"]-j &
                                     dummy.lookup$output.year==j, "output"])

    }

    # Sum and return total output
    sum_vec <- sum(vec)

    return(sum_vec)

  }
    )

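I'm guessing it doesn't work because I was hoping to use 'x' inside the anonymous function to index specific values of the dummy.summary data frame. That is clearly not happening, and it just returns zero for every row, presumably because the starting value of 'x' is zero each time. I don't know whether it is possible to access the index position of each value that sapply loops over, and to use that to index my summary data frame.

Is this approach fixable, or do I need a completely different approach?

Even if it is fixable, is there a more elegant/faster way to achieve what I'm trying to do?

Thanks in anticipation.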

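I've just turned your output.year into output.year2, which, instead of values from 1 to 7, holds a calendar year computed from your cohort (output.year + cohort).

I've realised that the output info you want corresponds to output.year, but the intake info you want corresponds to cohort. So I calculate them separately and then join the tables. This automatically creates empty output info for 1998 (NA, which is converted to 0).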
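To make the idea concrete before the full code, here is a minimal base R sketch on a made-up toy table (the values and the names `toy`, `out`, `int` and `res` are hypothetical). It shifts each row to the calendar year in which the output occurs, sums output by that year, and joins the result back to the per-cohort intake.

```r
# Toy lookup table (made-up values): one college/course, cohorts 1998-1999,
# with only two output years each to keep the example small
toy <- data.frame(cohort = rep(1998:1999, each = 2),
                  output.year = rep(1:2, 2),
                  intake = rep(c(200, 250), each = 2),
                  output = c(11, 12, 13, 14))

# Calendar year in which each output is produced
toy$year <- toy$cohort + toy$output.year

# Output summed by calendar year; intake kept once per cohort
out <- aggregate(output ~ year, data = toy, FUN = sum)
int <- unique(toy[, c("cohort", "intake")])

# Full join: cohort years with no output yet get NA, which is set to 0
res <- merge(int, out, by.x = "cohort", by.y = "year", all = TRUE)
res$output[is.na(res$output)] <- 0
res
```

Because the join is a full join, years that have output but no intake record (here 2000 and 2001) also appear, with NA intake. The same happens in the full data for years after the last cohort.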

# fix your random sampling
set.seed(24)  

# Lookup table
dummy.lookup <- data.frame(cohort = rep(1998:2014, each = 210),
                           college = rep(rep(paste("College", 1:6), each = 35), 17),
                           course = rep(rep(paste("Course", LETTERS[1:5]), each = 7),102),
                           intake = rep(sample(x = 150:300, size = 510, replace=TRUE), each = 7),
                           output.year = rep(1:7, 510),
                           output = sample(x = 10:20, size = 3570, replace=TRUE))


library(dplyr)


# create result table for output info
dt_output = 
  dummy.lookup %>%
  mutate(output.year2 = output.year+cohort) %>%     # update output.year to get a year value
  group_by(output.year2, college, course) %>%       # for each output year, college, course
  summarise(SumOutput = sum(output)) %>%            # calculate sum of output
  ungroup() %>%
  arrange(college,course,output.year2) %>%          # for visualisation purposes
  rename(cohort = output.year2)                     # rename column


# create result for intake info
dt_intake =
  dummy.lookup %>%
  select(cohort, college, course, intake) %>%     # select useful columns
  distinct()                                      # keep distinct rows/values


# join info
dt_intake %>% 
  full_join(dt_output, by=c("cohort","college","course")) %>%
  mutate(SumOutput = ifelse(is.na(SumOutput),0,SumOutput)) %>%
  arrange(college,course,cohort) %>%     # for visualisation purposes
  as_tibble()    # for printing purposes (tbl_df() is deprecated)


# Source: local data frame [720 x 5]
# 
# cohort   college   course intake SumOutput
# (int)    (fctr)   (fctr)  (int)     (dbl)
# 1    1998 College 1 Course A    194         0
# 2    1999 College 1 Course A    198        11
# 3    2000 College 1 Course A    223        29
# 4    2001 College 1 Course A    198        45
# 5    2002 College 1 Course A    289        62
# 6    2003 College 1 Course A    163        78
# 7    2004 College 1 Course A    211        74
# 8    2005 College 1 Course A    181       108
# 9    2006 College 1 Course A    277       101
# 10   2007 College 1 Course A    157       109
# ..    ...       ...      ...    ...       ...
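As for whether the original sapply() attempt can be repaired: a minimal sketch, assuming the same column names as above but using small stand-in tables (`lookup` and `smry` are hypothetical names with made-up values). Two changes are needed: iterate over row numbers rather than over the values of the output column, and assign the result of append() back to `vec`, since append() returns a new vector instead of modifying its argument.

```r
# Small stand-in tables (made-up data, same column names as the real ones)
lookup <- expand.grid(cohort = 2010:2013, output.year = 1:7,
                      college = "College 1", course = "Course A",
                      stringsAsFactors = FALSE)
lookup$output <- 1
smry <- data.frame(year = 2011:2014, college = "College 1",
                   course = "Course A", stringsAsFactors = FALSE)

# Loop over row indices so that `i` can be used to index the summary table
smry$output <- sapply(seq_len(nrow(smry)), function(i) {
  vec <- c()
  for (j in 1:7) {
    # Output of the cohort that started j years earlier, after j years
    vec <- append(vec,
                  lookup$output[lookup$college == smry$college[i] &
                                  lookup$course == smry$course[i] &
                                  lookup$cohort == smry$year[i] - j &
                                  lookup$output.year == j])
  }
  sum(vec)
})
smry
```

This scans the whole lookup table seven times per summary row, so the grouped approach above is likely to be faster on larger data.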
