R: Efficient way to sort/select values from one column that correspond to specific values from a different column within the same dataframe
I am working with a dataframe called Student_Majr2 with about 60K rows and two relevant columns: one is for an anonymized student ID number, the other is for the date/term the student declared their major (the first two below). The problem is that a large number of the students change their major, so for each student ID there may be more than one associated date. There are about 30,000 unique student IDs. My goal is to create a new dataframe that only has the most recent major declaration date (i.e. their final choice of major) for each student ID. Here is the structure of the data frame:
'data.frame': 59749 obs. of 5 variables:
$ studentID : int 1 2 2 2 4 4 5 6 8 8 ...
$ SGBSTDN_TERM_CODE_EFF : int 199920 199920 200040 200320 200130 200220 200140 200020 200430 200540 ...
$ SGBSTDN_MAJR_CODE_1 : chr "720" "966" "996" "906" ...
$ SGBSTDN_MAJR_CODE_CONC_1: chr "" "" "" "" ...
$ SGBSTDN_LEVL_CODE : chr "UG" "UG" "UG" "UG" ...
I have created the script below to accomplish this goal, and it works. However, it is also very inefficient and took several hours to run on a PC with a Core i5 processor running Windows 8.1, using RStudio and R version 3.1.1. (I'm actually not sure how long it took; I went to bed after a couple of hours and it had finished by morning, seven hours later.)
I am convinced there is a more efficient way to perform this operation so I don't have to keep running scripts like these while I sleep, but I can't figure out what it is. I would greatly appreciate any advice and assistance.
library(dplyr)
final_majr <- data.frame() # the final dataframe with final major per student ID
tbl_df(final_majr)
students <- unique(Student_Majr2$studentID) # students gets vector with all unique student ids
for (i in students) { # loop through all student id numbers
  temp_majr <- data.frame() # set up temporary dataframe for each unique student id and major
  tbl_df(temp_majr)
  for (q in 1:nrow(Student_Majr2)) { # loop through all row numbers from student_major df
    if (Student_Majr2$studentID[q] == i) { # identify rows for each student ID from top loop
      temp_majr <- rbind(temp_majr, Student_Majr2[q, ]) # and add to temp_majr df
    }
  }
  temp_majr <- arrange(temp_majr, SGBSTDN_TERM_CODE_EFF) # order the rows using dplyr package
  m <- nrow(temp_majr) # m gets the total number of rows in temp_majr
  final_majr <- rbind(final_majr, temp_majr[m, ]) # and here we add the bottom row to final_majr
}
Many thanks for any and all help with this script. I regularly consult stackoverflow for help with programming and this is my first question/post. Thanks for any feedback on how I can make my questions easier to understand and answer.
A base R solution. You can order() the data and then use duplicated() to select the rows that you want.
# some data
dat <- data.frame(studentID = c(1, 2, 2, 2, 4, 4, 5, 6, 8, 8),
SGBSTDN_TERM_CODE_EFF = c(199920, 199920, 200040, 200320, 200130, 200220, 200140, 200020, 200430, 200540),
SGBSTDN_MAJR_CODE_1 = letters[1:10])
# order data by id and latest date first
dat <- with(dat, dat[order(studentID, -SGBSTDN_TERM_CODE_EFF), ])
# select first observation
with(dat, dat[!duplicated(studentID), ])
# studentID SGBSTDN_TERM_CODE_EFF SGBSTDN_MAJR_CODE_1
# 1 1 199920 a
# 4 2 200320 d
# 6 4 200220 f
# 7 5 200140 g
# 8 6 200020 h
# 10 8 200540 j
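As a quick sanity check on the order/duplicated approach, the sketch below wraps it in a small helper and compares the result against the per-student maximum computed with tapply(). The helper name latest_per_student and the toy data are assumptions made for illustration; they are not part of the original answer.

```r
# Hypothetical helper (the name is an assumption): keep, for each student ID,
# the row with the highest term code.
latest_per_student <- function(d) {
  # sort by ID, then by term code descending, so the latest row comes first
  d <- d[order(d$studentID, -d$SGBSTDN_TERM_CODE_EFF), ]
  # duplicated() flags every row after the first one for each ID; drop those
  d[!duplicated(d$studentID), ]
}

# Toy data in the same shape as the question's data frame
dat <- data.frame(
  studentID = c(1, 2, 2, 2, 4, 4, 5, 6, 8, 8),
  SGBSTDN_TERM_CODE_EFF = c(199920, 199920, 200040, 200320, 200130,
                            200220, 200140, 200020, 200430, 200540),
  SGBSTDN_MAJR_CODE_1 = letters[1:10]
)

res <- latest_per_student(dat)

# Cross-check: each kept term code should equal that student's maximum
expected <- tapply(dat$SGBSTDN_TERM_CODE_EFF, dat$studentID, max)
stopifnot(all(res$SGBSTDN_TERM_CODE_EFF == as.vector(expected)))
```

Because this sorts the whole data frame once and then makes a single vectorized pass, it avoids the nested loop and the repeated rbind() calls in the question's script.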
If you want to select, for each studentID, the row that has the highest SGBSTDN_TERM_CODE_EFF, you could do so using dplyr:
library(dplyr)
# df is a placeholder for the data frame (Student_Majr2 in the question)
df %>%
  group_by(studentID) %>%
  arrange(SGBSTDN_TERM_CODE_EFF) %>%
  slice(n())
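To try the dplyr approach without the real data, here is a minimal sketch on toy data shaped like the question's data frame. It runs the grouped arrange/slice pipeline and, as a cross-check, a variant using slice_max(), which assumes dplyr 1.0.0 or later. The toy data and the variable names last_by_arrange and last_by_slice_max are assumptions for illustration.

```r
library(dplyr)

# Toy data in the same shape as the question's data frame
dat <- data.frame(
  studentID = c(1, 2, 2, 2, 4, 4, 5, 6, 8, 8),
  SGBSTDN_TERM_CODE_EFF = c(199920, 199920, 200040, 200320, 200130,
                            200220, 200140, 200020, 200430, 200540),
  SGBSTDN_MAJR_CODE_1 = letters[1:10]
)

# Variant 1: sort by term code, then take the last row of each group
last_by_arrange <- dat %>%
  group_by(studentID) %>%
  arrange(SGBSTDN_TERM_CODE_EFF) %>%
  slice(n()) %>%
  ungroup()

# Variant 2 (assumes dplyr >= 1.0.0): take the row with the maximum term
# code in each group; with_ties = FALSE keeps exactly one row per student
last_by_slice_max <- dat %>%
  group_by(studentID) %>%
  slice_max(SGBSTDN_TERM_CODE_EFF, n = 1, with_ties = FALSE) %>%
  ungroup()

# Both variants should agree on the selected term code per student
stopifnot(all(last_by_arrange$SGBSTDN_TERM_CODE_EFF ==
              last_by_slice_max$SGBSTDN_TERM_CODE_EFF))
```

The ungroup() at the end is a design choice so that later operations on the result are not silently performed per student.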