R: Efficient way to sort/select values from one column that correspond to specific values from a different column within the same dataframe

I am working with a dataframe called Student_Majr2 with about 60K rows and two relevant columns: one holds an anonymized student ID number, the other holds the date/term in which the student declared their major (the first two columns below). The problem is that many students change their major, so each student ID may have more than one associated date. There are about 30,000 unique student IDs. My goal is to create a new dataframe that has only the most recent major declaration date (i.e. the final choice of major) for each student ID. Here is the structure of the data frame:

'data.frame':   59749 obs. of  5 variables:
 $ studentID               : int  1 2 2 2 4 4 5 6 8 8 ...
 $ SGBSTDN_TERM_CODE_EFF   : int  199920 199920 200040 200320 200130 200220 200140 200020 200430 200540 ...
 $ SGBSTDN_MAJR_CODE_1     : chr  "720" "966" "996" "906" ...
 $ SGBSTDN_MAJR_CODE_CONC_1: chr  "" "" "" "" ...
 $ SGBSTDN_LEVL_CODE       : chr  "UG" "UG" "UG" "UG" ...

I wrote the script below to accomplish this goal, and it works. However, it is also very inefficient: it took several hours to run on a PC with a Core i5 processor running Windows 8.1, using RStudio and R version 3.1.1. (I'm actually not sure how long it took; I went to bed after a couple of hours and it had finished by morning, seven hours later.)

I am convinced there is a more efficient way to perform this operation so that I don't have to keep running scripts like this while I sleep, but I can't figure out what it is. I would greatly appreciate any advice and assistance.

library(dplyr)
final_majr <- data.frame() # the final dataframe with final major per student ID
tbl_df(final_majr)
students <- unique(Student_Majr2$studentID) #students gets vector with all unique student ids
for (i in students) { #loop through all student id numbers
        temp_majr <- data.frame() #set up temporary dataframe for each unique student id and major
        tbl_df(temp_majr)

                for (q in 1:nrow(Student_Majr2)) { #loop through all row numbers from student_major df
                        if (Student_Majr2$studentID[q] == i){ #identify rows for each student ID from top loop 
                                temp_majr <- rbind(temp_majr, Student_Majr2[q, ]) #and add to temp_majr df
                        }
                }
        temp_majr <- arrange(temp_majr, SGBSTDN_TERM_CODE_EFF) #order the rows using dplyr package
        m <- nrow(temp_majr) # m gets the total number of rows in temp_majr
        final_majr <- rbind(final_majr, temp_majr[m, ]) #and here we add the bottom row to final_majr
}

Many thanks for any and all help with this script. I regularly consult Stack Overflow for help with programming, and this is my first question/post. Thanks for any feedback on how I can make my questions easier to understand and answer.

A base R solution. You can sort the data with order and then use duplicated to select the rows that you want.

# some data
dat <- data.frame(studentID = c(1, 2, 2, 2, 4, 4, 5, 6, 8, 8),
                  SGBSTDN_TERM_CODE_EFF = c(199920, 199920, 200040, 200320, 200130, 200220, 200140, 200020, 200430, 200540),
                  SGBSTDN_MAJR_CODE_1 = letters[1:10])

# order data by id and latest date first
dat <- with(dat, dat[order(studentID, -SGBSTDN_TERM_CODE_EFF), ])

# select first observation
with(dat, dat[!duplicated(studentID), ])
# studentID SGBSTDN_TERM_CODE_EFF SGBSTDN_MAJR_CODE_1
# 1          1                199920                   a
# 4          2                200320                   d
# 6          4                200220                   f
# 7          5                200140                   g
# 8          6                200020                   h
# 10         8                200540                   j
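
To check the same approach at roughly the scale described in the question, here is a minimal sketch on synthetic data. The helper name latest_per_student() and the simulated data frame are assumptions made for illustration; only the column names come from the question.

```r
# Hypothetical helper: keep the row with the latest term code per student.
latest_per_student <- function(d) {
  # sort by student, then by term code descending (latest first)
  d <- d[order(d$studentID, -d$SGBSTDN_TERM_CODE_EFF), ]
  # duplicated() flags every row after the first one for each student
  d[!duplicated(d$studentID), ]
}

# Synthetic data (assumed, not the asker's): 60K rows, at most 30K student IDs
set.seed(1)
n <- 60000
sim <- data.frame(
  studentID = sample(30000, n, replace = TRUE),
  SGBSTDN_TERM_CODE_EFF = sample(199910:200540, n, replace = TRUE)
)

res <- latest_per_student(sim)

# one row per distinct student ID
nrow(res) == length(unique(sim$studentID))
```

Both order and duplicated are vectorized, so the data frame is sorted once and filtered once instead of being scanned in full for every student ID.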

If you want to select, for each studentID, the row that has the highest SGBSTDN_TERM_CODE_EFF, you could use dplyr:

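library(dplyr)
# df stands for the data frame, e.g. Student_Majr2
df %>%
  group_by(studentID) %>%
  arrange(SGBSTDN_TERM_CODE_EFF) %>%  # ascending, so the latest term comes last
  slice(n())                          # keep the last row of each group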
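
The dplyr snippet assumes a data frame named df already exists. A self-contained sketch, reusing the sample data from the base R answer, could look like the following. It swaps in slice_max() (available in dplyr 1.0.0 and later) instead of arrange() plus slice(n()); the with_ties = FALSE setting is my assumption, chosen so that exactly one row per student is returned even if a term code repeats.

```r
library(dplyr)

# Sample data copied from the base R answer
df <- data.frame(
  studentID = c(1, 2, 2, 2, 4, 4, 5, 6, 8, 8),
  SGBSTDN_TERM_CODE_EFF = c(199920, 199920, 200040, 200320, 200130,
                            200220, 200140, 200020, 200430, 200540),
  SGBSTDN_MAJR_CODE_1 = letters[1:10]
)

# For each student, keep the single row with the highest term code
final_majr <- df %>%
  group_by(studentID) %>%
  slice_max(SGBSTDN_TERM_CODE_EFF, n = 1, with_ties = FALSE) %>%
  ungroup()

final_majr
```

On this sample the result has one row for each of the six students, with the same term codes as the base R output.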
