
How to select rows with certain values within a group in R

I am teaching myself loops and functions in R (but am at a really basic level at the moment). For a recent study, I need to prepare my data as follows:

I have a data set that looks like this:

dd <- read.table(text="
    event.timeline.ys     ID     year    group
1                   2     800033 2008    A
2                   1     800033 2009    A   
3                   0     800033 2010    A   
4                  -1     800033 2011    A   
5                  -2     800033 2012    A   
15                  0     800076 2008    B
16                 -1     800076 2009    B
17                  5     800100 2014    C     
18                  4     800100 2015    C   
19                  2     800100 2017    C   
20                  1     800100 2018    C   
30                  0     800125 2008    A    
31                 -1     800125 2009    A    
32                 -2     800125 2010    A", header=TRUE)

For each person, I would like to keep only the last row with event.timeline.ys >= 0 (this would be row 3 for ID 800033) and the first row with event.timeline.ys < 0 (this would be row 4 for ID 800033). All other rows would be deleted. My final data frame should therefore contain at most two rows per ID.

The person with ID 800100 does not have any negative values of event.timeline.ys. In this case, I would like to keep only the last row with event.timeline.ys >= 0.

The final data set would then look like this:

    event.timeline.ys     ID     year    group  
3                   0     800033 2010    A   
4                  -1     800033 2011    A      
15                  0     800076 2008    B
16                 -1     800076 2009    B 
20                  1     800100 2018    C   
30                  0     800125 2008    A    
31                 -1     800125 2009    A    

I thought about using a for-loop to find, within each ID, the last row with event.timeline.ys >= 0 and the first row with event.timeline.ys < 0. However, my attempts to implement this in R have failed.

Does anyone have any good advice? I am also very open to other solutions that are not based on for-loops or the like.
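The for-loop idea described in the question can be written out directly in base R. The following is only a minimal sketch, assuming the rows within each ID are already ordered by year as in the example data; the helper names keep, rows, ys, nonneg and neg are made up for illustration.

```r
# Example data from the question; the unnamed first column becomes the row names
dd <- read.table(text = "
    event.timeline.ys     ID     year    group
1                   2     800033 2008    A
2                   1     800033 2009    A
3                   0     800033 2010    A
4                  -1     800033 2011    A
5                  -2     800033 2012    A
15                  0     800076 2008    B
16                 -1     800076 2009    B
17                  5     800100 2014    C
18                  4     800100 2015    C
19                  2     800100 2017    C
20                  1     800100 2018    C
30                  0     800125 2008    A
31                 -1     800125 2009    A
32                 -2     800125 2010    A", header = TRUE)

keep <- integer(0)                      # row positions to retain

for (id in unique(dd$ID)) {
  rows   <- which(dd$ID == id)          # positions of this ID's rows
  ys     <- dd$event.timeline.ys[rows]
  nonneg <- rows[ys >= 0]               # positions with a non-negative value
  neg    <- rows[ys < 0]                # positions with a negative value
  # tail()/head() return an empty vector when there is nothing to pick,
  # so an ID without negative values contributes only one row
  keep <- c(keep, tail(nonneg, 1), head(neg, 1))
}

result <- dd[keep, ]
print(result)
```

Collecting positions first and subsetting once at the end avoids growing a data frame inside the loop and keeps the original row names.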

Here's one option making use of group_by in dplyr:

library(dplyr)

dd %>% group_by(ID, category = event.timeline.ys >= 0) %>% 
  # within each ID and sign category, keep the value closest to zero
  filter(abs(event.timeline.ys) == min(abs(event.timeline.ys))) %>% 
  ungroup() %>%  # a grouping column cannot be dropped while still grouped
  dplyr::select(-category) %>%
  as.data.frame

  event.timeline.ys     ID year group
1                 0 800033 2010     A
2                -1 800033 2011     A
3                 0 800076 2008     B
4                -1 800076 2009     B
5                 1 800100 2018     C
6                 0 800125 2008     A
7                -1 800125 2009     A

Here's a way to extract the indexes of the rows you are interested in with which() and row_number():

library(dplyr)

dd %>% 
  group_by(ID) %>% 
  filter(row_number() == last(which(event.timeline.ys >= 0)) | 
         row_number() == first(which(event.timeline.ys < 0)))

I think it has the benefit of reading much like the way you described what you are after in words, so hopefully it makes sense.
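It may not be obvious why this filter still works for an ID with no negative values, such as 800100. A small sketch of what happens inside one such group, using a hypothetical vector ys in place of the grouped column and seq_along() in place of row_number():

```r
library(dplyr)

ys <- c(5, 4, 2, 1)                  # like ID 800100: no negative values

neg_idx <- which(ys < 0)             # empty integer vector
first_neg <- first(neg_idx)          # dplyr::first() gives NA for an empty vector
last_nonneg <- last(which(ys >= 0))  # position of the last non-negative value

# Comparing with NA yields NA, and `|` still returns TRUE when one side is TRUE
keep <- seq_along(ys) == last_nonneg | seq_along(ys) == first_neg
print(keep)
```

filter() keeps only rows where the condition is TRUE and drops rows where it is NA, so only the last non-negative row survives for such a group.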

Group by ID and by whether event.timeline.ys is negative. If it's negative, select (slice) the first row; otherwise select the last (i.e. row n()).

library(dplyr)

dd %>% 
  mutate(neg = event.timeline.ys < 0) %>% 
  group_by(ID, neg) %>% 
  slice(if(neg[1]) 1 else n()) %>% 
  ungroup %>% 
  select(-neg)

# # A tibble: 7 x 4
#   event.timeline.ys     ID  year group
#               <int>  <int> <int> <fct>
# 1                 0 800033  2010 A    
# 2                -1 800033  2011 A    
# 3                 0 800076  2008 B    
# 4                -1 800076  2009 B    
# 5                 1 800100  2018 C    
# 6                 0 800125  2008 A    
# 7                -1 800125  2009 A   

Here is a way to do this in data.table:

library(data.table)
as.data.table(dd)[, .SD[c(last(which(event.timeline.ys >= 0)),
                          first(which(event.timeline.ys < 0)))],
                  by=ID]


       ID event.timeline.ys year group
1: 800033                 0 2010     A
2: 800033                -1 2011     A
3: 800076                 0 2008     B
4: 800076                -1 2009     B
5: 800100                 1 2018     C
6: 800125                 0 2008     A
7: 800125                -1 2009     A
