How to select rows with certain values within a group in R
I am teaching myself loops and functions in R (but am at a really basic level at the moment). For a recent study, I need to prepare my data as follows:
I have a data set that looks like this:
dd <- read.table(text="
event.timeline.ys ID year group
1 2 800033 2008 A
2 1 800033 2009 A
3 0 800033 2010 A
4 -1 800033 2011 A
5 -2 800033 2012 A
15 0 800076 2008 B
16 -1 800076 2009 B
17 5 800100 2014 C
18 4 800100 2015 C
19 2 800100 2017 C
20 1 800100 2018 C
30 0 800125 2008 A
31 -1 800125 2009 A
32 -2 800125 2010 A", header=TRUE)
For each person, I would like to keep only the last row with event.timeline.ys >= 0 (this would be row 3 for ID 800033) and the first row with event.timeline.ys < 0 (this would be row 4 for ID 800033). All other rows would be deleted. My final data frame should therefore contain at most two rows per ID.
The person with ID = 800100 does not have any negative values in event.timeline.ys. In this case, I would like to keep only the last row with event.timeline.ys >= 0.
The final data set would then look like this:
event.timeline.ys ID year group
3 0 800033 2010 A
4 -1 800033 2011 A
15 0 800076 2008 B
16 -1 800076 2009 B
20 1 800100 2018 C
30 0 800125 2008 A
31 -1 800125 2009 A
I thought about using a for-loop to check, within each ID, which row is the last one with event.timeline.ys >= 0 and which is the first one with event.timeline.ys < 0. However, my attempts to implement this in R have failed.
Does anyone have any smart advice? I am also very open to other solutions that are not based on for-loops or similar constructs.
Here's one option making use of group_by in dplyr:
library(dplyr)
# Group by ID and by sign class, then keep the row closest to zero in each group
dd %>% group_by(ID, category = event.timeline.ys >= 0) %>%
filter(abs(event.timeline.ys) == min(abs(event.timeline.ys))) %>%
# ungroup first; otherwise dplyr adds the grouping column back after select()
ungroup() %>%
dplyr::select(-category) %>%
as.data.frame
event.timeline.ys ID year group
1 0 800033 2010 A
2 -1 800033 2011 A
3 0 800076 2008 B
4 -1 800076 2009 B
5 1 800100 2018 C
6 0 800125 2008 A
7 -1 800125 2009 A
Here's a way to extract the indexes of the rows you are interested in with which() and row_number():
library(dplyr)
dd %>%
group_by(ID) %>%
filter(row_number() == last(which(event.timeline.ys >= 0)) |
row_number() == first(which(event.timeline.ys < 0)))
I think it has the benefit of reading much like the way you described what you are after in words, so hopefully it makes sense.
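To make the indexing concrete, here is a minimal base R sketch of what which() returns inside one group. The vectors are copied by hand from the question's data for IDs 800033 and 800100, and base tail()/head() stand in for dplyr's last()/first() so that the snippet runs without extra packages; the variable names are only illustrative.

```r
# event.timeline.ys values for ID 800033 and for ID 800100 (no negatives),
# copied from the question's data
ys_033 <- c(2, 1, 0, -1, -2)
ys_100 <- c(5, 4, 2, 1)

# Positions of the non-negative values, and the last of those positions
nonneg_033 <- which(ys_033 >= 0)
last_nonneg_033 <- tail(nonneg_033, 1)

# Positions of the negative values, and the first of those positions
neg_033 <- which(ys_033 < 0)
first_neg_033 <- head(neg_033, 1)

# With no negative values, which() returns an empty integer vector
neg_100 <- which(ys_100 < 0)

print(last_nonneg_033)  # 3
print(first_neg_033)    # 4
print(length(neg_100))  # 0
```

For ID 800100 the second condition has nothing to match, so only the row picked by the first condition is kept, which is the behaviour the question asks for.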
Group by ID and by whether event.timeline.ys is negative. If it's negative, select (slice) the first row; otherwise select the last (i.e. row n()).
library(dplyr)
dd %>%
mutate(neg = event.timeline.ys < 0) %>%
group_by(ID, neg) %>%
slice(if(neg[1]) 1 else n()) %>%
ungroup %>%
select(-neg)
# # A tibble: 7 x 4
# event.timeline.ys ID year group
# <int> <int> <int> <fct>
# 1 0 800033 2010 A
# 2 -1 800033 2011 A
# 3 0 800076 2008 B
# 4 -1 800076 2009 B
# 5 1 800100 2018 C
# 6 0 800125 2008 A
# 7 -1 800125 2009 A
Here is a way to do this in data.table:
library(data.table)
as.data.table(dd)[, .SD[c(last(which(event.timeline.ys >= 0)),
first(which(event.timeline.ys < 0)))],
by=ID]
ID event.timeline.ys year group
1: 800033 0 2010 A
2: 800033 -1 2011 A
3: 800076 0 2008 B
4: 800076 -1 2009 B
5: 800100 1 2018 C
6: 800125 0 2008 A
7: 800125 -1 2009 A
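Since the question mentions wanting to practise loops and functions, the same idea can also be sketched in base R without any packages. This is only one possible sketch, not part of the answers above: keep_rows is a hypothetical helper name, and the approach assumes the rows within each ID are already in the order shown in the question.

```r
dd <- read.table(text="
event.timeline.ys ID year group
1 2 800033 2008 A
2 1 800033 2009 A
3 0 800033 2010 A
4 -1 800033 2011 A
5 -2 800033 2012 A
15 0 800076 2008 B
16 -1 800076 2009 B
17 5 800100 2014 C
18 4 800100 2015 C
19 2 800100 2017 C
20 1 800100 2018 C
30 0 800125 2008 A
31 -1 800125 2009 A
32 -2 800125 2010 A", header=TRUE)

# Hypothetical helper: for the rows of one ID, keep the last non-negative
# row and the first negative row (either may be absent)
keep_rows <- function(g) {
  last_nonneg <- tail(which(g$event.timeline.ys >= 0), 1)
  first_neg <- head(which(g$event.timeline.ys < 0), 1)
  g[c(last_nonneg, first_neg), ]
}

# Split into one data frame per ID, apply the helper to each piece,
# then stack the pieces back into a single data frame
pieces <- lapply(split(dd, dd$ID), keep_rows)
result <- do.call(rbind, pieces)
print(result)
```

lapply() plays the role of the for-loop here: it visits each ID's block of rows once and collects what the helper returns.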